Add TAG-Bench in agent_eval #230

Open · wants to merge 12 commits into main
8 changes: 8 additions & 0 deletions evals/evaluation/agent_eval/README.md
@@ -0,0 +1,8 @@
# Benchmarks for agentic applications
We collected two benchmarks for evaluating agentic applications:
1. [CRAG](./crag_eval/README.md) (Comprehensive RAG) benchmark for RAG agents
2. [TAG-Bench](./TAG-Bench/README.md) for SQL agents

These agent benchmarks are enabled on Intel Gaudi systems using vLLM as the LLM serving framework. You can also choose to serve the models on other hardware with vLLM.

We will add more benchmarks for agents in the future. Stay tuned.
131 changes: 131 additions & 0 deletions evals/evaluation/agent_eval/TAG-Bench/README.md
@@ -0,0 +1,131 @@
# TAG-Bench for evaluating SQL agents
## Overview of TAG-Bench
[TAG-Bench](https://github.com/TAG-Research/TAG-Bench) is a benchmark published in 2024 by Stanford University and the University of California, Berkeley, and advocated by Databricks, for evaluating GenAI systems on challenging questions over SQL databases. The questions in TAG-Bench require a GenAI system not only to translate natural language queries into SQL, but also to combine information from other sources and perform reasoning. There are 80 questions in total, 20 in each of the sub-categories: match-based, comparison, ranking, and aggregation queries. The questions cover 5 databases selected from Alibaba's [BIRD](https://bird-bench.github.io/) Text2SQL benchmark: california_schools, debit_card_specializing, formula_1, codebase_community, and european_football_2. For more information, please read the [paper](https://arxiv.org/pdf/2408.14717).
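
If you want a quick look at the raw questions before running anything, the short sketch below (not part of this PR) loads `tag_queries.csv` with pandas and counts questions per database. The file name and the `DB used` column are the ones used by the preprocessing script in this PR; everything else is illustrative.
```python
# Illustrative only: inspect the TAG-Bench queries after cloning the dataset
# into $WORKDIR/TAG-Bench (see "Getting started" below).
import os

import pandas as pd

queries_path = os.path.join(os.environ["WORKDIR"], "TAG-Bench", "tag_queries.csv")
df = pd.read_csv(queries_path)

print("Total questions:", len(df))
print("Questions per database:")
print(df["DB used"].value_counts())  # "DB used" is the column split_data.py groups by
```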

## Getting started
1. Set up the environment
```bash
export WORKDIR=<your-work-directory>
mkdir $WORKDIR/hf_cache
export HF_CACHE_DIR=$WORKDIR/hf_cache
export HF_HOME=$HF_CACHE_DIR
export HF_TOKEN=<your-huggingface-api-token>
export HUGGINGFACEHUB_API_TOKEN=$HF_TOKEN
export PYTHONPATH=$PYTHONPATH:$WORKDIR/GenAIEval/
```
2. Download this repo in your work directory
```bash
cd $WORKDIR
git clone https://github.com/opea-project/GenAIEval.git
```
3. Create a conda environment
```bash
conda create -n agent-eval-env python=3.10
conda activate agent-eval-env
pip install -r $WORKDIR/GenAIEval/evals/evaluation/agent_eval/docker/requirements.txt
```
4. Download data
```bash
cd $WORKDIR
git clone https://github.com/TAG-Research/TAG-Bench.git
cd TAG-Bench/setup
chmod +x get_dbs.sh
./get_dbs.sh
```
5. Preprocess data
```bash
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/TAG-Bench/preprocess_data/
bash run_data_split.sh
```
6. Generate hints file for each database in TAG-Bench
```bash
python3 generate_hints.py
```
The hints are generated from the description files that come with the TAG-Bench dataset; they are simply the column descriptions provided in the dataset. They can be used by the SQL agent to improve performance; a sketch of how an agent might consume them is shown below.
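
The sketch below (not part of the benchmark scripts) loads a generated `<db_name>_hints.csv` and formats the rows into a prompt-style preamble. The column names match what `generate_hints.py` writes; the prompt wording is only an example.
```python
# Illustrative only: turn the generated hints into text an agent could prepend
# to its system prompt. Columns table_name, column_name, description are the
# ones written by generate_hints.py.
import os

import pandas as pd

db_name = "california_schools"
hints_path = os.path.join(os.environ["WORKDIR"], "TAG-Bench", f"{db_name}_hints.csv")
hints = pd.read_csv(hints_path)

hint_lines = [
    f"- {row.table_name}.{row.column_name}: {row.description}"
    for row in hints.itertuples(index=False)
]
print("Column hints:\n" + "\n".join(hint_lines[:5]))  # show the first few hints
```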

7. Launch the LLM endpoint on Gaudi.

This LLM will be used by the agent and will also serve as the LLM judge when scoring the agent's answers. By default, the `meta-llama/Meta-Llama-3.1-70B-Instruct` model is served using 4 Gaudi cards.
```bash
# First build vllm image for Gaudi
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/vllm-gaudi
bash build_image.sh
```
Then launch the vLLM endpoint with the command below.
```bash
bash launch_vllm_gaudi.sh
```

8. Validate that the vLLM endpoint is working properly. A quick manual check is also sketched below the command.
```bash
python3 test_vllm_endpoint.py
```
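
If you prefer a manual check, a rough equivalent of `test_vllm_endpoint.py` is sketched below. It assumes vLLM serves an OpenAI-compatible API on port 8085 (the port used by the scripts in this PR) on the local host.
```python
# Manual sanity check of the vLLM endpoint; roughly what test_vllm_endpoint.py does.
# Assumes the OpenAI-compatible server is reachable at localhost:8085.
import requests

url = "http://localhost:8085/v1/chat/completions"
payload = {
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Say 'hello' and nothing else."}],
    "max_tokens": 16,
}
resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```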

## Launch your SQL agent
You can create and launch your own SQL agent. Here we show the OPEA `sql_agent_llama` as an example; follow the steps below to launch it.
1. Download OPEA GenAIComps repo
```bash
cd $WORKDIR
git clone https://github.com/opea-project/GenAIComps.git
```
2. Build docker image for OPEA agent
```bash
cd $WORKDIR/GenAIComps
export agent_image="opea/agent:comps"
docker build --no-cache -t $agent_image --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -f comps/agent/src/Dockerfile .
```
3. Set up the environment for the agent's `search_web` tool.
```bash
export GOOGLE_CSE_ID=<your-GOOGLE_CSE_ID>
export GOOGLE_API_KEY=<your-GOOGLE_API_KEY>
```
For instructions on how to obtain your `GOOGLE_CSE_ID` and `GOOGLE_API_KEY`, refer to the LangChain Google Search documentation [here](https://python.langchain.com/docs/integrations/tools/google_search/). You can optionally verify the credentials with the sketch below.
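
The short check below uses the same `GoogleSearchAPIWrapper` as the `search_web` tool in this PR; it assumes the required Google client packages are installed and the two environment variables above are exported.
```python
# Optional: verify the Google Programmable Search credentials before launching the agent.
from langchain_google_community import GoogleSearchAPIWrapper

search = GoogleSearchAPIWrapper()  # reads GOOGLE_CSE_ID and GOOGLE_API_KEY from the environment
print(search.run("What year was the first Formula 1 season?")[:300])
```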

4. Launch SQL agent
```bash
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/TAG-Bench/opea_sql_agent_llama
bash launch_sql_agent.sh california_schools
```
The command above launches a SQL agent that interacts with the `california_schools` database. We also provide a script to run the benchmark on all databases. Once the container is up, you can send it a quick smoke-test query as sketched below.
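
The sketch below assumes the agent exposes an OpenAI-style `/v1/chat/completions` route on port 9095 (the port mapped in `sql_agent_llama.yaml`); check the GenAIComps agent documentation if the request schema differs in your version.
```python
# Smoke test of the running SQL agent container (assumed OpenAI-style route on port 9095).
import requests

url = "http://localhost:9095/v1/chat/completions"
payload = {
    "messages": [
        {"role": "user", "content": "How many schools are there in the California schools database?"}
    ]
}
resp = requests.post(url, json=payload, timeout=300)
print(resp.status_code)
print(resp.text[:500])  # print the beginning of the agent's response
```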

## Run the benchmark
1. Generate answers
```bash
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/TAG-Bench/run_benchmark
bash run_generate_answer.sh california_schools
```
2. Grade the answers
```bash
bash run_grading.sh california_schools
```
Here we use the ragas `answer_correctness` metric to measure performance. By default, we use `meta-llama/Meta-Llama-3.1-70B-Instruct` as the LLM judge, reusing the vLLM endpoint launched earlier (see [Getting started](#getting-started)). A rough standalone sketch of the metric computation is shown below.
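
The actual grading is done by `run_grading.sh` through this repo's evaluation utilities; the sketch below only illustrates the ragas `answer_correctness` call. The embedding model and the sample question/answer pair are purely illustrative, and the column names follow the classic ragas schema (newer ragas releases may use different names).
```python
# Illustrative standalone ragas answer_correctness computation against the vLLM judge.
from datasets import Dataset
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import answer_correctness

# LLM judge served by the vLLM endpoint from "Getting started" (port 8085).
llm = ChatOpenAI(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    base_url="http://localhost:8085/v1",
    api_key="EMPTY",
    temperature=0.0,
)
# Embedding model choice is illustrative; answer_correctness needs embeddings for its similarity part.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

data = Dataset.from_dict(
    {
        "question": ["<a TAG-Bench question>"],
        "answer": ["<the agent's answer>"],
        "ground_truth": ["<the golden answer>"],
    }
)
result = evaluate(data, metrics=[answer_correctness], llm=llm, embeddings=embeddings)
print(result)
```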

3. Run the benchmark on all databases

If you want to run all 80 questions spanning the 5 databases, run the command below.
```bash
bash run_all_databases.sh
```
This script iterates over the databases, generating and then grading answers for the questions on each one.

## Benchmark results
We tested OPEA `sql_agent_llama` on all 80 questions in TAG-Bench.

Human grading criteria:
- Score 1: exact match with golden answer
- Score 0.7: match with golden answer except for the ordering of the entities
- Score 0.5: contains only part of the information in the golden answer, and does not contain information absent from the golden answer
- Score 0: otherwise

|Database|Average human score|Average ragas `answer_correctness`|
|--------|-------------------|----------------------------------|
|california_schools|0.264|0.415|
|codebase_community|0.262|0.404|
|debit_card_specializing|0.75|0.753|
|formula_1|0.389|0.596|
|european_football_2|0.25|0.666|
|**Overall Average (ours)**|0.31 (0.28 if strict exact match)|0.511|
|**Text2SQL (TAG-Bench paper)**|0.17||
|**Human performance (TAG-Bench paper)**|0.55||

We can see that our SQL agent achieved much higher accuracy than the Text2SQL baseline reported in the TAG-Bench paper, although it is still lower than human experts.
@@ -0,0 +1,38 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

#set -xe
echo $WORKDIR
EVALDIR=${WORKDIR}/GenAIEval/evals/evaluation/agent_eval/TAG-Bench

# agent vars
export agent_image="opea/agent:comps"
export recursion_limit=15

# LLM endpoint
export ip_address=$(hostname -I | awk '{print $1}')
vllm_port=8085
export HUGGINGFACEHUB_API_TOKEN=${HF_TOKEN}
export LLM_MODEL_ID="meta-llama/Meta-Llama-3.1-70B-Instruct"
export LLM_ENDPOINT_URL="http://${ip_address}:${vllm_port}"
echo "LLM_ENDPOINT_URL=${LLM_ENDPOINT_URL}"
export temperature=0.01
export max_new_tokens=4096

# Tools
export TOOLSET_PATH=${EVALDIR}/opea_sql_agent_llama/tools/
ls ${TOOLSET_PATH}
# for using Google search API
export GOOGLE_CSE_ID=${GOOGLE_CSE_ID}
export GOOGLE_API_KEY=${GOOGLE_API_KEY}

function start_sql_agent_llama_service(){
    export db_name=$1
    export db_path="sqlite:////home/user/TAG-Bench/dev_folder/dev_databases/${db_name}/${db_name}.sqlite"
    docker compose -f ${EVALDIR}/opea_sql_agent_llama/sql_agent_llama.yaml up -d
    # sleep 1m
}

db_name=$1
start_sql_agent_llama_service $db_name
@@ -0,0 +1,38 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

services:
  agent:
    image: ${agent_image}
    container_name: sql-agent-endpoint
    volumes:
      - ${TOOLSET_PATH}:/home/user/tools/ # tools
      # - ${WORKDIR}/GenAIComps/comps:/home/user/comps # code
      - ${WORKDIR}/TAG-Bench/:/home/user/TAG-Bench # SQL database and hints_file
    ports:
      - "9095:9095"
    ipc: host
    environment:
      ip_address: ${ip_address}
      strategy: sql_agent_llama
      with_memory: false
      db_name: ${db_name}
      db_path: ${db_path}
      use_hints: true
      hints_file: /home/user/TAG-Bench/${db_name}_hints.csv
      recursion_limit: ${recursion_limit}
      llm_engine: vllm
      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      llm_endpoint_url: ${LLM_ENDPOINT_URL}
      model: ${LLM_MODEL_ID}
      temperature: ${temperature}
      max_new_tokens: ${max_new_tokens}
      stream: false
      tools: /home/user/tools/sql_agent_tools.yaml
      require_human_feedback: false
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      port: 9095
      GOOGLE_CSE_ID: ${GOOGLE_CSE_ID} # delete
      GOOGLE_API_KEY: ${GOOGLE_API_KEY} # delete
@@ -0,0 +1,19 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


def search_web(query: str) -> str:
    """Search the web for information not contained in databases."""
    from langchain_core.tools import Tool
    from langchain_google_community import GoogleSearchAPIWrapper

    search = GoogleSearchAPIWrapper()

    tool = Tool(
        name="google_search",
        description="Search Google for recent results.",
        func=search.run,
    )

    response = tool.run(query)
    return response
@@ -0,0 +1,11 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

search_web:
  description: Search the web for a given query.
  callable_api: sql_agent_tools.py:search_web
  args_schema:
    query:
      type: str
      description: query
  return_output: retrieved_data
@@ -0,0 +1,55 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import glob
import os

import pandas as pd


def generate_column_descriptions(db_name):
    """Collect column descriptions from the TAG-Bench database_description files and save them as a hints CSV."""
    descriptions = []
    working_dir = os.getenv("WORKDIR")
    assert working_dir is not None, "WORKDIR environment variable is not set."
    DESCRIPTION_FOLDER = os.path.join(
        working_dir, f"TAG-Bench/dev_folder/dev_databases/{db_name}/database_description/"
    )
    table_files = glob.glob(os.path.join(DESCRIPTION_FOLDER, "*.csv"))
    table_name_col = []
    col_name_col = []
    for table_file in table_files:
        table_name = os.path.basename(table_file).split(".")[0]
        print("Table name: ", table_name)
        df = pd.read_csv(table_file, encoding_errors="ignore")
        for _, row in df.iterrows():
            col_name = row["original_column_name"]
            if not pd.isnull(row["value_description"]):
                description = str(row["value_description"])
                if description.lower() in col_name.lower():
                    # Skip descriptions that merely repeat the column name.
                    print("Description {} is same as column name {}".format(description, col_name))
                else:
                    description = description.replace("\n", " ")
                    description = " ".join(description.split())
                    descriptions.append(description)
                    table_name_col.append(table_name)
                    col_name_col.append(col_name)
    hints_df = pd.DataFrame({"table_name": table_name_col, "column_name": col_name_col, "description": descriptions})
    tag_bench_dir = os.path.join(working_dir, "TAG-Bench")
    output_file = os.path.join(tag_bench_dir, f"{db_name}_hints.csv")
    hints_df.to_csv(output_file, index=False)
    print(f"Generated hints file: {output_file}")


if __name__ == "__main__":
    tag_bench_dir = os.path.join(os.getenv("WORKDIR"), "TAG-Bench/dev_folder/dev_databases/")
    subfolders = [f.name for f in os.scandir(tag_bench_dir) if f.is_dir()]
    print("Databases: ", subfolders)
    for db_name in subfolders:
        print("Generating hints for database: ", db_name)
        generate_column_descriptions(db_name)
        print("=" * 30)
@@ -0,0 +1,6 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

DATAPATH=$WORKDIR/TAG-Bench/tag_queries.csv
OUTFOLDER=$WORKDIR/TAG-Bench/query_by_db
python3 split_data.py --path $DATAPATH --output $OUTFOLDER
@@ -0,0 +1,27 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import os

import pandas as pd

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--path", type=str, required=True)
    parser.add_argument("--output", type=str, required=True)
    args = parser.parse_args()

    # if output folder does not exist, create it
    if not os.path.exists(args.output):
        os.makedirs(args.output)

    # Load the data
    data = pd.read_csv(args.path)

    # Split the data by domain
    domains = data["DB used"].unique()
    for domain in domains:
        domain_data = data[data["DB used"] == domain]
        out = os.path.join(args.output, f"query_{domain}.csv")
        domain_data.to_csv(out, index=False)
@@ -0,0 +1,26 @@
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import glob
import os

import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--filedir", type=str, required=True, help="Directory containing the csv files")
args = parser.parse_args()

filedir = args.filedir
csv_files = glob.glob(os.path.join(filedir, "*_graded.csv"))
print("Number of score files found: ", len(csv_files))
print(csv_files)

df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)
print(df.columns)
print("Average score of all questions: ", df["answer_correctness"].mean())

# average score per csv file
for f in csv_files:
    df = pd.read_csv(f)
    print(f, df["answer_correctness"].mean())