
Low performance hotspots #1160

Open · wants to merge 586 commits into main

Changes from 1 commit

Commits (586)
159c0be
exposed AzureOpenAI provider (#698)
epinzur Dec 20, 2023
91bb9e0
ollama quickstart (#703)
joshreini1 Dec 20, 2023
5fa4b82
allow debug timeout to be adjusted (#713)
piotrm0 Dec 21, 2023
332858b
import llama only if needed (#714)
joshreini1 Dec 22, 2023
426fa70
fix dashboard starts for colab (#721)
piotrm0 Dec 22, 2023
37772e2
savE (#719)
walnutdust Dec 22, 2023
c26988c
first (#720)
piotrm0 Dec 22, 2023
ab57de9
Add shortcut to select_context() (#706)
joshreini1 Dec 22, 2023
a791a96
add optional (#723)
piotrm0 Dec 22, 2023
7e024c3
pydantic2 deprecation fix to model config (#724)
piotrm0 Dec 23, 2023
f2eefbd
Fix correctness prompt (#725)
shayaks Dec 23, 2023
0c0b484
Releases/rc trulens eval 0.20.0 (#727)
shayaks Dec 23, 2023
27b664c
Releases/rc trulens eval 0.20.0 (#729)
shayaks Dec 23, 2023
eff8562
debug migration issue in release pipeline (#726)
piotrm0 Dec 27, 2023
d7290be
azureopenai fixes (#735)
piotrm0 Jan 2, 2024
5c84ecc
fix typo (#739)
piotrm0 Jan 2, 2024
b8d9303
Update extract_score_and_reasons to work across providers (#732)
joshreini1 Jan 2, 2024
3b3b969
Update instrumentation docs (#737)
joshreini1 Jan 3, 2024
e2b5ad1
add instructions for installing from github (#740)
piotrm0 Jan 3, 2024
7e2d753
Automated File Generation from Docs Notebook Changes (#744)
github-actions[bot] Jan 4, 2024
1b7c516
adjust optional llama (#745)
piotrm0 Jan 4, 2024
c28c0af
adjust human feedback notebook (#746)
piotrm0 Jan 4, 2024
caa9205
Automated File Generation from Docs Notebook Changes (#749)
github-actions[bot] Jan 4, 2024
7343252
WithClassInfo bugfixes (#741)
piotrm0 Jan 4, 2024
b9a2fc9
update notebooks to test (#753)
piotrm0 Jan 4, 2024
98d9623
langchain thread executor rehack (#755)
piotrm0 Jan 4, 2024
7bdf01f
Fix subscripted generics typechecking for Python<3.10 (#754)
coreyhu Jan 4, 2024
feabdba
check for langchain legacy (#757)
piotrm0 Jan 4, 2024
02b11f0
pass bedrock provider to ground truth eval (#743)
joshreini1 Jan 4, 2024
3d58eee
convert structures to str in feedback result tables (#758)
piotrm0 Jan 4, 2024
4989a14
langchain provider fix (#759)
piotrm0 Jan 4, 2024
fb2f424
Releases/rc trulens eval 0.20.1 (#761)
joshreini1 Jan 5, 2024
408b838
Automated File Generation from Docs Notebook Changes (#762)
github-actions[bot] Jan 5, 2024
2488e64
add missing langchain provider docs (#760)
joshreini1 Jan 5, 2024
d511973
include excluded pydantic fields in json dumps (#768)
piotrm0 Jan 8, 2024
4f73448
changed the default model id to titan lite (#774)
rajib76 Jan 8, 2024
3fef759
documentation and fix to weakref usage (#771)
piotrm0 Jan 8, 2024
58649e2
more optional annotations and some bugfixes (#770)
piotrm0 Jan 8, 2024
966a716
Releases/rc trulens eval 0.20.2 (#780)
joshreini1 Jan 9, 2024
5146001
Automated File Generation from Docs Notebook Changes (#781)
github-actions[bot] Jan 9, 2024
724eee5
removes summarize_provider from groundedness variable (#785)
ingridstevens Jan 9, 2024
e36b997
more fixes to utility imports (#786)
piotrm0 Jan 9, 2024
c16b700
better prompt for GroundTruth feedback function + pydantic v2 valudat…
daniel-huang-1230 Jan 9, 2024
748400d
update langchain_agents notebook (#778)
piotrm0 Jan 10, 2024
ba83563
Josh/aws updates (#788)
joshreini1 Jan 10, 2024
af81e76
Fix missing f-strings (#790)
andrewisplinghoff Jan 10, 2024
845dcaa
optionals readme (#787)
piotrm0 Jan 10, 2024
def8172
TruLens-Eval v0.20.3 (#791)
joshreini1 Jan 10, 2024
ddd0150
Automated File Generation from Docs Notebook Changes (#792)
github-actions[bot] Jan 11, 2024
9db4b0c
Minor fixes- changing g"generation" to "generated_text" and updating …
vivekgangasani Jan 12, 2024
91569a4
fix precision error (#798)
joshreini1 Jan 16, 2024
e420568
Groundedness refactor (#801)
joshreini1 Jan 16, 2024
921b8c8
Evaluations Page Nits (#797)
joshreini1 Jan 22, 2024
12b71c9
deduplicating sync/async code (#793)
piotrm0 Jan 22, 2024
9bc29ea
error on deprecated passthrough methods (#803)
piotrm0 Jan 23, 2024
f6e61c6
virtual models for logging and evaluating existing data (#806)
piotrm0 Jan 23, 2024
194acd8
oopenai -> openai (#815)
joshreini1 Jan 23, 2024
9eb94ad
Fix summarization, rename to comprehensiveness (#816)
joshreini1 Jan 23, 2024
26fe140
deprecate more things (#817)
piotrm0 Jan 24, 2024
ac7d235
Generate Test Cases (#705)
joshreini1 Jan 24, 2024
ac1dc54
Refactor Evaluation Docs (#823)
joshreini1 Jan 25, 2024
9a34cae
Make OpenAI Optional (#827)
joshreini1 Jan 26, 2024
159b8dd
enable async unit tests (#831)
piotrm0 Jan 26, 2024
a303230
Releases/rc trulens eval 0.21.0 (#830)
joshreini1 Jan 29, 2024
2ad0670
Automated File Generation from Docs Notebook Changes (#837)
github-actions[bot] Jan 29, 2024
1cc234a
fix ellipsis issue (#840)
piotrm0 Jan 31, 2024
42a3520
factor out common error from app types (#832)
piotrm0 Jan 31, 2024
4ed5706
Docs refactor (#829)
joshreini1 Jan 31, 2024
43af1be
Automated File Generation from Docs Notebook Changes (#842)
github-actions[bot] Jan 31, 2024
c18afbc
few more ellipses (#843)
piotrm0 Feb 1, 2024
47983b4
update azure example, also show provider extension (#847)
joshreini1 Feb 2, 2024
2aea0b8
add testing with older python versions (#841)
piotrm0 Feb 2, 2024
d6682bd
Bump vite in /trulens_eval/trulens_eval/react_components/record_viewe…
dependabot[bot] Feb 2, 2024
5e2421c
Integration Testing (#838)
joshreini1 Feb 2, 2024
7eea89d
TruLens-Eval v0.22.0 release (#851)
joshreini1 Feb 3, 2024
5e55846
Automated File Generation from Docs Notebook Changes (#853)
github-actions[bot] Feb 5, 2024
47f806c
init cleanup (#852)
piotrm0 Feb 5, 2024
29d93a8
Randomly run evals based on record_id hash (#850)
joshreini1 Feb 6, 2024
9f917ed
fix typo (#857)
joshreini1 Feb 6, 2024
99d747c
Fix typo and adjust some debug printouts. (#866)
piotrm0 Feb 7, 2024
ace1f5b
non-threaded pacer (#874)
piotrm0 Feb 8, 2024
703cdaf
better deferred evaluation (#807)
piotrm0 Feb 8, 2024
2e9c9cf
Automated File Generation from Docs Notebook Changes (#876)
github-actions[bot] Feb 8, 2024
b2a979a
streamlit experimental query params -> query params (#860)
joshreini1 Feb 8, 2024
603972d
deferred progress status (#879)
piotrm0 Feb 8, 2024
f513157
Allow different schemas for Bedrock provider calls (#878)
joshreini1 Feb 8, 2024
326d535
pull from other branch (#881)
piotrm0 Feb 9, 2024
c044a88
fix st.query_params (#883)
joshreini1 Feb 9, 2024
3a9d0d4
TruLens-Eval v0.22.1 (#882)
joshreini1 Feb 9, 2024
9afa162
metadata datatype validation fix (#888)
aaronvarghese Feb 10, 2024
2198b3a
ensure agreement prompt outputs integer score as the last part of the…
daniel-huang-1230 Feb 12, 2024
8758d7c
pin llama-index temp (#893)
joshreini1 Feb 13, 2024
6b70a12
first (#892)
piotrm0 Feb 13, 2024
848898e
bedrock provider branching fix (#887)
joshreini1 Feb 13, 2024
0d7035a
TruLens-Eval 0.22.2 (#894)
joshreini1 Feb 13, 2024
fe3d79a
cleanup (#880)
piotrm0 Feb 14, 2024
05e1d5e
fix for in-memory sqlite params (#904)
piotrm0 Feb 15, 2024
a72aca1
Fix use case colab links (#900)
joshreini1 Feb 15, 2024
f5f3b1d
few site-related fixes to recently merged pr (#903)
piotrm0 Feb 15, 2024
895d0f6
comprehensiveness updates (#901)
joshreini1 Feb 15, 2024
3c31e71
model rebuilds as necessary (#905)
piotrm0 Feb 15, 2024
28fbf60
Deeper Instrumentation for Hybrid Retrievers + examples (#873)
joshreini1 Feb 16, 2024
5b3e8fc
various documentation fixes (#907)
piotrm0 Feb 16, 2024
deced20
Releases/rc trulens eval 0.23.0 (#908)
joshreini1 Feb 16, 2024
b141279
cost tracking tests and litellm cost tracking (#910)
piotrm0 Feb 21, 2024
df9adac
check packages on init (#917)
piotrm0 Feb 21, 2024
9bc058a
Increase provider test coverage to Huggingface feedback provider (#919)
venkatkakoju Feb 22, 2024
0569396
upgrade Llama-Index integration to 0.10 (#891)
joshreini1 Feb 22, 2024
da29a66
Automated File Generation from Docs Notebook Changes (#922)
github-actions[bot] Feb 22, 2024
5f98b00
Update issue templates (#923)
joshreini1 Feb 22, 2024
3699ce5
async handling adjustments (#918)
piotrm0 Feb 23, 2024
3c2fe22
update instrumentation notebooks and related nits (#931)
piotrm0 Feb 23, 2024
60d4206
TruLens-Eval 0.24.0 (#927)
joshreini1 Feb 23, 2024
c959a92
Update selecting_components.md (#926)
joshreini1 Feb 23, 2024
a91e294
bump patchlevel (#933)
piotrm0 Feb 23, 2024
979295b
makefile targets for release process (#934)
piotrm0 Feb 23, 2024
c147bbc
Better selection of main input/output (#938)
joshreini1 Feb 28, 2024
9538049
Documentation structure and heading pages (#945)
piotrm0 Mar 1, 2024
8826a03
update tru virtual docs (#949)
piotrm0 Mar 4, 2024
1f2a828
[MLNN-1217] Improve regex matching for structured output extraction f…
daniel-huang-1230 Mar 5, 2024
a8d473e
instrumentation notebook updates and fixes (#953)
piotrm0 Mar 6, 2024
6dfd13d
add nemo guardrails integrations (#824)
piotrm0 Mar 6, 2024
491bb10
Fix release test pipeline (#962)
joshreini1 Mar 6, 2024
f008169
extract response attr if app returns object with that attr (#865)
joshreini1 Mar 6, 2024
d896a4e
fix links in docs (#963)
piotrm0 Mar 6, 2024
bdc883f
Automated File Generation from Docs Notebook Changes (#964)
github-actions[bot] Mar 6, 2024
dc1c5d6
add back all_tolls (#965)
joshreini1 Mar 7, 2024
80d2500
version bump (#966)
joshreini1 Mar 7, 2024
062ade8
canopy quickstart (#925)
joshreini1 Mar 7, 2024
d4e0fdd
add missing virtual app setup and redirects from old core concept lin…
joshreini1 Mar 7, 2024
d371af5
fix links in docs (#968)
piotrm0 Mar 8, 2024
9654f2b
Update typo in pip install on azure_openai.ipynb (#973)
ingridstevens Mar 8, 2024
dafabc6
readd redirects (#972)
piotrm0 Mar 8, 2024
f6984f1
Automated File Generation from Docs Notebook Changes (#975)
github-actions[bot] Mar 8, 2024
c0a0cf3
fix colab link - langchain ensemble notebook (#980)
joshreini1 Mar 8, 2024
1d0c0fc
allow deserialization for faiss example (#978)
joshreini1 Mar 8, 2024
0a2e5b5
fix truchain (#974)
joshreini1 Mar 8, 2024
f77e7ee
patch (#982)
joshreini1 Mar 8, 2024
1c2a646
QS Relevance -> Context Relevance (#977)
joshreini1 Mar 8, 2024
15a19f7
Automated File Generation from Docs Notebook Changes (#983)
github-actions[bot] Mar 9, 2024
b25575e
version bump (#981)
joshreini1 Mar 9, 2024
40f78bb
existing data quickstart (#976)
joshreini1 Mar 9, 2024
1058a5d
Adds Azure Quickstart for LangChain (#984)
ingridstevens Mar 9, 2024
4398e5c
fix more docs links (#987)
piotrm0 Mar 11, 2024
d92e79f
update (#990)
piotrm0 Mar 12, 2024
5927f66
verify feedback selectors on recorder init (#961)
piotrm0 Mar 12, 2024
759d035
relax llama version (#985)
joshreini1 Mar 12, 2024
83223c4
Allow VirtualRecords to have multiple calls to the same component. (#…
piotrm0 Mar 12, 2024
76e28ff
Fix broken colab links (#994)
joshreini1 Mar 13, 2024
c5353c4
docs updates/additions (#996)
piotrm0 Mar 13, 2024
bb11c2d
add install redirect in docs (#995)
joshreini1 Mar 13, 2024
5c1d2ff
Update feedback docs (#999)
joshreini1 Mar 14, 2024
0c6ba29
doc usage formatting (#1002)
piotrm0 Mar 15, 2024
eaa10f2
Allow Feedback.run with args even if they had selectors specified. (#…
piotrm0 Mar 15, 2024
57b3078
version bump (#1004)
joshreini1 Mar 15, 2024
6fccee8
Python 3.12 support (#1012)
joshreini1 Mar 19, 2024
75c5838
documentation nits and moves (#1015)
piotrm0 Mar 20, 2024
41936d8
Fix shield link to docs (#1019)
joshreini1 Mar 21, 2024
0387610
Create pull_request_template.md (#1021)
piotrm0 Mar 22, 2024
15fe242
Automated File Generation from Docs Notebook Changes (#1020)
github-actions[bot] Mar 22, 2024
45ee7b1
Fix DB Issues (#1023)
arn-tru Mar 22, 2024
0f9901e
typo in rag_triad.md (#1022)
joshreini1 Mar 22, 2024
502889e
Fix paul_graham_essay.txt links to new location (#1024)
joshreini1 Mar 22, 2024
c7814d6
Feedback upgrades (#1018)
joshreini1 Mar 22, 2024
cae1e2f
updated App.select_context to support MultiQueryRetriever of langchai…
sayedsohan Mar 22, 2024
bff1cdc
Add Vectara Hallucination Detection Model (#950)
Josephrp Mar 23, 2024
6f05a29
Parametrize temperature for create chat completion (#1026)
daniel-huang-1230 Mar 23, 2024
7974d12
0.27.0 version bump (#1027)
joshreini1 Mar 23, 2024
82f2d68
first (#1030)
piotrm0 Mar 26, 2024
f9fbef4
docs | standards on proper names (#997)
markdavidmc0 Mar 26, 2024
41d37f3
docs glossary (#1029)
piotrm0 Mar 26, 2024
9086f3e
fix doc link in hybrid retriever notebook (#1035)
daniel-huang-1230 Mar 27, 2024
63e436b
docs README (#1034)
joshreini1 Mar 27, 2024
23bad45
Update with_app.md (#1036)
nicoloboschi Mar 27, 2024
cbd6d36
more pipelines docs (#1033)
piotrm0 Mar 28, 2024
c30842a
Automated File Generation from Docs Notebook Changes (#1031)
github-actions[bot] Mar 28, 2024
2b5e303
add missing job name (#1037)
joshreini1 Mar 28, 2024
d013107
Add if_missing. (#1038)
piotrm0 Apr 1, 2024
f0484de
Docs updates for feedback, instrumentation apis, examples (#1032)
joshreini1 Apr 1, 2024
8a84b49
fix (#1043)
piotrm0 Apr 2, 2024
94088be
[DOCS] more proper names and glossary terms (#1042)
piotrm0 Apr 2, 2024
66fd06d
Added feedback button to trulens (#1046)
arn-tru Apr 2, 2024
99a80e0
Import improvements, fix version conflicts (#1047)
joshreini1 Apr 3, 2024
3da61a0
Fix import and favicon (#1049)
arn-tru Apr 3, 2024
ccf03b4
remove pkg_resources and distutils (#1052)
piotrm0 Apr 3, 2024
a9944b8
version bump (#1053)
joshreini1 Apr 4, 2024
29f1efb
add missing pprint import (#1054)
joshreini1 Apr 4, 2024
b6c96e9
bump 27 2 (#1055)
joshreini1 Apr 4, 2024
a8c350a
Meta-eval / feeback functions benchmarking notebooks, ranking-based e…
daniel-huang-1230 Apr 5, 2024
acba5ea
MongoDB Atlas quickstart (#1056)
joshreini1 Apr 6, 2024
5dbe80a
OpenAI Assistants API (quickstart) (#1041)
joshreini1 Apr 6, 2024
b9189ce
App delete functionality added (#1061)
arn-tru Apr 9, 2024
6f0973e
Queue fixed for python version lower than 3.9 (#1066)
arn-tru Apr 9, 2024
5644b1a
Added lanchain provider tests (#1062)
arn-tru Apr 17, 2024
d03fa8b
docs fixes (#1075)
piotrm0 Apr 17, 2024
1368970
configurable table prefix (#971)
piotrm0 Apr 17, 2024
5af1347
add example service file (#1072)
piotrm0 Apr 17, 2024
ba4cae3
fix test-tru (#1070)
piotrm0 Apr 17, 2024
14f1813
commented out broken tests (#1076)
arn-tru Apr 17, 2024
02c1bf9
fix legacy db missing abstract method (#1077)
piotrm0 Apr 17, 2024
3d33ffb
release test fixes (#1078)
piotrm0 Apr 17, 2024
ece8892
Version Bump 0.28.0 (#1079)
arn-tru Apr 17, 2024
b79a91a
Merge branch 'releases/rc-trulens-eval-0.28.0' into main
piotrm0 Apr 17, 2024
5cd3967
Update Main Branch to Match Latest Release (#1083)
arn-tru Apr 17, 2024
2e93282
feature(Improved Trace Display): [MLNN-1342] Improved Trace Display (…
walnutdust Apr 19, 2024
e4b17f0
improvements(configs): Add import-related configs (#1091)
walnutdust Apr 22, 2024
2a9ae32
add maintainers and releases files (#1085)
piotrm0 Apr 22, 2024
c5dceee
Fix Colab links in expositional example notebooks (#1095)
daniel-huang-1230 Apr 23, 2024
7bc2ec2
better singleton already made warnings (#1088)
piotrm0 Apr 23, 2024
8deebee
Fixes to package build. (#1093)
piotrm0 Apr 23, 2024
57bc3b6
remove legacy db implementation (#1084)
piotrm0 Apr 23, 2024
3da0f6a
split schema.py (#1090)
piotrm0 Apr 23, 2024
85d1ed5
patch release 0.28.1 (#1094)
piotrm0 Apr 23, 2024
47e7d77
Fix dividing by zero error in context_relevance_with_cot_reasons + at…
daniel-huang-1230 Apr 24, 2024
46d362e
Prepare notebook for Assistants API hackathon (#1102)
daniel-huang-1230 Apr 26, 2024
96d6765
Show OSS models (and tracking) in LiteLLM application (#1109)
joshreini1 Apr 29, 2024
d4d7973
fix(pills): Fixed bug with trace view initialization when no feedback…
walnutdust Apr 30, 2024
0591fcc
chore: remove unused code cell (#1113)
stokedout Apr 30, 2024
73385b4
Automated File Generation from Docs Notebook Changes (#1114)
github-actions[bot] Apr 30, 2024
6f1fd6a
Remove references to running moderation endpoint on AzureOpenAI (#1116)
joshreini1 May 1, 2024
837784c
swap rag utility (qs)relevance (#1120)
piotrm0 May 2, 2024
7bd5d00
Fix Link (#1128)
timbmg May 7, 2024
a311b0b
update groundedness prompt (#1112)
bpmcgough May 7, 2024
f231f9c
Fix docs links in instrumentation (#1129)
joshreini1 May 7, 2024
11f22e1
fix rag triad and awaitable calls (#1110)
piotrm0 May 7, 2024
fc2c5a2
trurails: update to getattr (#1130)
joshreini1 May 9, 2024
529c68c
Remove placeholder feedback (#1127)
arn-tru May 9, 2024
7457e53
Default names for rag triad utility (#1122)
joshreini1 May 9, 2024
45338b0
add reasons to answer, context relevance; add collect to groundedness
joshreini1 May 10, 2024
9f7bbab
Update custom functions notebook to reflect context relevance change …
joshreini1 May 13, 2024
e365a6e
Update custom_feedback_functions.ipynb
joshreini1 May 14, 2024
744ea34
Update README.md (#1136)
eltociear May 14, 2024
d8ff339
Unify groundedness interface (#1135)
joshreini1 May 14, 2024
78dbb12
dont iterate streams in openai cost tracking (#1138)
piotrm0 May 15, 2024
95d8d0b
Automated File Generation from Docs Notebook Changes (#1137)
github-actions[bot] May 15, 2024
c424cd4
Fix a few old groundedness references (#1139)
joshreini1 May 15, 2024
e557792
Update all_tools.py
joshreini1 May 15, 2024
9a6edec
Update llama_index_quickstart.ipynb
joshreini1 May 15, 2024
e41e513
Automated File Generation from Docs Notebook Changes (#1141)
github-actions[bot] May 15, 2024
32de002
update comprehensiveness + nb (#1064)
joshreini1 May 16, 2024
05f7e74
0.29.0 version bump (#1140)
joshreini1 May 17, 2024
0c1d745
Automated File Generation from Docs Notebook Changes (#1143)
github-actions[bot] May 17, 2024
43bfb74
glossary additions (#1144)
piotrm0 May 20, 2024
e06afaf
Update ollama_quickstart.ipynb
joshreini1 May 21, 2024
cf8cd67
Patch imports error (#1146)
joshreini1 May 24, 2024
8487b2c
add langchain_community to optional imports and checks for use of ope…
piotrm0 May 24, 2024
e92b4a2
0.30.0 bump (#1158)
joshreini1 May 25, 2024
b53c597
0.30.1 bump (#1159)
joshreini1 May 25, 2024
ad70beb
Added the additional hotspot analysis pages and the notebook to rende…
bodhisaha May 25, 2024
7fac8cc
changed the notebook to run on 50 queries without all feedback scores…
bodhisaha May 30, 2024
Randomly run evals based on record_id hash (#850)
* prototype

* update md

* fix run_feedback

* update

* update

* update
joshreini1 authored Feb 6, 2024
commit 29d93a80e4408ccdd7d807d38c109760f171fd82
12 changes: 4 additions & 8 deletions docs/trulens_eval/feedback_functions_existing_data.md
@@ -14,15 +14,11 @@ feedback_result = provider.relevance("<some prompt>", "<some response>")
 In the case that you have already logged a run of your application with TruLens and have the record available, the process for running an (additional) evaluation on that record is to use `tru.run_feedback_functions`:
 
 ```python
-tru_recorder = TruChain(
-    chain,
-    app_id='Chain1_ChatApplication'
-)
-
-with tru_recorder as recording:
-    record = chain("What is langchain?")
+tru_rag = TruCustomApp(rag, app_id = 'RAG v1')
 
-tru.run_feedbacks(record, feedbacks=[f_lang_match, f_qa_relevance, f_context_relevance])
+result, record = tru_rag.with_record(rag.query, "How many professors are at UW in Seattle?")
+feedback_results = tru.run_feedback_functions(record, feedbacks=[f_lang_match, f_qa_relevance, f_context_relevance])
+tru.add_feedbacks(feedback_results)
 ```
 
 ### TruVirtual
390 changes: 390 additions & 0 deletions trulens_eval/examples/experimental/random_evaluation.ipynb
@@ -0,0 +1,390 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Evaluation of Records\n",
"\n",
"This notebook walks through the random evaluation of records with TruLens.\n",
"\n",
"This is useful in cases where we want to log all application runs, but it is expensive to run evaluations each time. To gauge the performance of the app, we need *some* evaluations, so it is useful to evaluate a representative sample of records. We can do this after each record selectively running and logging feedback based on some randomization scheme.\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/trulens/blob/main/trulens_eval/examples/experimental/random_evaluation.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ! pip install trulens_eval==0.22.0 chromadb==0.4.18 openai==1.3.7"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get Data\n",
"\n",
"In this case, we'll just initialize some simple text in the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"university_info = \"\"\"\n",
"The University of Washington, founded in 1861 in Seattle, is a public research university\n",
"with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.\n",
"As the flagship institution of the six public universities in Washington state,\n",
"UW encompasses over 500 buildings and 20 million square feet of space,\n",
"including one of the largest library systems in the world.\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Vector Store\n",
"\n",
"Create a chromadb vector store in memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"oai_client = OpenAI()\n",
"\n",
"oai_client.embeddings.create(\n",
" model=\"text-embedding-ada-002\",\n",
" input=university_info\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import chromadb\n",
"from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction\n",
"\n",
"embedding_function = OpenAIEmbeddingFunction(api_key=os.environ.get('OPENAI_API_KEY'),\n",
" model_name=\"text-embedding-ada-002\")\n",
"\n",
"chroma_client = chromadb.Client()\n",
"vector_store = chroma_client.get_or_create_collection(name=\"Universities\",\n",
" embedding_function=embedding_function)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"Add the university_info to the embedding database."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"vector_store.add(\"uni_info\", documents=university_info)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build RAG from scratch\n",
"\n",
"Build a custom RAG from scratch, and add TruLens custom instrumentation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from trulens_eval import Tru\n",
"from trulens_eval.tru_custom_app import instrument\n",
"tru = Tru()\n",
"tru.reset_database()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class RAG_from_scratch:\n",
" @instrument\n",
" def retrieve(self, query: str) -> list:\n",
" \"\"\"\n",
" Retrieve relevant text from vector store.\n",
" \"\"\"\n",
" results = vector_store.query(\n",
" query_texts=query,\n",
" n_results=2\n",
" )\n",
" return results['documents'][0]\n",
"\n",
" @instrument\n",
" def generate_completion(self, query: str, context_str: list) -> str:\n",
" \"\"\"\n",
" Generate answer from context.\n",
" \"\"\"\n",
" completion = oai_client.chat.completions.create(\n",
" model=\"gpt-3.5-turbo\",\n",
" temperature=0,\n",
" messages=\n",
" [\n",
" {\"role\": \"user\",\n",
" \"content\": \n",
" f\"We have provided context information below. \\n\"\n",
" f\"---------------------\\n\"\n",
" f\"{context_str}\"\n",
" f\"\\n---------------------\\n\"\n",
" f\"Given this information, please answer the question: {query}\"\n",
" }\n",
" ]\n",
" ).choices[0].message.content\n",
" return completion\n",
"\n",
" @instrument\n",
" def query(self, query: str) -> str:\n",
" context_str = self.retrieve(query)\n",
" completion = self.generate_completion(query, context_str)\n",
" return completion\n",
"\n",
"rag = RAG_from_scratch()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up feedback functions.\n",
"\n",
"Here we'll use groundedness, answer relevance and context relevance to detect hallucination."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from trulens_eval import Feedback, Select\n",
"from trulens_eval.feedback import Groundedness\n",
"from trulens_eval.feedback.provider.openai import OpenAI as fOpenAI\n",
"\n",
"import numpy as np\n",
"\n",
"# Initialize provider class\n",
"fopenai = fOpenAI()\n",
"\n",
"grounded = Groundedness(groundedness_provider=fopenai)\n",
"\n",
"# Define a groundedness feedback function\n",
"f_groundedness = (\n",
" Feedback(grounded.groundedness_measure_with_cot_reasons, name = \"Groundedness\")\n",
" .on(Select.RecordCalls.retrieve.rets.collect())\n",
" .on_output()\n",
" .aggregate(grounded.grounded_statements_aggregator)\n",
")\n",
"\n",
"# Question/answer relevance between overall question and answer.\n",
"f_qa_relevance = (\n",
" Feedback(fopenai.relevance_with_cot_reasons, name = \"Answer Relevance\")\n",
" .on(Select.RecordCalls.retrieve.args.query)\n",
" .on_output()\n",
")\n",
"\n",
"# Question/statement relevance between question and each context chunk.\n",
"f_context_relevance = (\n",
" Feedback(fopenai.qs_relevance_with_cot_reasons, name = \"Context Relevance\")\n",
" .on(Select.RecordCalls.retrieve.args.query)\n",
" .on(Select.RecordCalls.retrieve.rets.collect())\n",
" .aggregate(np.mean)\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Construct the app\n",
"Wrap the custom RAG with TruCustomApp, add list of feedbacks for eval"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from trulens_eval import TruCustomApp\n",
"from trulens_eval import FeedbackMode\n",
"tru_rag = TruCustomApp(rag, app_id = 'RAG v1')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Eval Randomization\n",
"\n",
"Create a function to run feedback functions randomly, depending on the record_id hash"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import hashlib\n",
"import random\n",
"\n",
"from typing import Sequence, Iterable\n",
"from trulens_eval.schema import Record, FeedbackResult\n",
"from trulens_eval.feedback import Feedback\n",
"\n",
"def random_run_feedback_functions(\n",
" record: Record,\n",
" feedback_functions: Sequence[Feedback]\n",
" ) -> Iterable[FeedbackResult]:\n",
" \"\"\"\n",
" Given the record, randomly decide to run feedback functions.\n",
"\n",
" args:\n",
" record (Record): The record on which to evaluate the feedback functions\n",
"\n",
" feedback_functions (Sequence[Feedback]): A collection of feedback functions to evaluate.\n",
"\n",
" returns:\n",
" `FeedbackResult`, one for each element of `feedback_functions`, or prints \"Feedback skipped for this record\".\n",
"\n",
" \"\"\"\n",
" # randomly decide to run feedback (50% chance)\n",
" decision = random.choice([True, False])\n",
" # run feedback if decided\n",
" if decision == True:\n",
" print(\"Feedback run for this record\")\n",
" tru.add_feedbacks(tru.run_feedback_functions(record, feedback_functions = [f_context_relevance, f_groundedness, f_qa_relevance]))\n",
" else:\n",
" print(\"Feedback skipped for this record\")"
]
},
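{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side sketch (not part of the original notebook): the 50% coin flip above generalizes to any sampling rate by mapping the record_id hash onto [0, 1) and comparing it to a threshold. The helper below and its `sample_rate` parameter are hypothetical illustrations, not TruLens APIs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import hashlib\n",
"\n",
"# Hypothetical helper: deterministic sampling at an arbitrary rate.\n",
"def should_run_feedback(record_id: str, sample_rate: float = 0.5) -> bool:\n",
"    # map the record_id hash onto [0, 1) and compare against the rate\n",
"    bucket = int(hashlib.sha256(record_id.encode()).hexdigest(), 16) % 10_000\n",
"    return (bucket / 10_000) < sample_rate\n",
"\n",
"# e.g. evaluate roughly 20% of records:\n",
"# should_run_feedback(record.record_id, sample_rate=0.2)"
]
},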
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate a test set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from trulens_eval.generate_test_set import GenerateTestSet\n",
"test = GenerateTestSet(app_callable = rag.query)\n",
"test_set = test.generate_test_set(test_breadth = 4, test_depth = 1)\n",
"test_set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run the app\n",
"Run and log the rag applicaiton for each prompt in the test set. For a random subset of cases, also run evaluations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# run feedback across test set\n",
"for category in test_set:\n",
" # run prompts in each category\n",
" test_prompts = test_set[category]\n",
" for test_prompt in test_prompts:\n",
" result, record = tru_rag.with_record(rag.query, \"How many professors are at UW in Seattle?\")\n",
" # random run feedback based on record_id\n",
" random_run_feedback_functions(record, feedback_functions = [f_context_relevance, f_groundedness, f_qa_relevance])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tru.get_leaderboard(app_ids=[\"RAG v1\"])"
]
},
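{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally (a sketch added for illustration), inspect which records actually received feedback using `Tru.get_records_and_feedback`; with ~50% sampling, roughly half of the feedback values should be populated."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve logged records along with the names of feedback columns;\n",
"# rows with missing feedback values are the records that were skipped.\n",
"records_df, feedback_cols = tru.get_records_and_feedback(app_ids=[\"RAG v1\"])\n",
"records_df[feedback_cols].notna().mean()  # fraction of records evaluated, per feedback"
]
},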
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tru.run_dashboard()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "trulens18_release",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}