docs: update some retrievers how-to guides (langchain-ai#21607)

octoml · May 13, 2024 · b0f5a47
1 parent 480c02b
commit b0f5a47

Showing 4 changed files with 849 additions and 142 deletions.
diff --git a/docs/docs/how_to/MultiQueryRetriever.ipynb b/docs/docs/how_to/MultiQueryRetriever.ipynb
@@ -7,14 +7,16 @@
    "source": [
     "# How to use the MultiQueryRetriever\n",
     "\n",
-    "Distance-based vector database retrieval embeds (represents) queries in high-dimensional space and finds similar embedded documents based on \"distance\". But, retrieval may produce different results with subtle changes in query wording or if the embeddings do not capture the semantics of the data well. Prompt engineering / tuning is sometimes done to manually address these problems, but can be tedious.\n",
+    "Distance-based vector database retrieval embeds (represents) queries in high-dimensional space and finds similar embedded documents based on a distance metric. But, retrieval may produce different results with subtle changes in query wording, or if the embeddings do not capture the semantics of the data well. Prompt engineering / tuning is sometimes done to manually address these problems, but can be tedious.\n",
     "\n",
-    "The `MultiQueryRetriever` automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents. By generating multiple perspectives on the same question, the `MultiQueryRetriever` might be able to overcome some of the limitations of the distance-based retrieval and get a richer set of results."
+    "The [MultiQueryRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents. By generating multiple perspectives on the same question, the `MultiQueryRetriever` can mitigate some of the limitations of the distance-based retrieval and get a richer set of results.\n",
+    "\n",
+    "Let's build a vectorstore using the [LLM Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) blog post by Lilian Weng from the [RAG tutorial](/docs/tutorials/rag):"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 1,
    "id": "994d6c74",
    "metadata": {},
    "outputs": [],
@@ -50,7 +52,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 2,
    "id": "edbca101",
    "metadata": {},
    "outputs": [],
@@ -67,7 +69,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 3,
    "id": "9e6d3b69",
    "metadata": {},
    "outputs": [],
@@ -81,15 +83,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
-   "id": "e5203612",
+   "execution_count": 4,
+   "id": "bc93dc2b-9407-48b0-9f9a-338247e7eb69",
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "INFO:langchain.retrievers.multi_query:Generated queries: ['1. How can Task Decomposition be approached?', '2. What are the different methods for Task Decomposition?', '3. What are the various approaches to decomposing tasks?']\n"
+      "INFO:langchain.retrievers.multi_query:Generated queries: ['1. How can Task Decomposition be achieved through different methods?', '2. What strategies are commonly used for Task Decomposition?', '3. What are the various techniques for breaking down tasks in Task Decomposition?']\n"
      ]
     },
     {
@@ -98,54 +100,70 @@
        "5"
       ]
      },
-     "execution_count": 5,
+     "execution_count": 4,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "unique_docs = retriever_from_llm.get_relevant_documents(query=question)\n",
+    "unique_docs = retriever_from_llm.invoke(question)\n",
     "len(unique_docs)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "7e170263-facd-4065-bb68-d11fb9123a45",
+   "metadata": {},
+   "source": [
+    "Note that the underlying queries generated by the retriever are logged at the `INFO` level."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "c54a282f",
    "metadata": {},
    "source": [
     "#### Supplying your own prompt\n",
     "\n",
-    "You can also supply a prompt along with an output parser to split the results into a list of queries."
+    "Under the hood, `MultiQueryRetriever` generates queries using a specific [prompt](https://api.python.langchain.com/en/latest/_modules/langchain/retrievers/multi_query.html#MultiQueryRetriever). To customize this prompt:\n",
+    "\n",
+    "1. Make a [PromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.prompt.PromptTemplate.html) with an input variable for the question;\n",
+    "2. Implement an [output parser](/docs/concepts#output-parsers) like the one below to split the result into a list of queries.\n",
+    "\n",
+    "The prompt and output parser together must support the generation of a list of queries."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 5,
    "id": "d9afb0ca",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/chestercurme/.pyenv/versions/3.10.4/envs/sandbox310/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `LLMChain` was deprecated in LangChain 0.1.17 and will be removed in 0.3.0. Use RunnableSequence, e.g., `prompt | llm` instead.\n",
+      "  warn_deprecated(\n"
+     ]
+    }
+   ],
    "source": [
     "from typing import List\n",
     "\n",
     "from langchain.chains import LLMChain\n",
-    "from langchain.output_parsers import PydanticOutputParser\n",
+    "from langchain_core.output_parsers import BaseOutputParser\n",
     "from langchain_core.prompts import PromptTemplate\n",
-    "from pydantic import BaseModel, Field\n",
+    "from langchain_core.pydantic_v1 import BaseModel, Field\n",
     "\n",
     "\n",
     "# Output parser will split the LLM result into a list of queries\n",
-    "class LineList(BaseModel):\n",
-    "    # \"lines\" is the key (attribute name) of the parsed output\n",
-    "    lines: List[str] = Field(description=\"Lines of text\")\n",
-    "\n",
+    "class LineListOutputParser(BaseOutputParser[List[str]]):\n",
+    "    \"\"\"Output parser for a list of lines.\"\"\"\n",
     "\n",
-    "class LineListOutputParser(PydanticOutputParser):\n",
-    "    def __init__(self) -> None:\n",
-    "        super().__init__(pydantic_object=LineList)\n",
-    "\n",
-    "    def parse(self, text: str) -> LineList:\n",
+    "    def parse(self, text: str) -> List[str]:\n",
     "        lines = text.strip().split(\"\\n\")\n",
-    "        return LineList(lines=lines)\n",
+    "        return lines\n",
     "\n",
     "\n",
     "output_parser = LineListOutputParser()\n",
@@ -170,24 +188,24 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
-   "id": "6660d7ee",
+   "execution_count": 6,
+   "id": "2eca2d96-8057-4ed9-873d-fa1064c09acf",
    "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "INFO:langchain.retrievers.multi_query:Generated queries: [\"1. What is the course's perspective on regression?\", '2. Can you provide information on regression as discussed in the course?', '3. How does the course cover the topic of regression?', \"4. What are the course's teachings on regression?\", '5. In relation to the course, what is mentioned about regression?']\n"
+      "INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide insights on regression from the course material?', '2. How is regression discussed in the course content?', '3. What information does the course offer about regression analysis?', '4. What are the teachings of the course regarding regression?', '5. In what manner is regression covered in the course curriculum?']\n"
      ]
     },
     {
      "data": {
       "text/plain": [
-       "11"
+       "9"
       ]
      },
-     "execution_count": 7,
+     "execution_count": 6,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -199,9 +217,7 @@
     ")  # \"lines\" is the key (attribute name) of the parsed output\n",
     "\n",
     "# Results\n",
-    "unique_docs = retriever.get_relevant_documents(\n",
-    "    query=\"What does the course say about regression?\"\n",
-    ")\n",
+    "unique_docs = retriever.invoke(\"What does the course say about regression?\")\n",
     "len(unique_docs)"
    ]
   }
@@ -222,7 +238,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.1"
+   "version": "3.10.4"
   }
  },
  "nbformat": 4,

diff --git a/docs/docs/how_to/contextual_compression.ipynb b/docs/docs/how_to/contextual_compression.ipynb
@@ -12,13 +12,12 @@
     "Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. “Compressing” here refers to both compressing the contents of an individual document and filtering out documents wholesale.\n",
     "\n",
     "To use the Contextual Compression Retriever, you'll need:\n",
+    "\n",
     "- a base retriever\n",
     "- a Document Compressor\n",
     "\n",
     "The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents and passes them through the Document Compressor. The Document Compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.\n",
     "\n",
-    "![](https://drive.google.com/uc?id=1CtNgWODXZudxAWSRiWgSGEoTNrUFT98v)\n",
-    "\n",
     "## Get started"
    ]
   },
@@ -51,8 +50,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
-   "id": "2b0be066",
+   "execution_count": 2,
+   "id": "25c26947-958d-4219-8ca0-daa3a51bd344",
    "metadata": {},
    "outputs": [
     {
@@ -123,14 +122,12 @@
     "from langchain_openai import OpenAIEmbeddings\n",
     "from langchain_text_splitters import CharacterTextSplitter\n",
     "\n",
-    "documents = TextLoader(\"../../state_of_the_union.txt\").load()\n",
+    "documents = TextLoader(\"state_of_the_union.txt\").load()\n",
     "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
     "texts = text_splitter.split_documents(documents)\n",
     "retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever()\n",
     "\n",
-    "docs = retriever.get_relevant_documents(\n",
-    "    \"What did the president say about Ketanji Brown Jackson\"\n",
-    ")\n",
+    "docs = retriever.invoke(\"What did the president say about Ketanji Brown Jackson\")\n",
     "pretty_print_docs(docs)"
    ]
   },
@@ -145,24 +142,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
-   "id": "f08d19e6",
+   "execution_count": 3,
+   "id": "d83e3c63-bcde-43e9-998e-35bf2ebef49b",
    "metadata": {},
    "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/Users/harrisonchase/workplace/langchain/libs/langchain/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
-      "  warnings.warn(\n",
-      "/Users/harrisonchase/workplace/langchain/libs/langchain/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
-      "  warnings.warn(\n",
-      "/Users/harrisonchase/workplace/langchain/libs/langchain/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
-      "  warnings.warn(\n",
-      "/Users/harrisonchase/workplace/langchain/libs/langchain/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
-      "  warnings.warn(\n"
-     ]
-    },
     {
      "name": "stdout",
      "output_type": "stream",
@@ -184,7 +167,7 @@
     "    base_compressor=compressor, base_retriever=retriever\n",
     ")\n",
     "\n",
-    "compressed_docs = compression_retriever.get_relevant_documents(\n",
+    "compressed_docs = compression_retriever.invoke(\n",
     "    \"What did the president say about Ketanji Jackson Brown\"\n",
     ")\n",
     "pretty_print_docs(compressed_docs)"
@@ -204,23 +187,9 @@
   {
    "cell_type": "code",
    "execution_count": 5,
-   "id": "6fa3ec79",
+   "id": "39b13654-01d9-4006-9550-5f3e77cb4f23",
    "metadata": {},
    "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/Users/harrisonchase/workplace/langchain/libs/langchain/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
-      "  warnings.warn(\n",
-      "/Users/harrisonchase/workplace/langchain/libs/langchain/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
-      "  warnings.warn(\n",
-      "/Users/harrisonchase/workplace/langchain/libs/langchain/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
-      "  warnings.warn(\n",
-      "/Users/harrisonchase/workplace/langchain/libs/langchain/langchain/chains/llm.py:316: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.\n",
-      "  warnings.warn(\n"
-     ]
-    },
     {
      "name": "stdout",
      "output_type": "stream",
@@ -245,7 +214,7 @@
     "    base_compressor=_filter, base_retriever=retriever\n",
     ")\n",
     "\n",
-    "compressed_docs = compression_retriever.get_relevant_documents(\n",
+    "compressed_docs = compression_retriever.invoke(\n",
     "    \"What did the president say about Ketanji Jackson Brown\"\n",
     ")\n",
     "pretty_print_docs(compressed_docs)"
@@ -264,7 +233,7 @@
   {
    "cell_type": "code",
    "execution_count": 6,
-   "id": "e84aceea",
+   "id": "ee8d9486-db9a-4e24-aa11-ae40f34cc908",
    "metadata": {},
    "outputs": [
     {
@@ -293,21 +262,7 @@
       "\n",
       "We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. \n",
       "\n",
-      "We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n",
-      "----------------------------------------------------------------------------------------------------\n",
-      "Document 3:\n",
-      "\n",
-      "And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. \n",
-      "\n",
-      "As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \n",
-      "\n",
-      "While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. \n",
-      "\n",
-      "And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. \n",
-      "\n",
-      "So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.  \n",
-      "\n",
-      "First, beat the opioid epidemic.\n"
+      "We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.\n"
      ]
     }
    ],
@@ -321,7 +276,7 @@
     "    base_compressor=embeddings_filter, base_retriever=retriever\n",
     ")\n",
     "\n",
-    "compressed_docs = compression_retriever.get_relevant_documents(\n",
+    "compressed_docs = compression_retriever.invoke(\n",
     "    \"What did the president say about Ketanji Jackson Brown\"\n",
     ")\n",
     "pretty_print_docs(compressed_docs)"
@@ -340,7 +295,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 7,
    "id": "617a1756",
    "metadata": {},
    "outputs": [],
@@ -359,8 +314,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
-   "id": "c715228a",
+   "execution_count": 8,
+   "id": "40b9c1db-7ac2-4257-935a-b107da50bb43",
    "metadata": {},
    "outputs": [
     {
@@ -398,7 +353,7 @@
     "    base_compressor=pipeline_compressor, base_retriever=retriever\n",
     ")\n",
     "\n",
-    "compressed_docs = compression_retriever.get_relevant_documents(\n",
+    "compressed_docs = compression_retriever.invoke(\n",
     "    \"What did the president say about Ketanji Jackson Brown\"\n",
     ")\n",
     "pretty_print_docs(compressed_docs)"
@@ -429,7 +384,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.1"
+   "version": "3.10.4"
   }
  },
  "nbformat": 4,