Refactoring PDF loaders: all #28970

pprados · 2024-12-30T15:35:34Z

This PR is a composition of many other PRs. Modifications will be published one after the other, to facilitate analysis and integration into langchain.

Refactoring all PDF loader and parser: community

Description: refactoring of PDF parsers and loaders. See below
Issue: missing locks, parameter inconsistency, missing lazy approach, split loader and parser, etc.
Twitter handle: pprados
Add tests and docs:
- Add tests to check the consistency of different implementations
- Add tests to check the array and images extraction
- Update or add notebooks in docs/docs/integrations directory
Lint and test: done

Rational

Even though Document has a page_content parameter (rather than text or body), we believe it’s not good practice to work with pages. Indeed, this approach creates memory gaps in RAG projects. If a paragraph spans two pages, the beginning of the paragraph is at the end of one page, while the rest is at the start of the next. With a page-based approach, there will be two separate chunks, each containing part of a sentence. The corresponding vectors won’t be relevant. These chunks are unlikely to be selected when there’s a question specifically about the split paragraph. If one of the chunks is selected, there’s little chance the LLM can answer the question. This issue is worsened by the injection of headers, footers (if parsers haven’t properly removed them), images, or tables at the end of a page, as most current implementations tend to do.

Why is it important to unify the different parsers? Each has its own characteristics and strategies, more or less effective depending on the family of PDF files. One strategy is to identify the family of the PDF file (by inspecting the metadata or the content of the first page) and then select the most efficient parser in that case. By unifying parsers, the following code doesn't need to deal with the specifics of different parsers, as the result is similar for each. We'll propose a Parser using this strategy in another PR.

The PR

We propose a substantial PR to improve the different PDF parser integrations. All my clients struggle with PDFs. I took the initiative to address this issue at its root by refactoring the various integrations of Python PDF parsers. The goal is to standardize a minimum set of parameters and metadata and bring improvements to each one (bug fixes, feature additions).

Don't worry about the size of the PR. In the end, there are only two modified files. The rest is just updating unit tests and docs.

source	what
- `langchain_community/document_loaders/pdf.py` - `langchain_community/document_loaders/parsers/pdf.py`	Modified source code
- `langchain_community/tests/integration_tests/document_loaders/pdf.py` - `langchain_community/tests/integration_tests/document_loaders/parsers/pdf.py`	Modified tests
- `docs/docs/integrations/document_loaders/pdf.ipynb`	A replication off one notebook
- `docs/docs/how_to/document_loader_pdf.ipynb`	An overview of pdf parsing
- `docs/docs/how_to/document_loader_custom.ipynb`	Enhanced separation of loaders and parsers

In order to qualify all the code, we worked in a separate project, using the langchain-common structure. In this way, we can compare the results of the historical implementation with the new ones.

We understand that it's important to ensure that changes don't have a significant impact on existing code. That's why we used a parallel project, using the langchain-common structure, to test PDF readings before and after modifications. This allows us to compare results. You'll find all the files here.
The only difference is the name to import classes.

All this // project is available here. Consult the compare_old_new directory with your development environment, using DIFF to identify differences.

git clone https://github.com/pprados/patch_langchain_common.git
cd patch_langchain_common/compare_old_new

Metadata

All parsers use lowercase keys for pdf file metadata. Except PDFPlumberParser. For this particular case, we've added a dictionary wrapper that warns when keys with upper case letters are used.

Images

The current implementation in LangChain involves asking each parser for the text on a page, then retrieving images to apply OCR. The text extracted from images is then appended to the end of the page text, which may split paragraphs across pages, worsening the RAG model’s performance.

To avoid this, we modified the strategy for injecting OCR results from images. Now, the result is inserted between two paragraphs of text (\n\n or \n), just before the end of the page. This allows a half-paragraph to be combined with the first paragraph of the following page.

Currently, the LangChain implementation uses RapidOCR to analyze images and extract any text. This algorithm is designed to work with Chinese and English, not other languages. Since the implementation uses a function rather than a method, it’s not possible to modify it. We have modified the various parsers to allow for selecting the algorithm to analyze images. Now, it’s possible to use RapidOCR, Tesseract, or invoke a multimodal LLM to get a description of the image.

To standardize this, we propose a new abstract class:

class ImagesPdfParser(BaseBlobParser):
    …

For converting images to text, the possible formats are: text, markdown, and HTML. Why is this important? If it’s necessary to split a result, based on the origin of the text fragments, it’s possible to do so at the level of image translations. An identification rule such as ![text](...) or <img …/> allows us to identify text fragments originating from an image.

Tables

Tables present in PDF files are another challenge. Some algorithms can detect part of them. This typically involves a specialized process, separate from the text flow. That is, the text extracted from the page includes each cell's content, sometimes in columns, sometimes in rows. This text is challenging for the LLM to interpret. Depending on the capabilities of the libraries, it may be possible to detect tables, then identify the cell boxes during text extraction to inject the table in its entirety. This way, the flow remains coherent. It’s even possible to add a few paragraphs before and after the table to prompt an LLM to describe it. Only the description of the table will be used for embedding.

Tables identified in PDF pages can be translated into markdown (if there are no merged cells) or HTML (which consumes more tokens). LLMs can then make use of them.

Unfortunately, this approach isn’t always feasible. In such cases, we can apply the approach used for images, by injecting tables and images between two paragraphs in the page’s text flow. This is always better than placing them at the end of the page.

Combining Pages

As mentioned, in a RAG project, we want to work with the text flow of a document, rather than by page. A mode is dedicated to this, which can be configured to specify the character to use for page delimiters in the flow. This could simply be \n, ------\n or \f to clearly indicate a page change, or  for seamless injection in a Markdown viewer without a visual effect.

Why is it important to identify page breaks when retrieving the full document flow? Because we generally want to provide a URL with the chunk’s location when the LLM answers. While it’s possible to reference the entire PDF, this isn’t practical if it’s more than two pages long. It’s better to indicate the specific page to display in the URL. Therefore, assistance is needed so that chunking algorithms can add the page metadata to each chunk. The choice of delimiter helps the algorithm calculate this parameter.

Similarly, we’ve added metadata in all parsers with the total number of pages in the document. Why is this important? If we want to reference a document, we need to determine if it’s relevant. A reference is valid if it helps the user quickly locate the fragment within the document (using the page and/or a chunk excerpt). But if the URL points to a PDF file without a page number (for various reasons) and the file has a large number of pages, we want to remove the reference that doesn’t assist the user. There’s no point in referencing a 100-page document! The total_pages metadata can then be used. We recommend this approach in an extension to LangChain that we propose for managing document references: langchain-reference.

Compatibility

We have tried, as much as possible, to maintain compatibility with the previous version. This is reflected in preserving the order of parameters and using the default values for each implementation so that the results remain similar. The unit and integration tests for the various parsers have not been modified; they are still valid.

Ideally, we would prefer an interface like:

class XXXLoader(...):
  def __init__(file_path, *, ...):
    ...

but this could break compatibility for positional arguments.

Perhaps it would be feasible to plan a migration for LangChain v1.0 by modifying the default parameters to make them mandatory during the transition to v1.0. At that point, we could reintroduce default values.

Normalisation

The AzureAIDocumentIntelligenceParser class introduces the mode parameter, which accepts the values single, page, and markdown.
The deprecated UnstructuredPDFLoader class introduces the mode parameter, which accepts the values single, paged, and markdown.
Based on this model, we are extending the presence of the mode parameter to most parsers, with the value single, page, and markdown.
paged is declared depreciated.

The different Loader and BlobParser classes now offer the following parameters:

file_path str or PurePath with the file name.
password str with the file password, if needed.
mode to return a single document per file or one document per page (extended with elements in the case of Unstructured or other specific parser).
pages_delimiter to specify how to join pages (\f by default).
extract_images to enable image extraction (already present in most Loaders/Parsers).
images_to_text to specify how to handle images (invoking OCR, LLM, etc.).
extract_tables to allow extraction of tables detected by underlying libraries, for certain parsers.
Other parameters are specific to each parser.

The integration of image texts is now between two paragraphs.

For the images_to_text parameter, we propose three functions:

convert_images_to_text_with_rapidocr()
convert_images_to_text_with_tesseract()
convert_images_to_description()

Here’s how it’s used:

XXXLoader(
  file_path,
  images_to_text=convert_images_to_description(
    model=ChatOpenAI(model="gpt-4o", max_tokens=1024),
    format="markdown")
)

Tables

Some parsers are able to extract arrays, but this is not integrated into langchain. We've added the necessary features to take this into account.

PyMuPDFLoader
PDFPlumberLoader
ZeroxPDFLoader
UnstructuredPDFLoader

Metadata

The different parsers offer a minimum set of common metadata:

source
page
total_page
creationdate
creator
producer
and whatever additional metadata the modules can extract from PDF files.
Dates are converted to ISO 8601 format for easier handling and consistency with other file formats.

All keys are in lowercase.

Tests

We propose matrix tests to validate all parsers compatible with the new approach.

test_standard_parameters()
test_parser_with_table()

To validate all the parsers, we retrieved all the PDF files used by each parser for its own tests, and invoked all the parsers from langchain, along with all these files. This ensures that there are no crashes when parsing a PDF file.

New features of parsers

We resume the modification for each parsers

	metadata	images	table	password	parser	deprecared	lazy_load	lock
PyPDF	✔	✔
PyPDFium2	✔	✔		✔				✔
PyPDFMiner	✔	✔		✔			✔
PyMuPDF	✔	✔	✔	✔				✔
PDFPlumber	✔	✔	✔	✔			✔	✔
OnlinePDF		✔					✔
ZeroxPDF	✔	✔	✔		✔
UnstructuredPDF	✔	✔	✔	✔	✔			✔
UnstructuredPDF						✔
PyPDFDirectory		✔		✔		✔
PagedPDFSplitter						✔
OnlinePDF						✔
UnstructuredPDF	✔	✔	✔	✔	✔			✔

PyPDFMinerLoader: When the extract_images parameter is set to true, the current implementation does not respect the concatenate_pages parameter. It returns multiple pages instead of a single one, as specified by default.

OnlinePDFLoader: This class is a poorly implemented (lacking lazy_load()) wrapper around UnstructuredPDFLoader.

parser: Split the loader and parser. As discussed in the LangChain documentation, it can be useful to decouple analysis logic from loading logic, making it easier to reuse a given analyzer regardless of how the data has been loaded. Where necessary, we have split the two logics.

New loader / parsers

New parsers will be introduced in a separate pull request.

Classes	What
`UnstructuredPDF`	Extend unstructured to be conform to the new specification
`LlamaIndexPDF`	Integration of online LlamaIndex API
`PyMuPDF4LLM`	Integration of PyMuPDF4LLM
`PDFRouter`	Dynamically selects the parser
`DoclingPDF`	Use Docling
`PDFMulti`	Use multiple parser and select the best

For example, with the unification of parsers, it will be possible to choose the parser according to the characteristics of the PDF file.

  routes = [
      ("Microsoft", {"producer": "Microsoft", "creator": "Excel"}, PyMuPDFParser()),
      ("Microsoft", {"producer": "Microsoft", "creator": "Word"}, ZeroxPDFParser()),
      ("LibreOffice", {"producer": "LibreOffice", }, PDFPlumberParser()),
      ("Xdvipdfmx", {"producer": "xdvipdfmx.*", "page1":"Hello"}, PDFPlumberParser()),
      ("default", {}, PyPDFium2Parser())
  ]
  loader = PDFRouterLoader(HELLO_PDF,routes=routes)
  loader.load()

This will be present in other PRs.

Succession of PR

Step	What?
01	Prepare the upcoming PRs
02	PyMUPDF
03	PyPDF
04	PDFMiner
05	PyPDFium2
06	PDFPlumber
07	ZeroxPDF
08	Unstructured
09	how_to
10	deprecated
11	PDFRouter
12	LlamaIndex
...	...

vercel · 2024-12-30T15:35:38Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
langchain	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jan 3, 2025 5:15pm

**Description:** bump gritql dependency, to use new binary names from [here](getgrit/gritql#565) **Issue:** fixes langchain-ai#27822

Co-authored-by: Erick Friis <[email protected]>

…ai#28975) Remove redundant word for improved sentence fluency Co-authored-by: Erick Friis <[email protected]>

@sathesh

Description: Document update. A minor typo is fixed. Install lxml as required. Issue: - Dependencies: - Twitter handle: @sathesh --------- Co-authored-by: Erick Friis <[email protected]>

## Description To integrate ModelScope inference API endpoints for both Embeddings, LLMs and ChatModels, install the package `langchain-modelscope-integration` (as discussed in issue langchain-ai#28928 ). This is necessary because the package name `langchain-modelscope` was already registered by another party. ModelScope is a premier platform designed to connect model checkpoints with model applications. It provides the necessary infrastructure to share open models and promote model-centric development. For more information, visit GitHub page: [ModelScope](https://github.com/modelscope).

Believe the current implementation raises PydanticUserError following [this](https://github.com/pydantic/pydantic/releases/tag/v2.10.1) Pydantic release. Resolves langchain-ai#28989

…ser` (langchain-ai#28959) - **Description:** Fix the `body` keyword argument for AzureAIDocumentIntelligenceParser` - **Issue:** langchain-ai#28948

Description: Add a missing 'has' verb in the Streaming Conceptual Guide.

**Description:** This PR updates the codebase to reflect the deprecation of the AgentType feature. It includes the following changes: Documentation Update: Added a deprecation notice to the AgentType class comment. Provided a reference to the official LangChain migration guide for transitioning to LangGraph agents. Reference Link: https://python.langchain.com/docs/how_to/migrate_agent/ **Twitter handle:** @hrrrriiiishhhhh --------- Co-authored-by: Chester Curme <[email protected]>

…langchain-ai#28984) ### Description - In the example, remove `llama-2-13b-chat`, `mixtral-8x7b-instruct-v0-1`. - Fix llm friendli streaming implementation. - Update examples in documentation and remove duplicates. ### Issue N/A ### Dependencies None ### Twitter handle `@friendliai`

…Template` (langchain-ai#28969) - **Description:** Very small change in Docstring for `BasePromptTemplate` - **Issue:** langchain-ai#28966

accross -> across

) This pull request updates the documentation in `docs/docs/how_to/custom_tools.ipynb` to reflect the recommended approach for generating JSON schemas in Pydantic. Specifically, it replaces instances of the deprecated `schema()` method with the newer and more versatile `model_json_schema()`.

…to `auto` (langchain-ai#28961) - **Description:** `DuckDuckGoSearchAPIWrapper` default value for backend has been changed to avoid User Warning - **Issue:** langchain-ai#28957

This PR is to correct a simple typo in how-to guides section.

Before: ![Screenshot 2025-01-02 at 1 49 30 PM](https://github.com/user-attachments/assets/cb30526a-fc0b-439f-96d1-962c226d9dc7) After: ![Screenshot 2025-01-02 at 1 49 38 PM](https://github.com/user-attachments/assets/32c747ea-6391-4dec-b778-df457695d197)

- In this PR, I have updated the AzureML Endpoint with the latest endpoint. - **Description:** I have changed the existing `/chat/completions` to `/models/chat/completions` in libs/community/langchain_community/llms/azureml_endpoint.py - **Issue:** langchain-ai#25702 --------- Co-authored-by: = <=>

…angchain-ai#28914) This commit updates the documentation and package registry for the FalkorDB Chat Message History integration. **Changes:** - Added a comprehensive example notebook falkordb_chat_message_history.ipynb demonstrating how to use FalkorDB for session-based chat message storage. - Added a provider notebook for FalkorDB - Updated libs/packages.yml to register FalkorDB as an integration package, following LangChain's new guidelines for community integrations. **Notes:** - This update aligns with LangChain's process for registering new integrations via documentation updates and package registry modifications. - No functional or core package changes were made in this commit. --------- Co-authored-by: Chester Curme <[email protected]>

…Demonstration (langchain-ai#28938) ## Description This pull request updates the documentation for FAISS regarding filter construction, following the changes made in commit `df5008f`. ## Issue None. This is a follow-up PR for documentation of [langchain-ai#28207](langchain-ai#28207) ## Dependencies: None. --------- Co-authored-by: Chester Curme <[email protected]>

…in-ai#28902) Problem: "Optional" object is used in one example without importing, which raises the following error when copying the example into IDE or Jupyter Lab ![image](https://github.com/user-attachments/assets/3a6c48cc-937f-4774-979b-b3da64ced247) Solution: Just importing Optional from typing_extensions module, this solves the problem! --------- Co-authored-by: Erick Friis <[email protected]>

@eyurtsev

- **Refactoring PDF loaders step 1**: "community: Refactoring PDF loaders to standardize approaches" - **Description:** Declare CloudBlobLoader in __init__.py. file_path is Union[str, PurePath] anywhere - **Twitter handle:** pprados This is one part of a larger Pull Request (PR) that is too large to be submitted all at once. This specific part focuses to prepare the update of all parsers. For more details, see [PR 28970](#28970). @eyurtsev it's the start of a PR series.

Refactoring all PDF loader and parser

4f4d79e

pprados force-pushed the pprados/pdf_loaders branch 7 times, most recently from ad07a86 to 7cd0649 Compare January 3, 2025 14:28

vercel bot deployed to Preview January 3, 2025 14:43 View deployment

pprados and others added 20 commits January 3, 2025 17:50

Refactoring all PDF loader and parser

3315af6

cli: bump gritql version (langchain-ai#28981)

5ab608a

**Description:** bump gritql dependency, to use new binary names from [here](getgrit/gritql#565) **Issue:** fixes langchain-ai#27822

infra: speed up unit tests (langchain-ai#28974)

2fc4c6b

Co-authored-by: Erick Friis <[email protected]>

docs: Remove redundant word for improved sentence fluency (langchain-…

f13bde7

…ai#28975) Remove redundant word for improved sentence fluency Co-authored-by: Erick Friis <[email protected]>

docs: Minor typo fixed, install necessary pip (langchain-ai#28976)

931229f

Description: Document update. A minor typo is fixed. Install lxml as required. Issue: - Dependencies: - Twitter handle: @sathesh --------- Co-authored-by: Erick Friis <[email protected]>

community[patch]: fix instantiation for Slack tools (langchain-ai#28990)

4f3025d

Believe the current implementation raises PydanticUserError following [this](https://github.com/pydantic/pydantic/releases/tag/v2.10.1) Pydantic release. Resolves langchain-ai#28989

(Community): Fix Keyword argument for `AzureAIDocumentIntelligencePar…

f2f78da

…ser` (langchain-ai#28959) - **Description:** Fix the `body` keyword argument for AzureAIDocumentIntelligenceParser` - **Issue:** langchain-ai#28948

docs: Update streaming.mdx (langchain-ai#28985)

942848e

Description: Add a missing 'has' verb in the Streaming Conceptual Guide.

(Core) Small Change in Docstring for method partial for `BasePrompt…

f9a943f

…Template` (langchain-ai#28969) - **Description:** Very small change in Docstring for `BasePromptTemplate` - **Issue:** langchain-ai#28966

docs: update multi_vector.ipynb (langchain-ai#28954)

4d05358

accross -> across

(Community): DuckDuckGoSearchAPIWrapper backend changed from api …

b5ab049

…to `auto` (langchain-ai#28961) - **Description:** `DuckDuckGoSearchAPIWrapper` default value for backend has been changed to avoid User Warning - **Issue:** langchain-ai#28957

docs: fix typo in how-to guides (langchain-ai#28951)

e9d9327

This PR is to correct a simple typo in how-to guides section.

docs[patch]: fix link (langchain-ai#28994)

f980513

RuofanChen03 and others added 3 commits January 3, 2025 17:50

docs: add stripe toolkit (langchain-ai#28122)

523bd46

pprados force-pushed the pprados/pdf_loaders branch from 7cd0649 to 5b0eba0 Compare January 3, 2025 16:50

Update PyPDF

d9a1b0c

pprados force-pushed the pprados/pdf_loaders branch from fe2e4a7 to d9a1b0c Compare January 3, 2025 17:03

Merge branch 'master' into pprados/pdf_loaders

54f3303

vercel bot deployed to Preview January 3, 2025 17:15 View deployment

This was referenced Jan 7, 2025

Refactoring PDF loaders: 01 prepare #29062

Merged

Refactoring PDF loaders: 02 PyMuPDF #29063

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactoring PDF loaders: all #28970

Refactoring PDF loaders: all #28970

pprados commented Dec 30, 2024 •

edited

Loading

vercel bot commented Dec 30, 2024 •

edited

Loading

Refactoring PDF loaders: all #28970

Are you sure you want to change the base?

Refactoring PDF loaders: all #28970

Conversation

pprados commented Dec 30, 2024 • edited Loading

Refactoring all PDF loader and parser: community

Rational

The PR

Metadata

Images

Tables

Combining Pages

Compatibility

Normalisation

Tables

Metadata

Tests

New features of parsers

New loader / parsers

Succession of PR

vercel bot commented Dec 30, 2024 • edited Loading

pprados commented Dec 30, 2024 •

edited

Loading

vercel bot commented Dec 30, 2024 •

edited

Loading