Use Yolox-table-structure to extract tables as markdown #444

edknv · 2025-02-14T00:07:11Z

Description

This PR introduces yolox-table-structure as an optional extraction mode. This new NIM allows us to extract tables in markdown format by detecting the bounding boxes of each table cell.
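To illustrate the idea of going from detected cell bounding boxes to markdown, here is a minimal sketch. It is not the NIM's or this PR's actual postprocessing: the `Cell` tuple layout, the `cells_to_markdown` helper, and the `row_tol` threshold are all hypothetical, and it assumes each cell already has its text (for example from OCR) and that the table has no merged cells.

```python
from typing import List, Tuple

# Hypothetical cell record: (x1, y1, x2, y2, text). This is an assumption for
# illustration, not the output schema of yolox-table-structure.
Cell = Tuple[float, float, float, float, str]


def cells_to_markdown(cells: List[Cell], row_tol: float = 10.0) -> str:
    """Group cells into rows by vertical center, then emit a markdown table."""
    if not cells:
        return ""

    def y_center(cell: Cell) -> float:
        return (cell[1] + cell[3]) / 2

    # Sort top to bottom so cells belonging to the same row end up adjacent.
    ordered = sorted(cells, key=y_center)
    rows: List[List[Cell]] = [[ordered[0]]]
    for cell in ordered[1:]:
        # A cell joins the current row if its center is close to the previous one.
        if abs(y_center(cell) - y_center(rows[-1][-1])) <= row_tol:
            rows[-1].append(cell)
        else:
            rows.append([cell])

    width = max(len(row) for row in rows)
    lines = []
    for i, row in enumerate(rows):
        # Order each row left to right, escape pipes, and pad short rows.
        texts = [c[4].strip().replace("|", "\\|") for c in sorted(row, key=lambda c: c[0])]
        texts += [""] * (width - len(texts))
        lines.append("| " + " | ".join(texts) + " |")
        if i == 0:
            # The first row is treated as the header.
            lines.append("|" + " --- |" * width)
    return "\n".join(lines)


example = [
    (0, 0, 50, 20, "Name"),
    (60, 0, 110, 20, "Qty"),
    (0, 30, 50, 50, "apple"),
    (60, 31, 110, 51, "3"),
]
print(cells_to_markdown(example))
```

Grouping by vertical center with a tolerance is only one simple heuristic; real tables with spanning cells or skewed scans would need something more robust.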

Usage

Since the new NIM is not yet public, users must manually specify the image name and tag in the .env file. Additionally, the service must be started using docker compose with a profile, e.g.,

docker compose --profile yolox-table-structure up --build

To enable this extraction mode in the client, users must explicitly set paddle_output_format="markdown", e.g.,

ingestor = (
    Ingestor(
        message_client_hostname="nv-ingest-ms-runtime",
        message_client_kwargs={"max_retries": 10},
    )
    .files(files)
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=False,
        text_depth="page",
        paddle_output_format="markdown",  # default is "pseudo_markdown" until the table structure NIM is released
    )
)

Notes

The default paddle_output_format remains "pseudo_markdown" until the table structure NIM is officially released.
Ensure the .env file contains the correct image name and tag before running the extraction.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

edknv · 2025-02-18T17:46:09Z

With a443502, users must manually specify the image name and tag in the .env file. Additionally, the service must be started using docker compose with a profile, e.g.,

docker compose --profile yolox-table-structure up --build

To enable this extraction mode in the client, users must explicitly set paddle_output_format="markdown", e.g.,

ingestor = (
    Ingestor(
        message_client_hostname="nv-ingest-ms-runtime",
        message_client_kwargs={"max_retries": 10},
    )
    .files(files)
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=False,
        text_depth="page",
        paddle_output_format="markdown",  # default is "pseudo_markdown" until the table structure NIM is released
    )
)

drobison00 · 2025-02-18T19:46:06Z

src/nv_ingest/stages/nim/table_extraction.py

            raise

-    return results
+    # Ensure both clients returned lists of results matching the number of input images.
+    if not (isinstance(yolox_results, list) and isinstance(paddle_results, list)):


Do we want this to be a full failure of the job? It indicates a lack of agreement between one yolox model and another, but it's not clear that it should cause document processing to fail.

How about something like 34d7ce1?
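For readers following the thread, one way to avoid failing the whole job is to log the problem and substitute placeholders for the affected results. The sketch below only illustrates that pattern; it is not the content of 34d7ce1, and `align_results` and its parameters are hypothetical names.

```python
import logging
from typing import Any, List, Optional

logger = logging.getLogger(__name__)


def align_results(results: Any, expected: int, name: str) -> List[Optional[Any]]:
    """Return `results` if it is a list of the expected length, else placeholders.

    Hypothetical helper: instead of raising when a model client returns a
    malformed response, log a warning and return one None per input image so
    the rest of the document can still be processed.
    """
    if isinstance(results, list) and len(results) == expected:
        return results
    logger.warning(
        "%s returned an unexpected result (expected a list of %d items); skipping its output",
        name,
        expected,
    )
    return [None] * expected


# Usage: downstream code can skip tables whose result is None.
yolox_results = align_results(["cells-0", "cells-1"], expected=2, name="yolox")
paddle_results = align_results("not-a-list", expected=2, name="paddle")
for cells, text in zip(yolox_results, paddle_results):
    if cells is None or text is None:
        continue  # leave this table's content empty rather than failing the job
```

The trade-off is that malformed responses become silent gaps in the output unless the warning is monitored.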

edknv and others added 10 commits February 13, 2025 16:05

Use Yolox-table-structure to extract tables as markdown

a43680a

Merge branch 'main' into edwardk/yolox-table-structure

587b999

all 496 pages in bo20 processed with no errors

dff3fda

use the correct bbox postprocessing function

63d221f

make default table_content_format markdown

9a481e2

rename paddle_output_format to table_content_format

a565af2

minor fix for response length check

c05fae5

Revert "rename paddle_output_format to table_content_format"

50d1495

This reverts commit a565af2.

Revert "make default table_content_format markdown"

50c4008

This reverts commit 9a481e2.

make markdown mode optional with a new compose profile and pseudo-mar…

a443502

…kdown as default

edknv and others added 2 commits February 18, 2025 09:47

make pseudo-markdown the default in pdfium_helper

a500ef5

Merge branch 'main' into edwardk/yolox-table-structure

cda63ae

edknv marked this pull request as ready for review February 18, 2025 18:17

edknv requested a review from a team as a code owner February 18, 2025 18:17

edknv requested review from jperez999, randerzander, jdye64 and drobison00 and removed request for a team February 18, 2025 18:17

edknv and others added 2 commits February 18, 2025 10:25

Merge branch 'main' into edwardk/yolox-table-structure

e5616e7

Merge branch 'main' into edwardk/yolox-table-structure

6813687

drobison00 reviewed Feb 18, 2025

edknv added 2 commits February 18, 2025 15:51

don't fail job with unpexpected results from model

34d7ce1

Merge branch 'edwardk/yolox-table-structure' of github.com:NVIDIA/nv-…

825391e

…ingest into edwardk/yolox-table-structure

edknv merged commit 50ebc0a into main Feb 19, 2025
2 of 3 checks passed

edknv deleted the edwardk/yolox-table-structure branch February 19, 2025 04:43
