Use Yolox-table-structure to extract tables as markdown #444

Merged
edknv merged 16 commits into main from edwardk/yolox-table-structure
Feb 19, 2025

Conversation

Collaborator

@edknv edknv commented Feb 14, 2025

Description

This PR introduces yolox-table-structure as an optional extraction mode. This new NIM allows us to extract tables in markdown format by detecting the bounding boxes of each table cell.
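
As a rough illustration of how cell bounding boxes can be turned into a markdown table, here is a minimal sketch. It is not the code in this PR: the `Cell` tuple layout, the `cells_to_markdown` name, and the `row_tol` threshold are all assumptions made for the example, and it ignores merged cells.

```python
from typing import List, Tuple

# Hypothetical cell layout: (x1, y1, x2, y2, text). Not the NIM's actual output schema.
Cell = Tuple[float, float, float, float, str]


def cells_to_markdown(cells: List[Cell], row_tol: float = 10.0) -> str:
    """Group cells into rows by vertical center, then emit a markdown table."""
    if not cells:
        return ""
    # Sort by vertical center so cells belonging to the same row become adjacent.
    ordered = sorted(cells, key=lambda c: (c[1] + c[3]) / 2)
    rows: List[List[Cell]] = [[ordered[0]]]
    for cell in ordered[1:]:
        prev = rows[-1][-1]
        prev_center = (prev[1] + prev[3]) / 2
        center = (cell[1] + cell[3]) / 2
        if abs(center - prev_center) <= row_tol:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    # Within each row, order cells left to right and keep only their text.
    table = [[c[4] for c in sorted(row, key=lambda c: c[0])] for row in rows]
    # Pad short rows so every row has the same number of columns.
    width = max(len(r) for r in table)
    table = [r + [""] * (width - len(r)) for r in table]
    lines = ["| " + " | ".join(r) + " |" for r in table]
    # Treat the first row as the header and add the markdown separator after it.
    lines.insert(1, "| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)
```

Under these assumptions, four cells forming a 2x2 grid come out as a two-row markdown table with the first row as the header, regardless of the order in which the cells were detected.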

Usage

Since the new NIM is not yet public, users must manually specify the image name and tag in the .env file. Additionally, the service must be started using docker compose with a profile, e.g.,

docker compose --profile yolox-table-structure up --build

To enable this extraction mode in the client, users must explicitly set paddle_output_format="markdown", e.g.,

ingestor = (
    Ingestor(
        message_client_hostname="nv-ingest-ms-runtime",
        message_client_kwargs={"max_retries": 10},
    )
    .files(files)
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=False,
        text_depth="page",
        paddle_output_format="markdown",  # default is "pseudo_markdown" until the table structure NIM is released
    )
)

Notes

  • The default paddle_output_format remains "pseudo_markdown" until the table structure NIM is officially released.
  • Ensure the .env file contains the correct image name and tag before running the extraction.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Collaborator Author

edknv commented Feb 18, 2025

With a443502, users must manually specify the image name and tag in the .env file. Additionally, the service must be started using docker compose with a profile, e.g.,

docker compose --profile yolox-table-structure up --build

To enable this extraction mode in the client, users must explicitly set paddle_output_format="markdown", e.g.,

ingestor = (
    Ingestor(
        message_client_hostname="nv-ingest-ms-runtime",
        message_client_kwargs={"max_retries": 10},
    )
    .files(files)
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=False,
        text_depth="page",
        paddle_output_format="markdown",  # default is "pseudo_markdown" until the table structure NIM is released
    )
)

@edknv edknv marked this pull request as ready for review February 18, 2025 18:17
@edknv edknv requested a review from a team as a code owner February 18, 2025 18:17
@edknv edknv requested review from jperez999, randerzander, jdye64 and drobison00 and removed request for a team February 18, 2025 18:17
raise

return results
# Ensure both clients returned lists of results matching the number of input images.
if not (isinstance(yolox_results, list) and isinstance(paddle_results, list)):
Collaborator

Do we want this to be a full failure of the job? It indicates a lack of agreement between one yolox model and another, but it's not clear that it should cause document processing to fail.

Collaborator Author

How about something like 34d7ce1?
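
For readers following the thread, one way to avoid failing the whole job on such a mismatch is to log a warning and fall back to empty per-image results. The sketch below is purely illustrative and is not the contents of 34d7ce1; `reconcile_results` and its parameters are hypothetical names chosen for the example.

```python
import logging
from typing import Any, List, Optional, Tuple

logger = logging.getLogger(__name__)


def reconcile_results(
    yolox_results: Any, paddle_results: Any, num_images: int
) -> List[Optional[Tuple[Any, Any]]]:
    """Pair per-image results, or return placeholders instead of raising."""

    def usable(results: Any) -> bool:
        # A usable result is a list with exactly one entry per input image.
        return isinstance(results, list) and len(results) == num_images

    if not (usable(yolox_results) and usable(paddle_results)):
        # Hypothetical policy: warn and skip table content rather than fail the document.
        logger.warning(
            "Table inference results do not match %d input images; skipping table content.",
            num_images,
        )
        return [None] * num_images
    return list(zip(yolox_results, paddle_results))
```

With this policy a caller would treat a `None` entry as "no table content extracted" for that image and continue processing the rest of the document.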

@edknv edknv merged commit 50ebc0a into main Feb 19, 2025
2 of 3 checks passed
@edknv edknv deleted the edwardk/yolox-table-structure branch February 19, 2025 04:43