Use Yolox-table-structure to extract tables as markdown #444
With a443502, users must manually specify the image name and tag in the .env file and start the service using docker compose with a profile:

docker compose --profile yolox-table-structure up --build

To enable this extraction mode in the client, users must explicitly set paddle_output_format="markdown":

ingestor = (
    Ingestor(
        message_client_hostname="nv-ingest-ms-runtime",
        message_client_kwargs={"max_retries": 10},
    )
    .files(files)
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=False,
        text_depth="page",
        paddle_output_format="markdown",  # default is "pseudo_markdown" until table structure NIM release
    )
)

raise

return results
# Ensure both clients returned lists of results matching the number of input images.
if not (isinstance(yolox_results, list) and isinstance(paddle_results, list)):
Do we want this to be a full failure of the job? It indicates a lack of agreement between one yolox model and another, but it's not clear that it should cause document processing to fail.
How about something like 34d7ce1?
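The contents of 34d7ce1 are not shown in this thread. As a rough sketch of the softer handling the question above points toward, a mismatch could be logged and degraded to empty per-image results instead of raising. The helper name `reconcile_results`, the `num_images` parameter, and the `None` fallback are all hypothetical and are not taken from this PR:

```python
import logging
from typing import Any, List, Optional, Tuple

logger = logging.getLogger(__name__)


def reconcile_results(
    yolox_results: Any,
    paddle_results: Any,
    num_images: int,
) -> List[Tuple[Optional[Any], Optional[Any]]]:
    """Pair per-image results, falling back to None instead of raising.

    Hypothetical helper for illustration only; not the code from this PR.
    """

    def normalize(results: Any, name: str) -> List[Optional[Any]]:
        # Accept only a list with exactly one entry per input image.
        if not isinstance(results, list) or len(results) != num_images:
            logger.warning(
                "%s returned unexpected results (expected a list of %d); "
                "skipping its output for this batch.",
                name,
                num_images,
            )
            return [None] * num_images
        return results

    yolox = normalize(yolox_results, "yolox")
    paddle = normalize(paddle_results, "paddle")
    return list(zip(yolox, paddle))
```

With this shape, a caller can still emit the text and any tables that did succeed, and treat a `None` entry as "no table structure available" for that image.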
…ingest into edwardk/yolox-table-structure
Description

This PR introduces yolox-table-structure as an optional extraction mode. This new NIM allows us to extract tables in markdown format by detecting the bounding boxes of each table cell.

Usage

Since the new NIM is not yet public, users must manually specify the image name and tag in the .env file. Additionally, the service must be started using docker compose with a profile, e.g., docker compose --profile yolox-table-structure up --build

To enable this extraction mode in the client, users must explicitly set paddle_output_format="markdown" in the .extract() call.

Notes

The default paddle_output_format remains "pseudo_markdown" until the table structure NIM is officially released.
Ensure the .env file contains the correct image name and tag before running the extraction.

Checklist