Update supported file types docx, jpeg, pdf, png, pptx, `svg`, `tiff`, `txt` (#481)
nkmcalli authored Feb 21, 2025
1 parent 5ca642c commit 5864b92
Showing 4 changed files with 42 additions and 20 deletions.
33 changes: 22 additions & 11 deletions README.md
@@ -26,22 +26,33 @@ NVIDIA Ingest enables parallelization of the process of splitting documents into

## Introduction

-### What NVIDIA-Ingest Is ✔️
+## What NVIDIA-Ingest Is ✔️

-A microservice that:
+NV-Ingest is a microservice service that does the following:

-- Accepts a JSON Job description, containing a document payload, and a set of ingestion tasks to perform on that payload.
-- Allows the results of a Job to be retrieved; the result is a JSON dictionary containing a list of Metadata describing objects extracted from the base document, as well as processing annotations and timing/trace data.
-- Supports PDF, Docx, pptx, and images.
-- Supports multiple methods of extraction for each document type in order to balance trade-offs between throughput and accuracy. For example, for PDF documents we support extraction via pdfium, Unstructured.io, and Adobe Content Extraction Services.
-- Supports various types of pre and post processing operations, including text splitting and chunking; transform, and filtering; embedding generation, and image offloading to storage.
+- Accept a JSON job description, containing a document payload, and a set of ingestion tasks to perform on that payload.
+- Allow the results of a job to be retrieved. The result is a JSON dictionary that contains a list of metadata describing objects extracted from the base document, and processing annotations and timing/trace data.
+- Support multiple methods of extraction for each document type to balance trade-offs between throughput and accuracy. For example, for .pdf documents, we support extraction through pdfium, Unstructured.io, and Adobe Content Extraction Services.
+- Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.

-### What NVIDIA-Ingest Is Not ✖️
+NV-Ingest supports the following file types:

-A service that:
+- `docx`
+- `jpeg`
+- `pdf`
+- `png`
+- `pptx`
+- `svg`
+- `tiff`
+- `txt`

-- Runs a static pipeline or fixed set of operations on every submitted document.
-- Acts as a wrapper for any specific document parsing library.

+## What NVIDIA-Ingest Isn't ✖️

+NV-Ingest does not do the following:

+- Run a static pipeline or fixed set of operations on every submitted document.
+- Act as a wrapper for any specific document parsing library.


## Prerequisites
2 changes: 1 addition & 1 deletion client/src/nv_ingest_client/nv_ingest_cli.py
@@ -147,7 +147,7 @@
- extract: Extracts content from documents, customizable per document type.
Can be specified multiple times for different 'document_type' values.
Options:
-- document_type (str): Document format ('pdf', 'docx', 'pptx', 'html', 'xml', 'excel', 'csv', 'parquet'). Required.
+- document_type (str): Document format (`docx`, `jpeg`, `pdf`, `png`, `pptx`, `svg`, `tiff`, `txt`). Required.
- extract_charts (bool): Enables chart extraction. Default: False.
- extract_images (bool): Enables image extraction. Default: False.
- extract_method (str): Extraction technique. Defaults are smartly chosen based on 'document_type'.
2 changes: 1 addition & 1 deletion docs/docs/user-guide/developer-guide/nv-ingest_cli.md
@@ -51,7 +51,7 @@ Options:
- extract: Extracts content from documents, customizable per document type.
Can be specified multiple times for different 'document_type' values.
Options:
-- document_type (str): Document format ('pdf', 'docx', 'pptx', 'html', 'xml', 'excel', 'csv', 'parquet'). Required.
+- document_type (str): Document format (`docx`, `jpeg`, `pdf`, `png`, `pptx`, `svg`, `tiff`, `txt`). Required.
- text_depth (str): Depth at which text parsing occurs ('document', 'page'), additional text_depths are partially supported and depend on the specified extraction method ('block', 'line', 'span')
- extract_method (str): Extraction technique. Defaults are smartly chosen based on 'document_type'.
- extract_text (bool): Enables text extraction. Default: False.
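
For illustration, the `extract` options documented above can be assembled into a single task string. This is a minimal sketch, not the CLI's own code: the helper name `extract_task_arg` is hypothetical, and the `extract:{json}` string form is an assumption about how the CLI parses task arguments, so check the CLI help for the exact flag and quoting. Only the option names (`document_type`, `extract_text`, `text_depth`) come from the docstring in this diff.

```python
import json


def extract_task_arg(document_type: str, **options) -> str:
    """Build an 'extract' task string from the documented options.

    Hypothetical helper: the 'extract:{json}' layout is an assumption,
    not something this commit confirms.
    """
    # document_type is marked "Required" in the docstring; the other
    # options (extract_text, text_depth, ...) are passed through as given.
    payload = {"document_type": document_type, **options}
    # sort_keys keeps the output stable regardless of keyword order.
    return "extract:" + json.dumps(payload, sort_keys=True)
```

Because the docstring says `extract` "can be specified multiple times for different 'document_type' values", one such string would be built per document type, for example one for `pdf` and another for `docx`.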
25 changes: 18 additions & 7 deletions docs/docs/user-guide/index.md
@@ -18,15 +18,26 @@ and optionally manage storing into a vector database [Milvus](https://milvus.io/

NV-Ingest is a microservice service that does the following:

-- Accepts a JSON Job description, containing a document payload, and a set of ingestion tasks to perform on that payload.
-- Allows the results of a Job to be retrieved; the result is a JSON dictionary containing a list of Metadata describing objects extracted from the base document, and processing annotations and timing/trace data.
-- Supports .pdf, .docx, .pptx, and images.
-- Supports multiple methods of extraction for each document type to balance trade-offs between throughput and accuracy. For example, for PDF documents, we support extraction through pdfium, Unstructured.io, and Adobe Content Extraction Services.
-- Supports various types of pre and post processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.
+- Accept a JSON job description, containing a document payload, and a set of ingestion tasks to perform on that payload.
+- Allow the results of a job to be retrieved. The result is a JSON dictionary that contains a list of metadata describing objects extracted from the base document, and processing annotations and timing/trace data.
+- Support multiple methods of extraction for each document type to balance trade-offs between throughput and accuracy. For example, for .pdf documents, we support extraction through pdfium, Unstructured.io, and Adobe Content Extraction Services.
+- Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.

+NV-Ingest supports the following file types:

+- `docx`
+- `jpeg`
+- `pdf`
+- `png`
+- `pptx`
+- `svg`
+- `tiff`
+- `txt`


## What NVIDIA-Ingest Isn't ✖️

NV-Ingest does not do the following:

-- Runs a static pipeline or fixed set of operations on every submitted document.
-- Acts as a wrapper for any specific document parsing library.
+- Run a static pipeline or fixed set of operations on every submitted document.
+- Act as a wrapper for any specific document parsing library.
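
As a small illustration of the file-type list this commit documents, a client could check a path against it before submitting a job. This is a sketch under stated assumptions: `SUPPORTED_FILE_TYPES` and `document_type_for` are hypothetical names, not part of NV-Ingest, and matching on the lowercase file extension is an assumption about how a caller might derive `document_type`.

```python
from pathlib import Path

# File types listed as supported in this commit's README and user guide.
SUPPORTED_FILE_TYPES = frozenset(
    {"docx", "jpeg", "pdf", "png", "pptx", "svg", "tiff", "txt"}
)


def document_type_for(path: str) -> str:
    """Return the lowercase extension of `path` if it is a listed type.

    Hypothetical helper for illustration only.
    """
    # Path.suffix includes the leading dot (".pdf"); strip it and normalize case.
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_FILE_TYPES:
        raise ValueError(f"unsupported file type: {ext or '(none)'}")
    return ext
```

Types dropped from the CLI docstring in this commit, such as `csv` or `html`, would be rejected by this check.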
