Update supported file types docx, jpeg, pdf, png, pptx, `svg`, `tiff`, `txt` (#481)
NVIDIA · Feb 21, 2025 · 5864b92
1 parent 5ca642c
commit 5864b92
Showing 4 changed files with 42 additions and 20 deletions.
diff --git a/README.md b/README.md
@@ -26,22 +26,33 @@ NVIDIA Ingest enables parallelization of the process of splitting documents into
 
 ## Introduction
 
-### What NVIDIA-Ingest Is ✔️
+## What NVIDIA-Ingest Is ✔️
 
-A microservice that:
+NV-Ingest is a microservice service that does the following:
 
-- Accepts a JSON Job description, containing a document payload, and a set of ingestion tasks to perform on that payload.
-- Allows the results of a Job to be retrieved; the result is a JSON dictionary containing a list of Metadata describing objects extracted from the base document, as well as processing annotations and timing/trace data.
-- Supports PDF, Docx, pptx, and images.
-- Supports multiple methods of extraction for each document type in order to balance trade-offs between throughput and accuracy. For example, for PDF documents we support extraction via pdfium, Unstructured.io, and Adobe Content Extraction Services.
-- Supports various types of pre and post processing operations, including text splitting and chunking; transform, and filtering; embedding generation, and image offloading to storage.
+- Accept a JSON job description, containing a document payload, and a set of ingestion tasks to perform on that payload.
+- Allow the results of a job to be retrieved. The result is a JSON dictionary that contains a list of metadata describing objects extracted from the base document, and processing annotations and timing/trace data.
+- Support multiple methods of extraction for each document type to balance trade-offs between throughput and accuracy. For example, for .pdf documents, we support extraction through pdfium, Unstructured.io, and Adobe Content Extraction Services.
+- Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.
 
-### What NVIDIA-Ingest Is Not ✖️
+NV-Ingest supports the following file types:
 
-A service that:
+- `docx`
+- `jpeg`
+- `pdf`
+- `png`
+- `pptx`
+- `svg`
+- `tiff`
+- `txt`
 
-- Runs a static pipeline or fixed set of operations on every submitted document.
-- Acts as a wrapper for any specific document parsing library.
+
+## What NVIDIA-Ingest Isn't ✖️
+
+NV-Ingest does not do the following:
+
+- Run a static pipeline or fixed set of operations on every submitted document.
+- Act as a wrapper for any specific document parsing library.
 
 
 ## Prerequisites

diff --git a/client/src/nv_ingest_client/nv_ingest_cli.py b/client/src/nv_ingest_client/nv_ingest_cli.py
@@ -147,7 +147,7 @@
 - extract: Extracts content from documents, customizable per document type.
     Can be specified multiple times for different 'document_type' values.
     Options:
-    - document_type (str): Document format ('pdf', 'docx', 'pptx', 'html', 'xml', 'excel', 'csv', 'parquet'). Required.
+    - document_type (str): Document format (`docx`, `jpeg`, `pdf`, `png`, `pptx`, `svg`, `tiff`, `txt`). Required.
     - extract_charts (bool): Enables chart extraction. Default: False.
     - extract_images (bool): Enables image extraction. Default: False.
     - extract_method (str): Extraction technique. Defaults are smartly chosen based on 'document_type'.

diff --git a/docs/docs/user-guide/developer-guide/nv-ingest_cli.md b/docs/docs/user-guide/developer-guide/nv-ingest_cli.md
@@ -51,7 +51,7 @@ Options:
                                   - extract: Extracts content from documents, customizable per document type.
                                       Can be specified multiple times for different 'document_type' values.
                                       Options:
-                                      - document_type (str): Document format ('pdf', 'docx', 'pptx', 'html', 'xml', 'excel', 'csv', 'parquet'). Required.
+                                      - document_type (str): Document format (`docx`, `jpeg`, `pdf`, `png`, `pptx`, `svg`, `tiff`, `txt`). Required.
                                       - text_depth (str): Depth at which text parsing occurs ('document', 'page'), additional text_depths are partially supported and depend on the specified extraction method ('block', 'line', 'span')
                                       - extract_method (str): Extraction technique. Defaults are smartly chosen based on 'document_type'.
                                       - extract_text (bool): Enables text extraction. Default: False.

diff --git a/docs/docs/user-guide/index.md b/docs/docs/user-guide/index.md
@@ -18,15 +18,26 @@ and optionally manage storing into a vector database [Milvus](https://milvus.io/
 
 NV-Ingest is a microservice service that does the following:
 
-- Accepts a JSON Job description, containing a document payload, and a set of ingestion tasks to perform on that payload.
-- Allows the results of a Job to be retrieved; the result is a JSON dictionary containing a list of Metadata describing objects extracted from the base document, and processing annotations and timing/trace data.
-- Supports .pdf, .docx, .pptx, and images.
-- Supports multiple methods of extraction for each document type to balance trade-offs between throughput and accuracy. For example, for PDF documents, we support extraction through pdfium, Unstructured.io, and Adobe Content Extraction Services.
-- Supports various types of pre and post processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.
+- Accept a JSON job description, containing a document payload, and a set of ingestion tasks to perform on that payload.
+- Allow the results of a job to be retrieved. The result is a JSON dictionary that contains a list of metadata describing objects extracted from the base document, and processing annotations and timing/trace data.
+- Support multiple methods of extraction for each document type to balance trade-offs between throughput and accuracy. For example, for .pdf documents, we support extraction through pdfium, Unstructured.io, and Adobe Content Extraction Services.
+- Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage.
+
+NV-Ingest supports the following file types:
+
+- `docx`
+- `jpeg`
+- `pdf`
+- `png`
+- `pptx`
+- `svg`
+- `tiff`
+- `txt`
+
 
 ## What NVIDIA-Ingest Isn't ✖️
 
 NV-Ingest does not do the following:
 
-- Runs a static pipeline or fixed set of operations on every submitted document.
-- Acts as a wrapper for any specific document parsing library.
+- Run a static pipeline or fixed set of operations on every submitted document.
+- Act as a wrapper for any specific document parsing library.
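
For reviewers who want to sanity-check the new `document_type` list, here is a minimal sketch of how a caller might pick a `document_type` from a file name before building the options for an `extract` task. It is not part of `nv_ingest_client`: the helper names `infer_document_type` and `build_extract_options` are hypothetical, and it assumes the file extension matches one of the eight listed types exactly (no aliases such as `jpg`). Only the option names `document_type` and `extract_text` come from the CLI help text in this diff.

```python
import json
from pathlib import Path

# File types listed as supported in this commit.
SUPPORTED_DOCUMENT_TYPES = ("docx", "jpeg", "pdf", "png", "pptx", "svg", "tiff", "txt")


def infer_document_type(path: str) -> str:
    """Hypothetical helper: return the lowercase extension if it is a listed type."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_DOCUMENT_TYPES:
        raise ValueError(
            f"Unsupported document type {ext!r}; expected one of {SUPPORTED_DOCUMENT_TYPES}"
        )
    return ext


def build_extract_options(path: str, extract_text: bool = True) -> dict:
    """Hypothetical helper: assemble options for an 'extract' task.

    The keys mirror option names from the CLI help text; how the options
    are passed to the CLI or client is intentionally left out.
    """
    return {
        "document_type": infer_document_type(path),
        "extract_text": extract_text,
    }


if __name__ == "__main__":
    print(json.dumps(build_extract_options("report.PDF")))
```

Under these assumptions, a file such as `data.csv` raises `ValueError`, because `csv` is no longer in the documented list after this commit.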