diff --git a/.devcontainer/README.md b/.devcontainer/README.md index 145fe821..c9116312 100644 --- a/.devcontainer/README.md +++ b/.devcontainer/README.md @@ -19,6 +19,9 @@ limitations under the License. The nv-ingest devcontainer is provided as a quick-to-set-up development and exploration environment for use with [Visual Studio Code](https://code.visualstudio.com) (Code). The devcontainer is a lightweight container which mounts-in a Conda environment with cached packages, alleviating long Conda download times on subsequent launches. It provides a simple framework for adding developer-centric [scripts](#development-scripts), and incorporates some helpful Code plugins. +> [!Note] +> NV-Ingest is also known as NVIDIA Ingest and NeMo Retriever Extraction. + More information about devcontainers can be found at [`containers.dev`](https://containers.dev/). ## Getting Started diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 7d28bd94..e2f84477 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -80,7 +80,7 @@ git submodule update --init --recursive ** [Create a pull request](https://github.com/NVIDIA/nv-ingest/pulls) once your code is ready. 5. **Code Review:** Wait for the review by other developers and make necessary updates. -6. **Merge:** Once approved, an NV-Ingest developer will approve your pull request. +6. **Merge:** After approval, an NVIDIA developer will merge your pull request. ### Seasoned Developers diff --git a/README.md b/README.md index caec04bc..71ebd0ae 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,9 @@ SPDX-License-Identifier: Apache-2.0 NVIDIA-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. Including support for parsing PDFs, Word and PowerPoint documents, it uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.
+> [!Note] +> NVIDIA Ingest is also known as NV-Ingest and NeMo Retriever Extraction. + NVIDIA Ingest enables parallelization of the process of splitting documents into pages where contents are classified (as tables, charts, images, text), extracted into discrete content, and further contextualized via optical character recognition (OCR) into a well defined JSON schema. From there, NVIDIA Ingest can optionally manage computation of embeddings for the extracted content, and also optionally manage storing into a vector database [Milvus](https://milvus.io/). > [!Note] @@ -37,11 +40,11 @@ NV-Ingest is a microservice service that does the following: NV-Ingest supports the following file types: +- `pdf` - `docx` +- `pptx` - `jpeg` -- `pdf` - `png` -- `pptx` - `svg` - `tiff` - `txt` @@ -84,7 +87,7 @@ To get started using NVIDIA Ingest, you need to do a few things: 4. [Inspect and consume results](#step-4-inspecting-and-consuming-results) 🔍 Optional: -1. [Direct Library Deployment](docs/docs/user-guide/developer-guide/deployment.md) 📦 +1. [Direct Library Deployment](docs/docs/user-guide/deployment.md) 📦 ### Step 1: Starting containers @@ -93,14 +96,14 @@ This example demonstrates how to use the provided [docker-compose.yaml](docker-c > [!IMPORTANT] > NIM containers on their first startup can take 10-15 minutes to pull and fully load models. -If you prefer, you can also [start services one by one](docs/docs/user-guide/developer-guide/deployment.md), or run on Kubernetes via [our Helm chart](helm/README.md). Also of note are [additional environment variables](docs/docs/user-guide/developer-guide/environment-config.md) you may wish to configure. +If you prefer, you can also [start services one by one](docs/docs/user-guide/deployment.md), or run on Kubernetes via [our Helm chart](helm/README.md). Also of note are [additional environment variables](docs/docs/user-guide/environment-config.md) you may wish to configure. 1. 
Git clone the repo: `git clone https://github.com/nvidia/nv-ingest` 2. Change directory to the cloned repo `cd nv-ingest`. -3. [Generate API keys](docs/docs/user-guide/developer-guide/ngc-api-key.md) and authenticate with NGC with the `docker login` command: +3. [Generate API keys](docs/docs/user-guide/ngc-api-key.md) and authenticate with NGC with the `docker login` command: ```shell # This is required to access pre-built containers and NIM microservices $ docker login nvcr.io @@ -112,7 +115,7 @@ Password: > During the early access (EA) phase, you must apply for early access here: https://developer.nvidia.com/nemo-microservices-early-access/join. > When your early access is approved, follow the instructions in the email to create an organization and team, link your profile, and generate your NGC API key. -4. Create a .env file that contains your NGC API keys. For more information, refer to [Environment Configuration Variables](docs/docs/user-guide/developer-guide/environment-config.md). +4. Create a .env file that contains your NGC API keys. For more information, refer to [Environment Configuration Variables](docs/docs/user-guide/environment-config.md). ``` # Container images must access resources from NGC. @@ -132,7 +135,7 @@ NVIDIA_BUILD_API_KEY= > `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default` > [!NOTE] -> The most accurate tokenizer based splitting depends on the [llama-3.2 tokenizer](https://huggingface.co/meta-llama/Llama-3.2-1B). To download this model at container build time, you must set `DOWNLOAD_LLAMA_TOKENIZER=True` _and_ supply an authorized HuggingFace access token via `HF_ACCESS_TOKEN=`. If not, the ungated [e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) tokenizer model will be downloaded instead. By default, the split task will use whichever model has been predownloaded. Refer to [Environment Configuration Variables](docs/docs/user-guide/developer-guide/environment-config.md) for more info. 
+> The most accurate tokenizer-based splitting depends on the [llama-3.2 tokenizer](https://huggingface.co/meta-llama/Llama-3.2-1B). To download this model at container build time, you must set `DOWNLOAD_LLAMA_TOKENIZER=True` _and_ supply an authorized HuggingFace access token via `HF_ACCESS_TOKEN=`. If not, the ungated [e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) tokenizer model will be downloaded instead. By default, the split task will use whichever model has been predownloaded. Refer to [Environment Configuration Variables](docs/docs/user-guide/environment-config.md) for more info. 5. Start all services: `docker compose --profile retrieval up` @@ -188,7 +191,7 @@ ac27e5297d57 prom/prometheus:latest > > After the image builds, run `docker compose --profile retrieval up` or `docker compose up --build` as explained in the previous step. -### Step 2: Installing Python dependencies +### Step 2: Install Python dependencies To interact with the nv-ingest service, you can do so from the host, or by `docker exec`-ing into the nv-ingest container. @@ -225,7 +228,7 @@ pip install . ### Step 3: Ingesting Documents -You can submit jobs programmatically in Python or via the nv-ingest-cli tool. +You can submit jobs programmatically in Python or via the [NV-Ingest CLI](docs/docs/user-guide/nv-ingest_cli.md). In the below examples, we are doing text, chart, table, and image extraction: @@ -359,7 +362,7 @@ multimodal_test.pdf.metadata.json processed_docs/text: multimodal_test.pdf.metadata.json ``` -For the full metadata definitions, refer to [Content Metadata](/docs/docs/user-guide/developer-guide/content-metadata.md).
#### We also provide a script for inspecting [extracted images](src/util/image_viewer.py) diff --git a/client/README.md b/client/README.md index adfb6627..0e24921a 100644 --- a/client/README.md +++ b/client/README.md @@ -8,6 +8,10 @@ SPDX-License-Identifier: Apache-2.0 NV-Ingest-Client is a tool designed for efficient ingestion and processing of large datasets. It provides both a Python API and a command-line interface to cater to various ingestion needs. +> [!Note] +> NV-Ingest is also known as NVIDIA Ingest and NeMo Retriever Extraction. + + ## Table of Contents 1. [Installation](#installation) diff --git a/docs/Makefile b/docs/Makefile index fb5dd63c..46ebf474 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -5,7 +5,7 @@ # Define paths SPHINX_BUILD_DIR=sphinx_docs/build SPHINX_SOURCE_DIR=sphinx_docs/source -SPHINX_OUTPUT_DIR=docs/user-guide/developer-guide/api-docs +SPHINX_OUTPUT_DIR=docs/user-guide/api-docs # Default target .PHONY: all diff --git a/docs/docs/SUMMARY.md b/docs/docs/SUMMARY.md deleted file mode 100644 index 2689c887..00000000 --- a/docs/docs/SUMMARY.md +++ /dev/null @@ -1,2 +0,0 @@ -- [Home](index.md) -- [User Guide](user-guide/) diff --git a/docs/docs/user-guide/developer-guide/example_processed_docs/text/multimodal_test.pdf.metadata.json b/docs/docs/example_processed_docs/text/multimodal_test.pdf.metadata.json similarity index 100% rename from docs/docs/user-guide/developer-guide/example_processed_docs/text/multimodal_test.pdf.metadata.json rename to docs/docs/example_processed_docs/text/multimodal_test.pdf.metadata.json diff --git a/docs/docs/index.md b/docs/docs/index.md index 9c080e3e..648be762 100644 --- a/docs/docs/index.md +++ b/docs/docs/index.md @@ -3,7 +3,7 @@ hide: - navigation --- -**NV-Ingest** is a scalable, performance-oriented document content and metadata extraction microservice. 
NV-Ingest uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.. You can access NV-Ingest as a free community resource or learn more about getting an enterprise license for improved expert-level support at the [NV-Ingest homepage](https://www.nvidia.com). +NeMo Retriever Extraction (NV-Ingest) is a scalable, performance-oriented document content and metadata extraction microservice. NV-Ingest uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications. You can access NV-Ingest as a free community resource or learn more about getting an enterprise license for improved expert-level support at the [NV-Ingest homepage](https://www.nvidia.com).
@@ -14,6 +14,6 @@ hide: Install NV-Ingest and set up your environment to start accelerating your workflows. - [Get Started](user-guide){ .md-button .md-button } + [Get Started](user-guide/overview.md){ .md-button .md-button }
diff --git a/docs/docs/user-guide/SUMMARY.md b/docs/docs/user-guide/SUMMARY.md deleted file mode 100644 index d910214b..00000000 --- a/docs/docs/user-guide/SUMMARY.md +++ /dev/null @@ -1,6 +0,0 @@ -- [What is NVIDIA Ingest?](index.md) -- [Prerequisites](prerequisites.md) -- [Quickstart](quickstart-guide.md) -- [Developer Guide](developer-guide/) -- [Contributing](contributing.md) -- [Release Notes](releasenotes-nv-ingest.md) diff --git a/docs/docs/user-guide/content-metadata.md b/docs/docs/user-guide/content-metadata.md new file mode 100644 index 00000000..3b2f45e8 --- /dev/null +++ b/docs/docs/user-guide/content-metadata.md @@ -0,0 +1,102 @@ +# Source and Content Metadata Reference for NV-Ingest + +This documentation contains the reference for the content metadata. +The definitions used in this documentation are the following: + +- **Source** — The knowledge base file from which content and metadata is extracted. +- **Content** — Data extracted from a source, such as text or an image. + +Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods. + + +## Source Metadata + +The following is the metadata for sources. + +| Field | Description | Method | +|----------|----------------------------------------|----------| +| Source Name | The name of the source. | Extracted | +| Source ID | The ID of the source. | Extracted | +| Source location | The URL, URI, or pointer to the storage location of the source. | — | +| Source Type | The file type of the source, such as pdf, docx, pptx, or txt. | Extracted | +| Collection ID | The ID of the collection in which the source is contained. | — | +| Date Created | The date the source was created. | Extracted | +| Last Modified | The date the source was last modified. | Extracted | +| Partition ID | The offset of this data fragment within a larger set of fragments. | Generated | +| Access Level | The role-based access control for the source. 
| — | +| Summary | A summary of the source. (Not yet implemented.) | Generated | + + +## Content Metadata + +The following is the metadata for content. +These fields apply to all content types including text, images, and tables. + +| Field | Description | Method | +|----------|----------------------------------------|----------| +| Type | The type of the content. Text, Image, Structured, Table, or Chart. | Generated | +| Subtype | The type of the content for structured data types, such as table or chart. | — | +| Content | Content extracted from the source. | Extracted | +| Description | A text description of the content object. | Generated | +| Page \# | The page \# of the content in the source. | Extracted | +| Hierarchy | The location or order of the content within the source. | Extracted | + + +## Text Metadata + +The following is the metadata for text. + +| Field | Description | Method | +|----------|----------------------------------------|----------| +| Text Type | The type of the text, such as header or body. | Extracted | +| Keywords | Keywords, Named Entities, or other phrases. | Extracted | +| Language | The language of the content. | Generated | +| Summary | An abbreviated summary of the content. (Not yet implemented.) | Generated | + + +## Image Metadata + +The following is the metadata for images. + +| Field | Description | Method | +|----------|----------------------------------------|----------| +| Image Type | The type of the image, such as structured, natural, hybrid, and others. | Generated (Classifier) | +| Structured Image Type | The type of the content for structured data types, such as bar chart, pie chart, and others. 
| Generated (Classifier) | +| Caption | Any caption or subheading associated with the image. | Extracted | +| Text | Text extracted from a structured chart. | Extracted | +| Image location | The location (x,y) of the chart within an image. | Extracted | +| Image location max dimensions | The max dimensions (x\_max,y\_max) of the location (x,y). | Extracted | +| uploaded\_image\_uri | A mirror of source\_metadata.source\_location. | — | + + +## Table Metadata + +The following is the metadata for tables within documents. + +!!! warning + Tables should not be chunked. + +| Field | Description | Method | +|----------|----------------------------------------|----------| +| Table format | Structured (dataframe / lists of rows and columns), or serialized as markdown, html, latex, simple (cells separated as spaces). | Extracted | +| Table content | Extracted text content, formatted according to table\_metadata.table\_format. | Extracted | +| Table location | The bounding box of the table. | Extracted | +| Table location max dimensions | The max dimensions (x\_max,y\_max) of the bounding box of the table. | Extracted | +| Caption | The caption for the table or chart. | Extracted | +| Title | The title of the table. | Extracted | +| Subtitle | The subtitle of the table. | Extracted | +| Axis | Axis information for the table. | Extracted | +| uploaded\_image\_uri | A mirror of source\_metadata.source\_location. | Generated | + + + diff --git a/docs/docs/user-guide/contributing.md b/docs/docs/user-guide/contributing.md index cf5af5d8..6a136c21 100644 --- a/docs/docs/user-guide/contributing.md +++ b/docs/docs/user-guide/contributing.md @@ -1,4 +1,4 @@ -# Contributing to NVIDIA-Ingest -External contributions to NVIDIA-Ingest will be welcome soon, and they are greatly appreciated! -For more information, refer to [Contributing to NVIDIA-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md).
+External contributions to NV-Ingest will be welcome soon, and they are greatly appreciated! +For more information, refer to [Contributing to NV-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md). diff --git a/docs/docs/user-guide/developer-guide/deployment.md b/docs/docs/user-guide/deployment.md similarity index 99% rename from docs/docs/user-guide/developer-guide/deployment.md rename to docs/docs/user-guide/deployment.md index 8d81eec2..6164c4f2 100644 --- a/docs/docs/user-guide/developer-guide/deployment.md +++ b/docs/docs/user-guide/deployment.md @@ -1,4 +1,4 @@ -# NV-Ingest Deployment +# Deploy NV-Ingest ## Launch NVIDIA Microservice(s) diff --git a/docs/docs/user-guide/developer-guide/SUMMARY.md b/docs/docs/user-guide/developer-guide/SUMMARY.md deleted file mode 100644 index d238e7f2..00000000 --- a/docs/docs/user-guide/developer-guide/SUMMARY.md +++ /dev/null @@ -1,9 +0,0 @@ -- [Authenticating Local Docker with NGC](ngc-api-key.md) -- [Content Metadata](content-metadata.md) -- [NV-Ingest Deployment](deployment.md) -- [Environment Configuration Variables](environment-config.md) -- [Developing with Kubernetes](kubernetes-dev.md) -- [NV-Ingest Command Line (CLI)](nv-ingest_cli.md) -- [API Reference](api-docs) -- [Telemetry](telemetry.md) -- [Environment Configuration Variables](environment-config.md) diff --git a/docs/docs/user-guide/developer-guide/content-metadata.md b/docs/docs/user-guide/developer-guide/content-metadata.md deleted file mode 100644 index f4f4f702..00000000 --- a/docs/docs/user-guide/developer-guide/content-metadata.md +++ /dev/null @@ -1,61 +0,0 @@ -# Content Metadata - -**Definitions**: - -Source: The knowledge base file from which content and metadata is extracted - -Content: Data extracted from a source: Text or Image - -Metadata: Descriptive data which can be associated with Sources, Content(Image or Text); metadata can be extracted from Source/Content, or generated using models, heuristics, etc - -| | Field | 
Description | Method | -| ----- | :---- | :---- | :---- | -| Content | Content | Content extracted from Source | Extracted | -| Source Metadata | Source Name | Name of source | Extracted | -| | Source ID | ID of source | Extracted | -| | Source location | URL, URI, pointer to storage location | N/A | -| | Source Type | PDF, HTML, Docx, TXT, PPTx | Extracted | -| | Collection ID | Collection in which the source is contained | N/A | -| | Date Created | Date source was created | Extracted | | -| | Last Modified | Date source was last modified | Extracted | | -| | Summary | Summarization of Source Doc (Not Yet Implemented) | Generated | Pending Research | -| | Partition ID | Offset of this data fragment within a larger set of fragments | Generated | -| | Access Level | Dictates RBAC | N/A | -| Content Metadata (applicable to all content types) | Type | Text, Image, Structured, Table, Chart | Generated | -| | Description | Text Description of the content object (Image/Table) | Generated | -| | Page \# | Page \# where content is contained in source | Extracted | -| | Hierarchy | Location/order of content within the source document | Extracted | -| | Subtype | For structured data subtypes \- table, chart, etc.. 
| | | -| Text Metadata | Text Type | Header, body, etc | Extracted | -| | Summary | Abbreviated Summary of content (Not Yet Implemented) | Generated | Pending Research | -| | Keywords | Keywords, Named Entities, or other phrases | Extracted | N | -| | Language | | Generated | N | -| Image Metadata | Image Type | Structured, Natural,Hybrid, etc | Generated (Classifier) | Y(needs to be developed) | -| | Structured Image Type | Bar Chart, Pie Chart, etc | Generated (Classifier) | Y(needs to be developed) | -| | Caption | Any caption or subheader associated with Image | Extracted | -| | Text | Extracted text from a structured chart | Extracted | Pending Research | -| | Image location | Location (x,y) of chart within an image | Extracted | | -| | Image location max dimensions | Max dimensions (x\_max,y\_max) of location (x,y) | Extracted | | -| | uploaded\_image\_uri | Mirrors source\_metadata.source\_location | | | -| Table Metadata (tables within documents) | Table format | Structured (dataframe / lists of rows and columns), or serialized as markdown, html, latex, simple (cells separated just as spaces) | Extracted | -| | Table content | Extracted text content, formatted according to table\_metadata.table\_format. 
Important: Tables should not be chunked | Extracted | | -| | Table location | Bounding box of the table | Extracted | | -| | Table location max dimensions | Max dimensions (x\_max,y\_max) of bounding box of the table | Extracted | | -| | Caption | Detected captions for the table/chart | Extracted | | -| | Title | TODO | Extracted | | -| | Subtitle | TODO | Extracted | | -| | Axis | TODO | Extracted | | -| | uploaded\_image\_uri | Mirrors source\_metadata.source\_location | Generated | | - - - diff --git a/docs/docs/user-guide/developer-guide/environment-config.md b/docs/docs/user-guide/environment-config.md similarity index 98% rename from docs/docs/user-guide/developer-guide/environment-config.md rename to docs/docs/user-guide/environment-config.md index 173118e8..64c367ec 100644 --- a/docs/docs/user-guide/developer-guide/environment-config.md +++ b/docs/docs/user-guide/environment-config.md @@ -1,4 +1,4 @@ -# Environment Configuration Variables +# Environment Configuration Variables for NV-Ingest The following are the environment configuration variables that you can specify in your .env file. diff --git a/docs/docs/user-guide/developer-guide/kubernetes-dev.md b/docs/docs/user-guide/kubernetes-dev.md similarity index 99% rename from docs/docs/user-guide/developer-guide/kubernetes-dev.md rename to docs/docs/user-guide/kubernetes-dev.md index 966fa441..599b6d68 100644 --- a/docs/docs/user-guide/developer-guide/kubernetes-dev.md +++ b/docs/docs/user-guide/kubernetes-dev.md @@ -1,4 +1,4 @@ -# Developing with Kubernetes +# Developing with NV-Ingest on Kubernetes Developing directly on Kubernetes gives us more confidence that end-user deployments will work as expected. 
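The quickstart and environment-config changes in this patch revolve around a `.env` file that holds your keys. As a rough sketch only: `DOWNLOAD_LLAMA_TOKENIZER`, `HF_ACCESS_TOKEN`, and `NVIDIA_BUILD_API_KEY` are named in this change, while `NGC_API_KEY` and every placeholder value are assumptions you must replace with your own credentials.

```shell
# Sketch of a minimal .env file; all values are placeholders, not real keys.

# Container images must access resources from NGC.
NGC_API_KEY=<your-ngc-personal-key>          # assumed variable name
NVIDIA_BUILD_API_KEY=<your-build-api-key>

# Optional: fetch the gated llama-3.2 tokenizer at container build time.
DOWNLOAD_LLAMA_TOKENIZER=True
HF_ACCESS_TOKEN=<your-huggingface-token>
```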
diff --git a/docs/docs/user-guide/developer-guide/ngc-api-key.md b/docs/docs/user-guide/ngc-api-key.md similarity index 89% rename from docs/docs/user-guide/developer-guide/ngc-api-key.md rename to docs/docs/user-guide/ngc-api-key.md index e99d87dd..54ef70b6 100644 --- a/docs/docs/user-guide/developer-guide/ngc-api-key.md +++ b/docs/docs/user-guide/ngc-api-key.md @@ -1,4 +1,4 @@ -# Authenticating Local Docker with NGC +# Generate Your NGC Keys ## Generate an API key @@ -12,7 +12,7 @@ When creating an NGC API key, ensure that all of the following are selected from - NGC Catalog - Private Registry -![Generate Personal Key](../../assets/images/generate_personal_key.png) +![Generate Personal Key](../assets/images/generate_personal_key.png) ### Docker Login to NGC diff --git a/docs/docs/user-guide/developer-guide/nv-ingest_cli.md b/docs/docs/user-guide/nv-ingest_cli.md similarity index 98% rename from docs/docs/user-guide/developer-guide/nv-ingest_cli.md rename to docs/docs/user-guide/nv-ingest_cli.md index 41f3a351..55fa0017 100644 --- a/docs/docs/user-guide/developer-guide/nv-ingest_cli.md +++ b/docs/docs/user-guide/nv-ingest_cli.md @@ -1,6 +1,7 @@ -# NV-Ingest Command Line (CLI) +# NV-Ingest Command Line Interface Reference -After installing the Python dependencies, you'll be able to use the nv-ingest-cli tool. +After you install the Python dependencies, you can use the NV-Ingest command line interface (CLI). +To use the CLI, use the `nv-ingest-cli` command. ```bash nv-ingest-cli --help diff --git a/docs/docs/user-guide/index.md b/docs/docs/user-guide/overview.md similarity index 70% rename from docs/docs/user-guide/index.md rename to docs/docs/user-guide/overview.md index ce48a962..b00d2dca 100644 --- a/docs/docs/user-guide/index.md +++ b/docs/docs/user-guide/overview.md @@ -1,12 +1,16 @@ -# What is NVIDIA-Ingest? +# What is NVIDIA Ingest? -NV-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. 
-NV-Ingest uses specialized NVIDIA NIM microservices +NVIDIA Ingest is a scalable, performance-oriented document content and metadata extraction microservice. +NVIDIA Ingest uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images that you can use in downstream generative applications. -NV-Ingest also enables parallelization of the process of splitting documents into pages where contents are classified (such as tables, charts, images, text), +!!! note + + NVIDIA Ingest is also known as NV-Ingest and NeMo Retriever Extraction. + +NVIDIA Ingest also enables parallelization of the process of splitting documents into pages where contents are classified (such as tables, charts, images, text), extracted into discrete content, and further contextualized through optical character recognition (OCR) into a well defined JSON schema. -From there, NVIDIA-Ingest can optionally manage computation of embeddings for the extracted content, +From there, NVIDIA Ingest can optionally manage computation of embeddings for the extracted content, and optionally manage storing into a vector database [Milvus](https://milvus.io/). !!! note @@ -14,30 +18,30 @@ and optionally manage storing into a vector database [Milvus](https://milvus.io/ Cached and Deplot are deprecated. Instead, docker-compose now uses a beta version of the yolox-graphic-elements container. With this change, you should now be able to run nv-ingest on a single 80GB A100 or H100 GPU. If you want to use the old pipeline, with Cached and Deplot, use the [nv-ingest 24.12.1 release](https://github.com/NVIDIA/nv-ingest/tree/24.12.1). -## What NVIDIA-Ingest Is ✔️ -NV-Ingest is a microservice service that does the following: +## What NVIDIA Ingest Is ✔️ +NVIDIA Ingest is a microservice that does the following: - Accept a JSON job description, containing a document payload, and a set of ingestion tasks to perform on that payload.
- Allow the results of a job to be retrieved. The result is a JSON dictionary that contains a list of metadata describing objects extracted from the base document, and processing annotations and timing/trace data. - Support multiple methods of extraction for each document type to balance trade-offs between throughput and accuracy. For example, for .pdf documents, we support extraction through pdfium, Unstructured.io, and Adobe Content Extraction Services. - Support various types of pre- and post- processing operations, including text splitting and chunking, transform and filtering, embedding generation, and image offloading to storage. -NV-Ingest supports the following file types: +NVIDIA Ingest supports the following file types: +- `pdf` - `docx` +- `pptx` - `jpeg` -- `pdf` - `png` -- `pptx` - `svg` - `tiff` - `txt` -## What NVIDIA-Ingest Isn't ✖️ -NV-Ingest does not do the following: +## What NVIDIA Ingest Isn't ✖️ +NVIDIA Ingest does not do the following: - Run a static pipeline or fixed set of operations on every submitted document. - Act as a wrapper for any specific document parsing library. diff --git a/docs/docs/user-guide/prerequisites.md b/docs/docs/user-guide/prerequisites.md index 5dd8b670..0b634f3f 100644 --- a/docs/docs/user-guide/prerequisites.md +++ b/docs/docs/user-guide/prerequisites.md @@ -1,6 +1,6 @@ -# Prerequisites +# Prerequisites for NV-Ingest -Before you begin using NVIDIA-Ingest, ensure the following hardware and software prerequisites outlined are met. +Before you begin using NV-Ingest, ensure that the following hardware and software prerequisites are met. ## Hardware @@ -20,4 +20,4 @@ Before you begin using NVIDIA-Ingest, ensure the following hardware and software !!! note - You install Python later. NVIDIA-Ingest only supports [Python version 3.10](https://www.python.org/downloads/release/python-3100/). + You install Python later. NV-Ingest only supports [Python version 3.10](https://www.python.org/downloads/release/python-3100/).
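The overview above describes a job result as a JSON list of metadata objects describing content extracted from the base document. A minimal sketch of consuming such a result follows; the `document_type` field name and the sample payload shape are illustrative assumptions based on the docs' description of the result schema, not a documented contract.

```python
from collections import Counter

def summarize_results(results):
    """Count extracted elements per content type in a job result list."""
    # Each element is assumed to be a dict with a "document_type" key,
    # e.g. "text", "image", or "structured"; unknown shapes are tolerated.
    return Counter(item.get("document_type", "unknown") for item in results)

# Example payload shaped like the *.metadata.json files referenced in this change.
sample = [
    {"document_type": "text", "metadata": {"content": "Hello world"}},
    {"document_type": "structured", "metadata": {"table_format": "markdown"}},
    {"document_type": "image", "metadata": {}},
    {"document_type": "text", "metadata": {"content": "Another chunk"}},
]
print(summarize_results(sample))  # Counter({'text': 2, 'structured': 1, 'image': 1})
```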
diff --git a/docs/docs/user-guide/quickstart-guide.md b/docs/docs/user-guide/quickstart-guide.md index 54b1ca9a..be78d39a 100644 --- a/docs/docs/user-guide/quickstart-guide.md +++ b/docs/docs/user-guide/quickstart-guide.md @@ -1,6 +1,6 @@ -# Quickstart Guide +# Quickstart Guide for NV-Ingest -To get started using NVIDIA-Ingest, you need to do a few things: +To get started using NV-Ingest, you need to do a few things: 1. [Start supporting NIM microservices](#step-1-starting-containers) 🏗️ 2. [Install the NVIDIA Ingest client dependencies in a Python environment](#step-2-installing-python-dependencies) 🐍 @@ -16,7 +16,7 @@ This example demonstrates how to use the provided [docker-compose.yaml](https:// NIM containers on their first startup can take 10-15 minutes to pull and fully load models. -If you prefer, you can also [start services one by one](developer-guide/deployment.md) or run on Kubernetes by using [our Helm chart](https://github.com/NVIDIA/nv-ingest/blob/main/helm/README.md). Also, there are [additional environment variables](developer-guide/environment-config.md) you want to configure. +If you prefer, you can also [start services one by one](deployment.md) or run on Kubernetes by using [our Helm chart](https://github.com/NVIDIA/nv-ingest/blob/main/helm/README.md). Also, there are [additional environment variables](environment-config.md) you may want to configure. 1. Git clone the repo: `git clone https://github.com/nvidia/nv-ingest` 2. Change directory to the cloned repo `cd nv-ingest`. -3. [Generate API keys](developer-guide/ngc-api-key.md) and authenticate with NGC with the `docker login` command: +3.
[Generate API keys](ngc-api-key.md) and authenticate with NGC with the `docker login` command: ```shell # This is required to access pre-built containers and NIM microservices @@ -39,7 +39,7 @@ If you prefer, you can also [start services one by one](developer-guide/deployme During the early access (EA) phase, you must apply for early access at [https://developer.nvidia.com/nemo-microservices-early-access/join](https://developer.nvidia.com/nemo-microservices-early-access/join). When your early access is approved, follow the instructions in the email to create an organization and team, link your profile, and generate your NGC API key. -4. Create a .env file containing your NGC API key and the following paths. For more information, refer to [Environment Configuration Variables](developer-guide/environment-config.md). +4. Create a .env file containing your NGC API key and the following paths. For more information, refer to [Environment Configuration Variables](environment-config.md). ``` # Container images must access resources from NGC. @@ -110,7 +110,7 @@ If you prefer, you can also [start services one by one](developer-guide/deployme NV-Ingest is in early access (EA) mode, meaning the codebase gets frequent updates. To build an updated NV-Ingest service container with the latest changes, you can run `docker compose build`. After the image builds, run `docker compose --profile retrieval up` or `docker compose up --build` as explained in the previous step. -## Step 2: Installing Python Dependencies +## Step 2: Install Python Dependencies You can interact with the NV-Ingest service from the host or by `docker exec`-ing into the NV-Ingest container. @@ -130,7 +130,7 @@ pip install . ## Step 3: Ingesting Documents -You can submit jobs programmatically in Python or using the nv-ingest-cli tool. +You can submit jobs programmatically in Python or using the [NV-Ingest CLI](nv-ingest_cli.md). 
In the below examples, we are doing text, chart, table, and image extraction: @@ -266,7 +266,7 @@ processed_docs/text: multimodal_test.pdf.metadata.json ``` -For the full metadata definitions, refer to [Content Metadata](developer-guide/content-metadata.md). +For the full metadata definitions, refer to [Content Metadata](content-metadata.md). We also provide a script for inspecting [extracted images](https://github.com/NVIDIA/nv-ingest/blob/main/src/util/image_viewer.py). diff --git a/docs/docs/user-guide/releasenotes-nv-ingest.md b/docs/docs/user-guide/releasenotes-nv-ingest.md index e2b0eb11..45117ea3 100644 --- a/docs/docs/user-guide/releasenotes-nv-ingest.md +++ b/docs/docs/user-guide/releasenotes-nv-ingest.md @@ -1,4 +1,4 @@ -# NVIDIA-Ingest Release Notes +# Release Notes for NV-Ingest ## Release 24.12.1 diff --git a/docs/docs/user-guide/developer-guide/telemetry.md b/docs/docs/user-guide/telemetry.md similarity index 80% rename from docs/docs/user-guide/developer-guide/telemetry.md rename to docs/docs/user-guide/telemetry.md index 97325493..1766bac1 100644 --- a/docs/docs/user-guide/developer-guide/telemetry.md +++ b/docs/docs/user-guide/telemetry.md @@ -1,4 +1,4 @@ -# Telemetry +# Telemetry with NV-Ingest ## Docker Compose @@ -11,7 +11,7 @@ $ docker compose up otel-collector Once OpenTelemetry and Zipkin are running, you can open your browser to explore traces: http://$YOUR_DOCKER_HOST:9411/zipkin/. 
-![](../../assets/images/zipkin.png) +![](../assets/images/zipkin.png) To run Prometheus, run: @@ -21,4 +21,4 @@ $ docker compose up prometheus Once Prometheus is running, you can open your browser to explore metrics: [http://$YOUR_DOCKER_HOST:9090/] -![](../../assets/images/prometheus.png) +![](../assets/images/prometheus.png) diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 533e1f39..fde92f08 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -1,4 +1,4 @@ -site_name: NV-Ingest Documentation +site_name: NeMo Retriever Extraction (NV-Ingest) Documentation # site_url: repo_url: https://github.com/NVIDIA/nv-ingest # repo_name: @@ -15,6 +15,7 @@ theme: - navigation.instant.prefetch - navigation.top - navigation.footer + - navigation.expand - search.suggest - search.highlight - content.code.copy @@ -47,6 +48,26 @@ extra_css: - assets/css/custom-material.css - assets/css/jupyter-themes.css +nav: + - Home: index.md + - User Guide: + - What is NVIDIA Ingest?: user-guide/overview.md + - Release Notes: user-guide/releasenotes-nv-ingest.md + - Get Started: + - Prerequisites: user-guide/prerequisites.md + - Generate Your NGC Keys: user-guide/ngc-api-key.md + - Quickstart: user-guide/quickstart-guide.md + - Developer Guide: + - Deploy NV-Ingest: user-guide/deployment.md + - NV-Ingest on Kubernetes: user-guide/kubernetes-dev.md + - Telemetry: user-guide/telemetry.md + - Contribute: user-guide/contributing.md + - Reference: + - Content Metadata: user-guide/content-metadata.md + - Environment Variables: user-guide/environment-config.md + - CLI Reference: user-guide/nv-ingest_cli.md + - API Reference: user-guide/api-docs.md + plugins: - search # uncomment below to grab api @@ -68,8 +89,6 @@ plugins: highlight_extra_classes: "jupyter-notebook" - include_dir_to_nav: file_pattern: '.*\.(md|ipynb)$' - - literate-nav: - nav_file: SUMMARY.md - site-urls markdown_extensions: @@ -104,4 +123,4 @@ markdown_extensions: # github_url: copyright: | - © Copyright 2023-2024 NVIDIA
CORPORATION & AFFILIATES. All rights reserved. + © Copyright 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. diff --git a/examples/langchain_multimodal_rag.ipynb b/examples/langchain_multimodal_rag.ipynb index a5b11c6c..b872b2dc 100644 --- a/examples/langchain_multimodal_rag.ipynb +++ b/examples/langchain_multimodal_rag.ipynb @@ -21,7 +21,7 @@ "id": "c6905d11-0ec3-43c8-961b-24cb52e36bfe", "metadata": {}, "source": [ - "**Note:** In order to run this notebook, you need to have the NV-Ingest microservice running along with all of the other included microservices. To do this, make sure all of the services are uncommented in the file: [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) and follow the [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart) to start everything up. You also need to install the NV-Ingest python client installed as explained in [Step 2: Instal Python dependencies](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#step-2-installing-python-dependencies)." + "**Note:** To run this notebook, you need to have the NV-Ingest microservice running along with all of the other included microservices. To do this, make sure all of the services are uncommented in the file: [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) and follow the [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart) to start everything up. You also need to have the NV-Ingest Python client installed as explained in [Step 2: Install Python dependencies](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#step-2-installing-python-dependencies)."
] }, { diff --git a/examples/llama_index_multimodal_rag.ipynb b/examples/llama_index_multimodal_rag.ipynb index dc2578b1..280954d5 100644 --- a/examples/llama_index_multimodal_rag.ipynb +++ b/examples/llama_index_multimodal_rag.ipynb @@ -21,7 +21,7 @@ "id": "c65edc4b-2084-47c9-a837-733264201802", "metadata": {}, "source": [ - "**Note:** In order to run this notebook, you need to have the NV-Ingest microservice running along with all of the other included microservices. To do this, make sure all of the services are uncommented in the file: [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) and follow the [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart) to start everything up. You also need to install the NV-Ingest python client installed as explained in [Step 2: Instal Python dependencies](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#step-2-installing-python-dependencies)." + "**Note:** To run this notebook, you need to have the NV-Ingest microservice running along with all of the other included microservices. To do this, make sure all of the services are uncommented in the file: [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) and follow the [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart) to start everything up. You also need to have the NV-Ingest Python client installed as explained in [Step 2: Install Python dependencies](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#step-2-installing-python-dependencies)."
] }, { diff --git a/examples/store_and_display_images.ipynb b/examples/store_and_display_images.ipynb index 8410e643..9d43c5a8 100644 --- a/examples/store_and_display_images.ipynb +++ b/examples/store_and_display_images.ipynb @@ -21,7 +21,7 @@ "id": "2a598d15-adf0-406a-95c6-6d49c0939508", "metadata": {}, "source": [ - "**Note:** In order to run this notebook, you need to have the NV-Ingest microservice running along with all of the other included microservices. To do this, make sure all of the services are uncommented in the file: [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) and follow the [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart) to start everything up. You also need to install the NV-Ingest python client installed as explained in [Step 2: Instal Python dependencies](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#step-2-installing-python-dependencies)." + "**Note:** To run this notebook, you need to have the NV-Ingest microservice running along with all of the other included microservices. To do this, make sure all of the services are uncommented in the file: [docker-compose.yaml](https://github.com/NVIDIA/nv-ingest/blob/main/docker-compose.yaml) and follow the [quickstart guide](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#quickstart) to start everything up. You also need to have the NV-Ingest Python client installed as explained in [Step 2: Install Python dependencies](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#step-2-installing-python-dependencies)." ] }, { diff --git a/helm/README.md b/helm/README.md index 51764cef..cfe6d2b0 100644 --- a/helm/README.md +++ b/helm/README.md @@ -1,4 +1,10 @@ -# NVIDIA-Ingest Helm Charts +# NV-Ingest Helm Charts + +This page contains documentation for the NV-Ingest Helm charts. + +> [!Note] +> NV-Ingest is also known as NVIDIA Ingest and NeMo Retriever Extraction.
+ > [!WARNING] > NV-Ingest version 24.08 exposed Redis directly to the client, as such setup for the [24.08](https://github.com/NVIDIA/nv-ingest/releases/tag/24.08) `nv-ingest-cli` differs. @@ -9,6 +15,7 @@ ## Prerequisites ### Hardware/Software + [Refer to our supported hardware/software configurations here](https://github.com/NVIDIA/nv-ingest?tab=readme-ov-file#hardware). ## Setup Environment diff --git a/skaffold/README.md b/skaffold/README.md index 02d57ecd..7933f9ba 100644 --- a/skaffold/README.md +++ b/skaffold/README.md @@ -1,7 +1,8 @@ -# Skaffold - Dev Team +# Skaffold - NV-Ingest Development Team Only -Skaffold is intended to support the nv-ingest development team with kubernetes dev and testing. It is not meant to be used in production deployments or even for local testing. -We offer k8s support via Helm and those instructions can be found at [Helm Documentation](../helm/README.md). +Skaffold is intended to support the NV-Ingest development team with Kubernetes development and testing. It is not meant to be used in production deployments or for local testing. -For developers further documentation for using Skaffold can be found at [Skaffold Documentation](/docs/docs/user-guide/developer-guide/kubernetes-dev.md). +We offer Kubernetes support through Helm, and you can find those instructions at [Helm Documentation](../helm/README.md). + +Development team members can find Skaffold documentation at [Skaffold Documentation](/docs/docs/user-guide/developer-guide/kubernetes-dev.md).