NVIDIA · randerzander · Feb 27, 2025
@@ -19,6 +19,9 @@ limitations under the License.
 
 The nv-ingest devcontainer is provided as a quick-to-set-up development and exploration environment for use with [Visual Studio Code](https://code.visualstudio.com) (Code). The devcontainer is a lightweight container which mounts-in a Conda environment with cached packages, alleviating long Conda download times on subsequent launches. It provides a simple framework for adding developer-centric [scripts](#development-scripts), and incorporates some helpful Code plugins.
 
+> [!Note]
+> NV-Ingest is also known as NVIDIA Ingest and NeMo Retriever Extraction.
+
 More information about devcontainers can be found at [`containers.dev`](https://containers.dev/).
 
 ## Getting Started

@@ -80,7 +80,7 @@ git submodule update --init --recursive
    ** [Create a pull request](https://github.com/NVIDIA/nv-ingest/pulls) once your
    code is ready.
 5. **Code Review:** Wait for the review by other developers and make necessary updates.
-6. **Merge:** Once approved, an NV-Ingest developer will approve your pull request.
+6. **Merge:** After approval, an NVIDIA developer will merge your pull request.
 
 ### Seasoned Developers
 

@@ -8,6 +8,9 @@ SPDX-License-Identifier: Apache-2.0
 
 NVIDIA-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. Including support for parsing PDFs, Word and PowerPoint documents, it uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.
 
+> [!Note]
+> NVIDIA Ingest is also known as NV-Ingest and NeMo Retriever Extraction.
+
 NVIDIA Ingest enables parallelization of the process of splitting documents into pages where contents are classified (as tables, charts, images, text), extracted into discrete content, and further contextualized via optical character recognition (OCR) into a well defined JSON schema. From there, NVIDIA Ingest can optionally manage computation of embeddings for the extracted content, and also optionally manage storing into a vector database [Milvus](https://milvus.io/).
 
 > [!Note]
@@ -37,11 +40,11 @@ NV-Ingest is a microservice service that does the following:
 
 NV-Ingest supports the following file types:
 
+- `pdf`
 - `docx`
+- `pptx`
 - `jpeg`
-- `pdf`
 - `png`
-- `pptx`
 - `svg`
 - `tiff`
 - `txt`
@@ -84,7 +87,7 @@ To get started using NVIDIA Ingest, you need to do a few things:
 4. [Inspect and consume results](#step-4-inspecting-and-consuming-results) 🔍
 
 Optional:
-1. [Direct Library Deployment](docs/docs/user-guide/developer-guide/deployment.md) 📦
+1. [Direct Library Deployment](docs/docs/user-guide/deployment.md) 📦
 
 ### Step 1: Starting containers
 
@@ -93,14 +96,14 @@ This example demonstrates how to use the provided [docker-compose.yaml](docker-c
 > [!IMPORTANT]
 > NIM containers on their first startup can take 10-15 minutes to pull and fully load models.
 
-If you prefer, you can also [start services one by one](docs/docs/user-guide/developer-guide/deployment.md), or run on Kubernetes via [our Helm chart](helm/README.md). Also of note are [additional environment variables](docs/docs/user-guide/developer-guide/environment-config.md) you may wish to configure.
+If you prefer, you can also [start services one by one](docs/docs/user-guide/deployment.md), or run on Kubernetes via [our Helm chart](helm/README.md). Also of note are [additional environment variables](docs/docs/user-guide/environment-config.md) you may wish to configure.
 
 1. Git clone the repo:
 `git clone https://github.com/nvidia/nv-ingest`
 2. Change directory to the cloned repo
 `cd nv-ingest`.
 
-3. [Generate API keys](docs/docs/user-guide/developer-guide/ngc-api-key.md) and authenticate with NGC with the `docker login` command:
+3. [Generate API keys](docs/docs/user-guide/ngc-api-key.md) and authenticate with NGC with the `docker login` command:
 ```shell
 # This is required to access pre-built containers and NIM microservices
 $ docker login nvcr.io
@@ -112,7 +115,7 @@ Password: <Your Key>
 > During the early access (EA) phase, you must apply for early access here: https://developer.nvidia.com/nemo-microservices-early-access/join.
 > When your early access is approved, follow the instructions in the email to create an organization and team, link your profile, and generate your NGC API key.
 
-4. Create a .env file that contains your NGC API keys. For more information, refer to [Environment Configuration Variables](docs/docs/user-guide/developer-guide/environment-config.md).
+4. Create a .env file that contains your NGC API keys. For more information, refer to [Environment Configuration Variables](docs/docs/user-guide/environment-config.md).
 
 ```
 # Container images must access resources from NGC.
@@ -132,7 +135,7 @@ NVIDIA_BUILD_API_KEY=<key to use NIMs that are hosted on build.nvidia.com>
 > `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`
 
 > [!NOTE]
-> The most accurate tokenizer based splitting depends on the [llama-3.2 tokenizer](https://huggingface.co/meta-llama/Llama-3.2-1B). To download this model at container build time, you must set `DOWNLOAD_LLAMA_TOKENIZER=True` _and_ supply an authorized HuggingFace access token via `HF_ACCESS_TOKEN=<your access token>`. If not, the ungated [e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) tokenizer model will be downloaded instead. By default, the split task will use whichever model has been predownloaded. Refer to [Environment Configuration Variables](docs/docs/user-guide/developer-guide/environment-config.md) for more info.
+> The most accurate tokenizer based splitting depends on the [llama-3.2 tokenizer](https://huggingface.co/meta-llama/Llama-3.2-1B). To download this model at container build time, you must set `DOWNLOAD_LLAMA_TOKENIZER=True` _and_ supply an authorized HuggingFace access token via `HF_ACCESS_TOKEN=<your access token>`. If not, the ungated [e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) tokenizer model will be downloaded instead. By default, the split task will use whichever model has been predownloaded. Refer to [Environment Configuration Variables](docs/docs/user-guide/environment-config.md) for more info.
 
 5. Start all services:
 `docker compose --profile retrieval up`
@@ -188,7 +191,7 @@ ac27e5297d57   prom/prometheus:latest
 >
 > After the image builds, run `docker compose --profile retrieval up` or `docker compose up --build` as explained in the previous step.
 
-### Step 2: Installing Python dependencies
+### Step 2: Install Python dependencies
 
 To interact with the nv-ingest service, you can do so from the host, or by `docker exec`-ing into the nv-ingest container.
 
@@ -225,7 +228,7 @@ pip install .
 
 ### Step 3: Ingesting Documents
 
-You can submit jobs programmatically in Python or via the nv-ingest-cli tool.
+You can submit jobs programmatically in Python or via the [NV-Ingest CLI](nv-ingest_cli.md).
 
 In the below examples, we are doing text, chart, table, and image extraction:
 
@@ -359,7 +362,7 @@ multimodal_test.pdf.metadata.json
 processed_docs/text:
 multimodal_test.pdf.metadata.json
 ```
-For the full metadata definitions, refer to [Content Metadata](/docs/docs/user-guide/developer-guide/content-metadata.md).
+For the full metadata definitions, refer to [Content Metadata](/docs/docs/user-guide/content-metadata.md).
 
 #### We also provide a script for inspecting [extracted images](src/util/image_viewer.py)
 

@@ -8,6 +8,10 @@ SPDX-License-Identifier: Apache-2.0
 
 NV-Ingest-Client is a tool designed for efficient ingestion and processing of large datasets. It provides both a Python API and a command-line interface to cater to various ingestion needs.
 
+> [!Note]
+> NV-Ingest is also known as NVIDIA Ingest and NeMo Retriever Extraction.
+
+
 ## Table of Contents
 
 1. [Installation](#installation)

@@ -5,7 +5,7 @@
 # Define paths
 SPHINX_BUILD_DIR=sphinx_docs/build
 SPHINX_SOURCE_DIR=sphinx_docs/source
-SPHINX_OUTPUT_DIR=docs/user-guide/developer-guide/api-docs
+SPHINX_OUTPUT_DIR=docs/user-guide/api-docs
 
 # Default target
 .PHONY: all

@@ -3,7 +3,7 @@ hide:
   - navigation
 ---
 
-**NV-Ingest** is a scalable, performance-oriented document content and metadata extraction microservice. NV-Ingest uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.. You can access NV-Ingest as a free community resource or learn more about getting an enterprise license for improved expert-level support at the [NV-Ingest homepage](https://www.nvidia.com).
+NeMo Retriever Extraction (NV-Ingest) is a scalable, performance-oriented document content and metadata extraction microservice. NV-Ingest uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications. You can access NV-Ingest as a free community resource or learn more about getting an enterprise license for improved expert-level support at the [NV-Ingest homepage](https://www.nvidia.com).
 
 
 <div class="grid cards" markdown>
@@ -14,6 +14,6 @@ hide:
 
     Install NV-Ingest and set up your environment to start accelerating your workflows.
 
-    [Get Started](user-guide){ .md-button .md-button }
+    [Get Started](user-guide/overview.md){ .md-button .md-button }
 
 </div>
@@ -0,0 +1,102 @@
+# Source and Content Metadata Reference for NV-Ingest
+
+This documentation contains the reference for the content metadata. 
+The definitions used in this documentation are the following:
+
+- **Source** — The knowledge base file from which content and metadata are extracted.
+- **Content** — Data extracted from a source, such as text or an image.
+
+Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods.
+
+
+## Source Metadata
+
+The following is the metadata for sources.
+
+| Field    | Description | Method |
+|----------|----------------------------------------|----------|
+| Source Name | The name of the source. | Extracted |
+| Source ID | The ID of the source.  | Extracted |
+| Source location | The URL, URI, or pointer to the storage location of the source. | —  |
+| Source Type | The file type of the source, such as pdf, docx, pptx, or txt. | Extracted |
+| Collection ID | The ID of the collection in which the source is contained. | — |
+| Date Created | The date the source was created. | Extracted |
+| Last Modified | The date the source was last modified. | Extracted |
+| Partition ID | The offset of this data fragment within a larger set of fragments. | Generated |
+| Access Level | The role-based access control for the source. | — |
+| Summary | A summary of the source. (Not yet implemented.) | Generated |
+
+
+## Content Metadata
+
+The following is the metadata for content. 
+These fields apply to all content types including text, images, and tables.
+
+| Field    | Description | Method |
+|----------|----------------------------------------|----------|
+| Type | The type of the content: Text, Image, Structured, Table, or Chart. | Generated |
+| Subtype | The type of the content for structured data types, such as table or chart. | — |
+| Content | Content extracted from the source.  | Extracted |
+| Description | A text description of the content object. | Generated |
+| Page \# | The page \# of the content in the source. | Extracted |
+| Hierarchy | The location or order of the content within the source.  | Extracted |
+
+
+## Text Metadata
+
+The following is the metadata for text.
+
+| Field    | Description | Method |
+|----------|----------------------------------------|----------|
+| Text Type | The type of the text, such as header or body. | Extracted |
+| Keywords | Keywords, Named Entities, or other phrases.  | Extracted |
+| Language | The language of the content. | Generated |
+| Summary | An abbreviated summary of the content. (Not yet implemented.) | Generated |
+
+
+## Image Metadata
+
+The following is the metadata for images.
+
+| Field    | Description | Method |
+|----------|----------------------------------------|----------|
+| Image Type | The type of the image, such as structured, natural, hybrid, and others. | Generated (Classifier) |
+| Structured Image Type | The type of the content for structured data types, such as bar chart, pie chart, and others. | Generated (Classifier) |
+| Caption | Any caption or subheading associated with the image. | Extracted |
+| Text | The text extracted from a structured chart. (Pending research.) | Extracted |
+| Image location | The location (x,y) of the chart within an image. | Extracted |
+| Image location max dimensions | The max dimensions (x\_max,y\_max) of the location (x,y). | Extracted |
+| uploaded\_image\_uri | A mirror of source\_metadata.source\_location. | — |
+
+
+## Table Metadata
+
+The following is the metadata for tables within documents.
+
+!!! warning
+    Tables should not be chunked.
+
+| Field    | Description | Method |
+|----------|----------------------------------------|----------|
+| Table format | Structured (dataframe / lists of rows and columns), or serialized as markdown, html, latex, or simple (cells separated by spaces). | Extracted |
+| Table content | Extracted text content, formatted according to table\_metadata.table\_format. | Extracted |
+| Table location | The bounding box of the table. | Extracted |
+| Table location max dimensions | The max dimensions (x\_max,y\_max) of the bounding box of the table.  | Extracted |
+| Caption | The caption for the table or chart. | Extracted |
+| Title | The title of the table. | Extracted |
+| Subtitle | The subtitle of the table. | Extracted |
+| Axis | Axis information for the table. | Extracted |
+| uploaded\_image\_uri | A mirror of source\_metadata.source\_location. | Generated |
+
+
+<!--
+2025-01-23 NKM: Commenting out this section
+I can find only the first (text) file, and it is empty
+I can't find the other 2 files (images, charts and tables) at all
+If we get the files, we can add this back
+
+## Example Text Extracts for multimodal_test.pdf:
+1. [text](example_processed_docs/text/multimodal_test.pdf.metadata.json)
+2. [images](example_processed_docs/image/multimodal_test.pdf.metadata.json)
+3. [charts and tables](example_processed_docs/structured/multimodal_test.pdf.metadata.json)
+-->
@@ -1,4 +1,4 @@
-# Contributing to NVIDIA-Ingest
+# Contributing to NV-Ingest
 
-External contributions to NVIDIA-Ingest will be welcome soon, and they are greatly appreciated! 
-For more information, refer to [Contributing to NVIDIA-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md).
+External contributions to NV-Ingest will be welcome soon, and they are greatly appreciated! 
+For more information, refer to [Contributing to NV-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md).
@@ -1,4 +1,4 @@
-# NV-Ingest Deployment
+# Deploy NV-Ingest
 
 ## Launch NVIDIA Microservice(s)
 

@@ -1,4 +1,4 @@
-# Environment Configuration Variables
+# Environment Configuration Variables for NV-Ingest
 
 The following are the environment configuration variables that you can specify in your .env file.
 

@@ -1,4 +1,4 @@
-# Developing with Kubernetes
+# Developing with NV-Ingest on Kubernetes
 
 Developing directly on Kubernetes gives us more confidence that end-user deployments will work as expected.