Update Table of Contents, update metadata page, add name note, fix a few names and typos #495

Merged
merged 3 commits into from
Feb 27, 2025
3 changes: 3 additions & 0 deletions .devcontainer/README.md
@@ -19,6 +19,9 @@ limitations under the License.

The nv-ingest devcontainer is provided as a quick-to-set-up development and exploration environment for use with [Visual Studio Code](https://code.visualstudio.com) (Code). The devcontainer is a lightweight container which mounts-in a Conda environment with cached packages, alleviating long Conda download times on subsequent launches. It provides a simple framework for adding developer-centric [scripts](#development-scripts), and incorporates some helpful Code plugins.

> [!Note]
> NV-Ingest is also known as NVIDIA Ingest and NeMo Retriever Extraction.
Collaborator

Isabel emphasized that we use a lowercase "e" instead of "E" in "extraction", so it would be "NeMo Retriever extraction".


More information about devcontainers can be found at [`containers.dev`](https://containers.dev/).

## Getting Started
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -80,7 +80,7 @@ git submodule update --init --recursive
** [Create a pull request](https://github.com/NVIDIA/nv-ingest/pulls) once your
code is ready.
5. **Code Review:** Wait for the review by other developers and make necessary updates.
-6. **Merge:** Once approved, an NV-Ingest developer will approve your pull request.
+6. **Merge:** After approval, an NVIDIA developer will approve your pull request.

### Seasoned Developers

23 changes: 13 additions & 10 deletions README.md
@@ -8,6 +8,9 @@ SPDX-License-Identifier: Apache-2.0

NVIDIA-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. Including support for parsing PDFs, Word and PowerPoint documents, it uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.

> [!Note]
> NVIDIA Ingest is also known as NV-Ingest and NeMo Retriever Extraction.
Collaborator

same here - NeMo Retriever extraction


NVIDIA Ingest enables parallelization of the process of splitting documents into pages where contents are classified (as tables, charts, images, text), extracted into discrete content, and further contextualized via optical character recognition (OCR) into a well defined JSON schema. From there, NVIDIA Ingest can optionally manage computation of embeddings for the extracted content, and also optionally manage storing into a vector database [Milvus](https://milvus.io/).

> [!Note]
@@ -37,11 +40,11 @@ NV-Ingest is a microservice service that does the following:

NV-Ingest supports the following file types:

-- `pdf`
 - `docx`
-- `pptx`
 - `jpeg`
+- `pdf`
 - `png`
+- `pptx`
 - `svg`
 - `tiff`
 - `txt`
@@ -84,7 +87,7 @@ To get started using NVIDIA Ingest, you need to do a few things:
4. [Inspect and consume results](#step-4-inspecting-and-consuming-results) 🔍

Optional:
-1. [Direct Library Deployment](docs/docs/user-guide/developer-guide/deployment.md) 📦
+1. [Direct Library Deployment](docs/docs/user-guide/deployment.md) 📦

### Step 1: Starting containers

@@ -93,14 +96,14 @@ This example demonstrates how to use the provided [docker-compose.yaml](docker-c
> [!IMPORTANT]
> NIM containers on their first startup can take 10-15 minutes to pull and fully load models.

-If you prefer, you can also [start services one by one](docs/docs/user-guide/developer-guide/deployment.md), or run on Kubernetes via [our Helm chart](helm/README.md). Also of note are [additional environment variables](docs/docs/user-guide/developer-guide/environment-config.md) you may wish to configure.
+If you prefer, you can also [start services one by one](docs/docs/user-guide/deployment.md), or run on Kubernetes via [our Helm chart](helm/README.md). Also of note are [additional environment variables](docs/docs/user-guide/environment-config.md) you may wish to configure.

1. Git clone the repo:
`git clone https://github.com/nvidia/nv-ingest`
2. Change directory to the cloned repo
`cd nv-ingest`.

-3. [Generate API keys](docs/docs/user-guide/developer-guide/ngc-api-key.md) and authenticate with NGC with the `docker login` command:
+3. [Generate API keys](docs/docs/user-guide/ngc-api-key.md) and authenticate with NGC with the `docker login` command:
```shell
# This is required to access pre-built containers and NIM microservices
$ docker login nvcr.io
@@ -112,7 +115,7 @@ Password: <Your Key>
> During the early access (EA) phase, you must apply for early access here: https://developer.nvidia.com/nemo-microservices-early-access/join.
> When your early access is approved, follow the instructions in the email to create an organization and team, link your profile, and generate your NGC API key.

-4. Create a .env file that contains your NGC API keys. For more information, refer to [Environment Configuration Variables](docs/docs/user-guide/developer-guide/environment-config.md).
+4. Create a .env file that contains your NGC API keys. For more information, refer to [Environment Configuration Variables](docs/docs/user-guide/environment-config.md).

```
# Container images must access resources from NGC.
@@ -132,7 +135,7 @@ NVIDIA_BUILD_API_KEY=<key to use NIMs that are hosted on build.nvidia.com>
> `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`

> [!NOTE]
-> The most accurate tokenizer based splitting depends on the [llama-3.2 tokenizer](https://huggingface.co/meta-llama/Llama-3.2-1B). To download this model at container build time, you must set `DOWNLOAD_LLAMA_TOKENIZER=True` _and_ supply an authorized HuggingFace access token via `HF_ACCESS_TOKEN=<your access token>`. If not, the ungated [e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) tokenizer model will be downloaded instead. By default, the split task will use whichever model has been predownloaded. Refer to [Environment Configuration Variables](docs/docs/user-guide/developer-guide/environment-config.md) for more info.
+> The most accurate tokenizer based splitting depends on the [llama-3.2 tokenizer](https://huggingface.co/meta-llama/Llama-3.2-1B). To download this model at container build time, you must set `DOWNLOAD_LLAMA_TOKENIZER=True` _and_ supply an authorized HuggingFace access token via `HF_ACCESS_TOKEN=<your access token>`. If not, the ungated [e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) tokenizer model will be downloaded instead. By default, the split task will use whichever model has been predownloaded. Refer to [Environment Configuration Variables](docs/docs/user-guide/environment-config.md) for more info.

5. Start all services:
`docker compose --profile retrieval up`
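
The tokenizer note in this hunk describes a two-condition fallback: the llama-3.2 tokenizer is downloaded only when `DOWNLOAD_LLAMA_TOKENIZER=True` is set and an `HF_ACCESS_TOKEN` is supplied; otherwise the ungated e5-large-unsupervised tokenizer is used. A minimal Python sketch of that decision follows. The function name `select_tokenizer` and the case-insensitive comparison are assumptions made for illustration, not code from the NV-Ingest build.

```python
import os
from typing import Mapping

# Model identifiers taken from the Hugging Face links in the README note.
LLAMA_TOKENIZER = "meta-llama/Llama-3.2-1B"
E5_TOKENIZER = "intfloat/e5-large-unsupervised"


def select_tokenizer(env: Mapping[str, str]) -> str:
    """Hypothetical helper: return the tokenizer the note says would be downloaded."""
    # Assumption: the flag is compared case-insensitively against "true".
    wants_llama = env.get("DOWNLOAD_LLAMA_TOKENIZER", "").strip().lower() == "true"
    # The gated llama model also needs a non-empty access token.
    has_token = bool(env.get("HF_ACCESS_TOKEN", "").strip())
    return LLAMA_TOKENIZER if wants_llama and has_token else E5_TOKENIZER


if __name__ == "__main__":
    # Report the choice for the current shell environment.
    print(select_tokenizer(os.environ))
```

Setting the flag without a token still falls back to the ungated model, which matches the "_and_" in the note.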
@@ -188,7 +191,7 @@ ac27e5297d57 prom/prometheus:latest
>
> After the image builds, run `docker compose --profile retrieval up` or `docker compose up --build` as explained in the previous step.

-### Step 2: Installing Python dependencies
+### Step 2: Install Python dependencies

To interact with the nv-ingest service, you can do so from the host, or by `docker exec`-ing into the nv-ingest container.

@@ -225,7 +228,7 @@ pip install .

### Step 3: Ingesting Documents

-You can submit jobs programmatically in Python or via the nv-ingest-cli tool.
+You can submit jobs programmatically in Python or via the [NV-Ingest CLI](nv-ingest_cli.md).
Collaborator

Leave as nv-ingest-cli


In the below examples, we are doing text, chart, table, and image extraction:

@@ -359,7 +362,7 @@ multimodal_test.pdf.metadata.json
processed_docs/text:
multimodal_test.pdf.metadata.json
```
-For the full metadata definitions, refer to [Content Metadata](/docs/docs/user-guide/developer-guide/content-metadata.md).
+For the full metadata definitions, refer to [Content Metadata](/docs/docs/user-guide/content-metadata.md).

#### We also provide a script for inspecting [extracted images](src/util/image_viewer.py)

4 changes: 4 additions & 0 deletions client/README.md
@@ -8,6 +8,10 @@ SPDX-License-Identifier: Apache-2.0

NV-Ingest-Client is a tool designed for efficient ingestion and processing of large datasets. It provides both a Python API and a command-line interface to cater to various ingestion needs.

> [!Note]
> NV-Ingest is also known as NVIDIA Ingest and NeMo Retriever Extraction.
Collaborator

same here - NeMo Retriever extraction


## Table of Contents

1. [Installation](#installation)
2 changes: 1 addition & 1 deletion docs/Makefile
@@ -5,7 +5,7 @@
# Define paths
SPHINX_BUILD_DIR=sphinx_docs/build
SPHINX_SOURCE_DIR=sphinx_docs/source
-SPHINX_OUTPUT_DIR=docs/user-guide/developer-guide/api-docs
+SPHINX_OUTPUT_DIR=docs/user-guide/api-docs

# Default target
.PHONY: all
2 changes: 0 additions & 2 deletions docs/docs/SUMMARY.md

This file was deleted.

4 changes: 2 additions & 2 deletions docs/docs/index.md
@@ -3,7 +3,7 @@ hide:
- navigation
---

-**NV-Ingest** is a scalable, performance-oriented document content and metadata extraction microservice. NV-Ingest uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.. You can access NV-Ingest as a free community resource or learn more about getting an enterprise license for improved expert-level support at the [NV-Ingest homepage](https://www.nvidia.com).
+NeMo Retriever Extraction (NV-Ingest) is a scalable, performance-oriented document content and metadata extraction microservice. NV-Ingest uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.. You can access NV-Ingest as a free community resource or learn more about getting an enterprise license for improved expert-level support at the [NV-Ingest homepage](https://www.nvidia.com).
Collaborator

Is the URL correct here? It says NV-Ingest homepage but takes the user to nvidia.com



<div class="grid cards" markdown>
@@ -14,6 +14,6 @@ hide:

Install NV-Ingest and set up your environment to start accelerating your workflows.

-[Get Started](user-guide){ .md-button .md-button }
+[Get Started](user-guide/overview.md){ .md-button .md-button }

</div>
6 changes: 0 additions & 6 deletions docs/docs/user-guide/SUMMARY.md

This file was deleted.

102 changes: 102 additions & 0 deletions docs/docs/user-guide/content-metadata.md
@@ -0,0 +1,102 @@
# Source and Content Metadata Reference for NV-Ingest

This documentation contains the reference for the content metadata.
The definitions used in this documentation are the following:

- **Source** — The knowledge base file from which content and metadata is extracted.
- **Content** — Data extracted from a source, such as text or an image.

Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods.


## Source Metadata

The following is the metadata for sources.

| Field | Description | Method |
|----------|----------------------------------------|----------|
| Source Name | The name of the source. | Extracted |
| Source ID | The ID of the source. | Extracted |
| Source location | The URL, URI, or pointer to the storage location of the source. | — |
| Source Type | The file type of the source, such as pdf, docx, pptx, or txt. | Extracted |
| Collection ID | The ID of the collection in which the source is contained. | — |
| Date Created | The date the source was created. | Extracted |
| Last Modified | The date the source was last modified. | Extracted |
| Partition ID | The offset of this data fragment within a larger set of fragments. | Generated |
| Access Level | The role-based access control for the source. | — |
| Summary | A summary of the source. (Not yet implemented.) | Generated |


## Content Metadata

The following is the metadata for content.
These fields apply to all content types including text, images, and tables.

| Field | Description | Method |
|----------|----------------------------------------|----------|
| Type | The type of the content. Text, Image, Structured, Table, or Chart. | Generated |
| Subtype | The type of the content for structured data types, such as table or chart. | — |
| Content | Content extracted from the source. | Extracted |
| Description | A text description of the content object. | Generated |
| Page \# | The page \# of the content in the source. | Extracted |
| Hierarchy | The location or order of the content within the source. | Extracted |


## Text Metadata

The following is the metadata for text.

| Field | Description | Method |
|----------|----------------------------------------|----------|
| Text Type | The type of the text, such as header or body. | Extracted |
| Keywords | Keywords, Named Entities, or other phrases. | Extracted |
| Language | The language of the content. | Generated |
| Summary | An abbreviated summary of the content. (Not yet implemented.) | Generated |


## Image Metadata

The following is the metadata for images.

| Field | Description | Method |
|----------|----------------------------------------|----------|
| Image Type | The type of the image, such as structured, natural, hybrid, and others. | Generated (Classifier) |
| Structured Image Type | The type of the content for structured data types, such as bar chart, pie chart, and others. | Generated (Classifier) |
| Caption | Any caption or subheading associated with Image | Extracted |
| Text | Extracted text from a structured chart | Extracted | Pending Research |
| Image location | Location (x,y) of chart within an image | Extracted |
| Image location max dimensions | Max dimensions (x\_max,y\_max) of location (x,y) | Extracted |
| uploaded\_image\_uri | Mirrors source\_metadata.source\_location | — |


## Table Metadata

The following is the metadata for tables within documents.

!!! warning
    Tables should not be chunked

| Field | Description | Method |
|----------|----------------------------------------|----------|
| Table format | Structured (dataframe / lists of rows and columns), or serialized as markdown, html, latex, simple (cells separated as spaces). | Extracted |
| Table content | Extracted text content, formatted according to table\_metadata.table\_format. | Extracted |
| Table location | The bounding box of the table. | Extracted |
| Table location max dimensions | The max dimensions (x\_max,y\_max) of the bounding box of the table. | Extracted |
| Caption | The caption for the table or chart. | Extracted |
| Title | The title of the table. | Extracted |
| Subtitle | The subtitle of the table. | Extracted |
| Axis | Axis information for the table. | Extracted |
| uploaded\_image\_uri | A mirror of source\_metadata.source\_location. | Generated |


<!--
2025-01-23 NKM: Commenting out this section
I can find only the first (text) file, and it is empty
I can't find the other 2 files (images, charts and tables) at all
If we get the files, we can add this back

## Example Text Extracts for multimodal_test.pdf:
1. [text](example_processed_docs/text/multimodal_test.pdf.metadata.json)
2. [images](example_processed_docs/image/multimodal_test.pdf.metadata.json)
3. [charts and tables](example_processed_docs/structured/multimodal_test.pdf.metadata.json)
-->
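
As a rough illustration of the Source Metadata and Content Metadata tables in the new `content-metadata.md`, the sketch below models some of those fields as Python dataclasses. The class names, the snake_case attribute names, and the `validate_content_type` helper are assumptions made for illustration; they are not taken from the NV-Ingest schema, and only the field meanings come from the tables above.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Content types listed in the "Type" row of the Content Metadata table.
ALLOWED_CONTENT_TYPES = {"text", "image", "structured", "table", "chart"}


@dataclass
class SourceMetadata:
    """Hypothetical container for a subset of the Source Metadata fields."""
    source_name: str
    source_id: str
    source_type: str                       # for example pdf, docx, pptx, or txt
    source_location: Optional[str] = None  # URL, URI, or storage pointer
    collection_id: Optional[str] = None
    date_created: Optional[str] = None
    last_modified: Optional[str] = None
    partition_id: Optional[int] = None
    access_level: Optional[str] = None
    summary: Optional[str] = None          # documented as not yet implemented


@dataclass
class ContentMetadata:
    """Hypothetical container for the fields shared by all content types."""
    content_type: str
    content: str
    subtype: Optional[str] = None
    description: Optional[str] = None
    page_number: Optional[int] = None
    hierarchy: Optional[dict] = None


def validate_content_type(meta: ContentMetadata) -> None:
    """Raise ValueError if the type is not one listed in the table."""
    if meta.content_type.lower() not in ALLOWED_CONTENT_TYPES:
        raise ValueError(f"unknown content type: {meta.content_type!r}")


if __name__ == "__main__":
    source = SourceMetadata(
        source_name="multimodal_test.pdf", source_id="doc-1", source_type="pdf"
    )
    table = ContentMetadata(content_type="Table", content="| a | b |", page_number=1)
    validate_content_type(table)
    print(asdict(source))
```

Fields marked with a dash in the Method column are left optional here, since the tables do not say how they are populated.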
6 changes: 3 additions & 3 deletions docs/docs/user-guide/contributing.md
@@ -1,4 +1,4 @@
-# Contributing to NVIDIA-Ingest
+# Contributing to NV-Ingest

-External contributions to NVIDIA-Ingest will be welcome soon, and they are greatly appreciated!
-For more information, refer to [Contributing to NVIDIA-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md).
+External contributions to NV-Ingest will be welcome soon, and they are greatly appreciated!
+For more information, refer to [Contributing to NV-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md).
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
-# NV-Ingest Deployment
+# Deploy NV-Ingest

## Launch NVIDIA Microservice(s)

9 changes: 0 additions & 9 deletions docs/docs/user-guide/developer-guide/SUMMARY.md

This file was deleted.

61 changes: 0 additions & 61 deletions docs/docs/user-guide/developer-guide/content-metadata.md

This file was deleted.

Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
-# Environment Configuration Variables
+# Environment Configuration Variables for NV-Ingest

The following are the environment configuration variables that you can specify in your .env file.

Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
-# Developing with Kubernetes
+# Developing with NV-Ingest on Kubernetes

Developing directly on Kubernetes gives us more confidence that end-user deployments will work as expected.
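
Most of the README changes in this pull request move link targets from `docs/docs/user-guide/developer-guide/` to `docs/docs/user-guide/`. The sketch below shows one way such a rewrite could be applied to Markdown text in Python. It assumes that every link under the old prefix should be rewritten, which may not hold for files that did not move, and the `rewrite_links` name is hypothetical rather than part of the repository.

```python
import re

OLD_PREFIX = "docs/docs/user-guide/developer-guide/"
NEW_PREFIX = "docs/docs/user-guide/"

# Matches the target of an inline Markdown link: [text](target)
LINK_TARGET = re.compile(r"\]\(([^)\s]+)\)")


def rewrite_links(markdown: str) -> str:
    """Rewrite inline link targets that point at the old docs location."""

    def fix(match: re.Match) -> str:
        target = match.group(1)
        # Covers both relative targets and targets with a leading slash.
        if OLD_PREFIX in target:
            target = target.replace(OLD_PREFIX, NEW_PREFIX, 1)
        return f"]({target})"

    return LINK_TARGET.sub(fix, markdown)


if __name__ == "__main__":
    before = "[start services](docs/docs/user-guide/developer-guide/deployment.md)"
    print(rewrite_links(before))
```

Only link targets are touched, so prose that mentions the old directory outside a link is left unchanged.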
