From 73fa16edd8c1afa2f07580beb9cf1774907cf34b Mon Sep 17 00:00:00 2001
From: Nicole McAllister
Date: Wed, 26 Feb 2025 17:27:03 -0800
Subject: [PATCH 1/3] Update file suport list order

---
 README.md                     | 4 ++--
 docs/docs/user-guide/index.md | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index caec04bc..6c941c67 100644
--- a/README.md
+++ b/README.md
@@ -37,11 +37,11 @@ NV-Ingest is a microservice service that does the following:
 
 NV-Ingest supports the following file types:
 
+- `pdf`
 - `docx`
+- `pptx`
 - `jpeg`
-- `pdf`
 - `png`
-- `pptx`
 - `svg`
 - `tiff`
 - `txt`
diff --git a/docs/docs/user-guide/index.md b/docs/docs/user-guide/index.md
index ce48a962..4e113898 100644
--- a/docs/docs/user-guide/index.md
+++ b/docs/docs/user-guide/index.md
@@ -25,11 +25,11 @@ NV-Ingest is a microservice service that does the following:
 
 NV-Ingest supports the following file types:
 
+- `pdf`
 - `docx`
+- `pptx`
 - `jpeg`
-- `pdf`
 - `png`
-- `pptx`
 - `svg`
 - `tiff`
 - `txt`

From e91ce19c61081b7bcf264a2227107e57747f7844 Mon Sep 17 00:00:00 2001
From: Nicole McAllister
Date: Wed, 26 Feb 2025 19:03:10 -0800
Subject: [PATCH 2/3] Update Table of Contents

---
 README.md                                     | 12 +++++------
 docs/Makefile                                 |  2 +-
 docs/docs/SUMMARY.md                          |  2 --
 .../text/multimodal_test.pdf.metadata.json    |  0
 docs/docs/index.md                            |  2 +-
 docs/docs/user-guide/SUMMARY.md               |  6 ------
 .../{developer-guide => }/content-metadata.md |  0
 .../{developer-guide => }/deployment.md       |  0
 .../user-guide/developer-guide/SUMMARY.md     |  9 --------
 .../environment-config.md                     |  0
 .../{developer-guide => }/kubernetes-dev.md   |  0
 .../{developer-guide => }/ngc-api-key.md      |  0
 .../{developer-guide => }/nv-ingest_cli.md    |  0
 .../docs/user-guide/{index.md => overview.md} |  0
 docs/docs/user-guide/quickstart-guide.md      |  8 +++----
 .../{developer-guide => }/telemetry.md        |  0
 docs/mkdocs.yml                               | 21 +++++++++++++++++++
 skaffold/README.md                            |  2 +-
 18 files changed, 34 insertions(+), 30 deletions(-)
 delete mode 100644
docs/docs/SUMMARY.md rename docs/docs/{user-guide/developer-guide => }/example_processed_docs/text/multimodal_test.pdf.metadata.json (100%) delete mode 100644 docs/docs/user-guide/SUMMARY.md rename docs/docs/user-guide/{developer-guide => }/content-metadata.md (100%) rename docs/docs/user-guide/{developer-guide => }/deployment.md (100%) delete mode 100644 docs/docs/user-guide/developer-guide/SUMMARY.md rename docs/docs/user-guide/{developer-guide => }/environment-config.md (100%) rename docs/docs/user-guide/{developer-guide => }/kubernetes-dev.md (100%) rename docs/docs/user-guide/{developer-guide => }/ngc-api-key.md (100%) rename docs/docs/user-guide/{developer-guide => }/nv-ingest_cli.md (100%) rename docs/docs/user-guide/{index.md => overview.md} (100%) rename docs/docs/user-guide/{developer-guide => }/telemetry.md (100%) diff --git a/README.md b/README.md index 6c941c67..3416e851 100644 --- a/README.md +++ b/README.md @@ -84,7 +84,7 @@ To get started using NVIDIA Ingest, you need to do a few things: 4. [Inspect and consume results](#step-4-inspecting-and-consuming-results) 🔍 Optional: -1. [Direct Library Deployment](docs/docs/user-guide/developer-guide/deployment.md) 📦 +1. [Direct Library Deployment](docs/docs/user-guide/deployment.md) 📦 ### Step 1: Starting containers @@ -93,14 +93,14 @@ This example demonstrates how to use the provided [docker-compose.yaml](docker-c > [!IMPORTANT] > NIM containers on their first startup can take 10-15 minutes to pull and fully load models. -If you prefer, you can also [start services one by one](docs/docs/user-guide/developer-guide/deployment.md), or run on Kubernetes via [our Helm chart](helm/README.md). Also of note are [additional environment variables](docs/docs/user-guide/developer-guide/environment-config.md) you may wish to configure. +If you prefer, you can also [start services one by one](docs/docs/user-guide/deployment.md), or run on Kubernetes via [our Helm chart](helm/README.md). 
Also of note are [additional environment variables](docs/docs/user-guide/environment-config.md) you may wish to configure. 1. Git clone the repo: `git clone https://github.com/nvidia/nv-ingest` 2. Change directory to the cloned repo `cd nv-ingest`. -3. [Generate API keys](docs/docs/user-guide/developer-guide/ngc-api-key.md) and authenticate with NGC with the `docker login` command: +3. [Generate API keys](docs/docs/user-guide/ngc-api-key.md) and authenticate with NGC with the `docker login` command: ```shell # This is required to access pre-built containers and NIM microservices $ docker login nvcr.io @@ -112,7 +112,7 @@ Password: > During the early access (EA) phase, you must apply for early access here: https://developer.nvidia.com/nemo-microservices-early-access/join. > When your early access is approved, follow the instructions in the email to create an organization and team, link your profile, and generate your NGC API key. -4. Create a .env file that contains your NGC API keys. For more information, refer to [Environment Configuration Variables](docs/docs/user-guide/developer-guide/environment-config.md). +4. Create a .env file that contains your NGC API keys. For more information, refer to [Environment Configuration Variables](docs/docs/user-guide/environment-config.md). ``` # Container images must access resources from NGC. @@ -132,7 +132,7 @@ NVIDIA_BUILD_API_KEY= > `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default` > [!NOTE] -> The most accurate tokenizer based splitting depends on the [llama-3.2 tokenizer](https://huggingface.co/meta-llama/Llama-3.2-1B). To download this model at container build time, you must set `DOWNLOAD_LLAMA_TOKENIZER=True` _and_ supply an authorized HuggingFace access token via `HF_ACCESS_TOKEN=`. If not, the ungated [e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) tokenizer model will be downloaded instead. By default, the split task will use whichever model has been predownloaded. 
Refer to [Environment Configuration Variables](docs/docs/user-guide/developer-guide/environment-config.md) for more info. +> The most accurate tokenizer based splitting depends on the [llama-3.2 tokenizer](https://huggingface.co/meta-llama/Llama-3.2-1B). To download this model at container build time, you must set `DOWNLOAD_LLAMA_TOKENIZER=True` _and_ supply an authorized HuggingFace access token via `HF_ACCESS_TOKEN=`. If not, the ungated [e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) tokenizer model will be downloaded instead. By default, the split task will use whichever model has been predownloaded. Refer to [Environment Configuration Variables](docs/docs/user-guide/environment-config.md) for more info. 5. Start all services: `docker compose --profile retrieval up` @@ -359,7 +359,7 @@ multimodal_test.pdf.metadata.json processed_docs/text: multimodal_test.pdf.metadata.json ``` -For the full metadata definitions, refer to [Content Metadata](/docs/docs/user-guide/developer-guide/content-metadata.md). +For the full metadata definitions, refer to [Content Metadata](/docs/docs/user-guide/content-metadata.md). 
#### We also provide a script for inspecting [extracted images](src/util/image_viewer.py) diff --git a/docs/Makefile b/docs/Makefile index fb5dd63c..46ebf474 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -5,7 +5,7 @@ # Define paths SPHINX_BUILD_DIR=sphinx_docs/build SPHINX_SOURCE_DIR=sphinx_docs/source -SPHINX_OUTPUT_DIR=docs/user-guide/developer-guide/api-docs +SPHINX_OUTPUT_DIR=docs/user-guide/api-docs # Default target .PHONY: all diff --git a/docs/docs/SUMMARY.md b/docs/docs/SUMMARY.md deleted file mode 100644 index 2689c887..00000000 --- a/docs/docs/SUMMARY.md +++ /dev/null @@ -1,2 +0,0 @@ -- [Home](index.md) -- [User Guide](user-guide/) diff --git a/docs/docs/user-guide/developer-guide/example_processed_docs/text/multimodal_test.pdf.metadata.json b/docs/docs/example_processed_docs/text/multimodal_test.pdf.metadata.json similarity index 100% rename from docs/docs/user-guide/developer-guide/example_processed_docs/text/multimodal_test.pdf.metadata.json rename to docs/docs/example_processed_docs/text/multimodal_test.pdf.metadata.json diff --git a/docs/docs/index.md b/docs/docs/index.md index 9c080e3e..e9b21fb9 100644 --- a/docs/docs/index.md +++ b/docs/docs/index.md @@ -14,6 +14,6 @@ hide: Install NV-Ingest and set up your environment to start accelerating your workflows. 
- [Get Started](user-guide){ .md-button .md-button } + [Get Started](user-guide/overview.md){ .md-button .md-button } diff --git a/docs/docs/user-guide/SUMMARY.md b/docs/docs/user-guide/SUMMARY.md deleted file mode 100644 index d910214b..00000000 --- a/docs/docs/user-guide/SUMMARY.md +++ /dev/null @@ -1,6 +0,0 @@ -- [What is NVIDIA Ingest?](index.md) -- [Prerequisites](prerequisites.md) -- [Quickstart](quickstart-guide.md) -- [Developer Guide](developer-guide/) -- [Contributing](contributing.md) -- [Release Notes](releasenotes-nv-ingest.md) diff --git a/docs/docs/user-guide/developer-guide/content-metadata.md b/docs/docs/user-guide/content-metadata.md similarity index 100% rename from docs/docs/user-guide/developer-guide/content-metadata.md rename to docs/docs/user-guide/content-metadata.md diff --git a/docs/docs/user-guide/developer-guide/deployment.md b/docs/docs/user-guide/deployment.md similarity index 100% rename from docs/docs/user-guide/developer-guide/deployment.md rename to docs/docs/user-guide/deployment.md diff --git a/docs/docs/user-guide/developer-guide/SUMMARY.md b/docs/docs/user-guide/developer-guide/SUMMARY.md deleted file mode 100644 index d238e7f2..00000000 --- a/docs/docs/user-guide/developer-guide/SUMMARY.md +++ /dev/null @@ -1,9 +0,0 @@ -- [Authenticating Local Docker with NGC](ngc-api-key.md) -- [Content Metadata](content-metadata.md) -- [NV-Ingest Deployment](deployment.md) -- [Environment Configuration Variables](environment-config.md) -- [Developing with Kubernetes](kubernetes-dev.md) -- [NV-Ingest Command Line (CLI)](nv-ingest_cli.md) -- [API Reference](api-docs) -- [Telemetry](telemetry.md) -- [Environment Configuration Variables](environment-config.md) diff --git a/docs/docs/user-guide/developer-guide/environment-config.md b/docs/docs/user-guide/environment-config.md similarity index 100% rename from docs/docs/user-guide/developer-guide/environment-config.md rename to docs/docs/user-guide/environment-config.md diff --git 
a/docs/docs/user-guide/developer-guide/kubernetes-dev.md b/docs/docs/user-guide/kubernetes-dev.md similarity index 100% rename from docs/docs/user-guide/developer-guide/kubernetes-dev.md rename to docs/docs/user-guide/kubernetes-dev.md diff --git a/docs/docs/user-guide/developer-guide/ngc-api-key.md b/docs/docs/user-guide/ngc-api-key.md similarity index 100% rename from docs/docs/user-guide/developer-guide/ngc-api-key.md rename to docs/docs/user-guide/ngc-api-key.md diff --git a/docs/docs/user-guide/developer-guide/nv-ingest_cli.md b/docs/docs/user-guide/nv-ingest_cli.md similarity index 100% rename from docs/docs/user-guide/developer-guide/nv-ingest_cli.md rename to docs/docs/user-guide/nv-ingest_cli.md diff --git a/docs/docs/user-guide/index.md b/docs/docs/user-guide/overview.md similarity index 100% rename from docs/docs/user-guide/index.md rename to docs/docs/user-guide/overview.md diff --git a/docs/docs/user-guide/quickstart-guide.md b/docs/docs/user-guide/quickstart-guide.md index 54b1ca9a..78bcfb70 100644 --- a/docs/docs/user-guide/quickstart-guide.md +++ b/docs/docs/user-guide/quickstart-guide.md @@ -16,7 +16,7 @@ This example demonstrates how to use the provided [docker-compose.yaml](https:// NIM containers on their first startup can take 10-15 minutes to pull and fully load models. -If you prefer, you can also [start services one by one](developer-guide/deployment.md) or run on Kubernetes by using [our Helm chart](https://github.com/NVIDIA/nv-ingest/blob/main/helm/README.md). Also, there are [additional environment variables](developer-guide/environment-config.md) you want to configure. +If you prefer, you can also [start services one by one](deployment.md) or run on Kubernetes by using [our Helm chart](https://github.com/NVIDIA/nv-ingest/blob/main/helm/README.md). Also, there are [additional environment variables](environment-config.md) you want to configure. 1. 
Git clone the repo: @@ -26,7 +26,7 @@ If you prefer, you can also [start services one by one](developer-guide/deployme `cd nv-ingest`. -3. [Generate API keys](developer-guide/ngc-api-key.md) and authenticate with NGC with the `docker login` command: +3. [Generate API keys](ngc-api-key.md) and authenticate with NGC with the `docker login` command: ```shell # This is required to access pre-built containers and NIM microservices @@ -39,7 +39,7 @@ If you prefer, you can also [start services one by one](developer-guide/deployme During the early access (EA) phase, you must apply for early access at [https://developer.nvidia.com/nemo-microservices-early-access/join](https://developer.nvidia.com/nemo-microservices-early-access/join). When your early access is approved, follow the instructions in the email to create an organization and team, link your profile, and generate your NGC API key. -4. Create a .env file containing your NGC API key and the following paths. For more information, refer to [Environment Configuration Variables](developer-guide/environment-config.md). +4. Create a .env file containing your NGC API key and the following paths. For more information, refer to [Environment Configuration Variables](environment-config.md). ``` # Container images must access resources from NGC. @@ -266,7 +266,7 @@ processed_docs/text: multimodal_test.pdf.metadata.json ``` -For the full metadata definitions, refer to [Content Metadata](developer-guide/content-metadata.md). +For the full metadata definitions, refer to [Content Metadata](content-metadata.md). We also provide a script for inspecting [extracted images](https://github.com/NVIDIA/nv-ingest/blob/main/src/util/image_viewer.py). 
diff --git a/docs/docs/user-guide/developer-guide/telemetry.md b/docs/docs/user-guide/telemetry.md similarity index 100% rename from docs/docs/user-guide/developer-guide/telemetry.md rename to docs/docs/user-guide/telemetry.md diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 533e1f39..76d3c597 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -15,6 +15,7 @@ theme: - navigation.instant.prefetch - navigation.top - navigation.footer + - navigation.expand - search.suggest - search.highlight - content.code.copy @@ -47,6 +48,26 @@ extra_css: - assets/css/custom-material.css - assets/css/jupyter-themes.css +nav: + - Home: index.md + - User Guide: + - What is NVIDIA Ingest?: user-guide/overview.md + - Release Notes: user-guide/releasenotes-nv-ingest.md + - Get Started: + - Prerequisites: user-guide/prerequisites.md + - Generate Your NGC Keys: user-guide/ngc-api-key.md + - Quickstart: user-guide/quickstart-guide.md + - Developer Guide: + - NV-Ingest Deployment: user-guide/deployment.md + - Developing with Kubernetes: user-guide/kubernetes-dev.md + - Telemetry: user-guide/telemetry.md + - Contribute: user-guide/contributing.md + - Reference: + - Content Metadata: user-guide/content-metadata.md + - Environment Variables: user-guide/environment-config.md + - NV-Ingest CLI: user-guide/nv-ingest_cli.md + - API Reference: user-guide/api-docs.md + plugins: - search # uncomment below to grab api diff --git a/skaffold/README.md b/skaffold/README.md index 02d57ecd..772b3d1f 100644 --- a/skaffold/README.md +++ b/skaffold/README.md @@ -4,4 +4,4 @@ Skaffold is intended to support the nv-ingest development team with kubernetes d We offer k8s support via Helm and those instructions can be found at [Helm Documentation](../helm/README.md). -For developers further documentation for using Skaffold can be found at [Skaffold Documentation](/docs/docs/user-guide/developer-guide/kubernetes-dev.md). 
+For developers further documentation for using Skaffold can be found at [Skaffold Documentation](/docs/docs/user-guide/kubernetes-dev.md). From 4fda35972d3a53957eac72b1a81b69b0b1d72be7 Mon Sep 17 00:00:00 2001 From: Nicole McAllister Date: Wed, 26 Feb 2025 20:23:46 -0800 Subject: [PATCH 3/3] Update content metadata page, more TOC adjustments, add name note and fix a few names --- .devcontainer/README.md | 3 + CONTRIBUTING.md | 2 +- README.md | 7 +- client/README.md | 4 + docs/docs/index.md | 2 +- docs/docs/user-guide/content-metadata.md | 137 ++++++++++++------ docs/docs/user-guide/contributing.md | 6 +- docs/docs/user-guide/deployment.md | 2 +- docs/docs/user-guide/environment-config.md | 2 +- docs/docs/user-guide/kubernetes-dev.md | 2 +- docs/docs/user-guide/ngc-api-key.md | 4 +- docs/docs/user-guide/nv-ingest_cli.md | 5 +- docs/docs/user-guide/overview.md | 24 +-- docs/docs/user-guide/prerequisites.md | 6 +- docs/docs/user-guide/quickstart-guide.md | 8 +- .../docs/user-guide/releasenotes-nv-ingest.md | 2 +- docs/docs/user-guide/telemetry.md | 6 +- docs/mkdocs.yml | 12 +- examples/langchain_multimodal_rag.ipynb | 2 +- examples/llama_index_multimodal_rag.ipynb | 2 +- examples/store_and_display_images.ipynb | 2 +- helm/README.md | 9 +- skaffold/README.md | 9 +- 23 files changed, 160 insertions(+), 98 deletions(-) diff --git a/.devcontainer/README.md b/.devcontainer/README.md index 145fe821..c9116312 100644 --- a/.devcontainer/README.md +++ b/.devcontainer/README.md @@ -19,6 +19,9 @@ limitations under the License. The nv-ingest devcontainer is provided as a quick-to-set-up development and exploration environment for use with [Visual Studio Code](https://code.visualstudio.com) (Code). The devcontainer is a lightweight container which mounts-in a Conda environment with cached packages, alleviating long Conda download times on subsequent launches. 
It provides a simple framework for adding developer-centric [scripts](#development-scripts), and incorporates some helpful Code plugins. +> [!Note] +> NV-Ingest is also known as NVIDIA Ingest and NeMo Retriever Extraction. + More information about devcontainers can be found at [`containers.dev`](https://containers.dev/). ## Getting Started diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 7d28bd94..e2f84477 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -80,7 +80,7 @@ git submodule update --init --recursive ** [Create a pull request](https://github.com/NVIDIA/nv-ingest/pulls) once your code is ready. 5. **Code Review:** Wait for the review by other developers and make necessary updates. -6. **Merge:** Once approved, an NV-Ingest developer will approve your pull request. +6. **Merge:** After approval, an NVIDIA developer will approve your pull request. ### Seasoned Developers diff --git a/README.md b/README.md index 3416e851..71ebd0ae 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,9 @@ SPDX-License-Identifier: Apache-2.0 NVIDIA-Ingest is a scalable, performance-oriented document content and metadata extraction microservice. Including support for parsing PDFs, Word and PowerPoint documents, it uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications. +> [!Note] +> NVIDIA Ingest is also known as NV-Ingest and NeMo Retriever Extraction. + NVIDIA Ingest enables parallelization of the process of splitting documents into pages where contents are classified (as tables, charts, images, text), extracted into discrete content, and further contextualized via optical character recognition (OCR) into a well defined JSON schema. From there, NVIDIA Ingest can optionally manage computation of embeddings for the extracted content, and also optionally manage storing into a vector database [Milvus](https://milvus.io/). 
> [!Note] @@ -188,7 +191,7 @@ ac27e5297d57 prom/prometheus:latest > > After the image builds, run `docker compose --profile retrieval up` or `docker compose up --build` as explained in the previous step. -### Step 2: Installing Python dependencies +### Step 2: Install Python dependencies To interact with the nv-ingest service, you can do so from the host, or by `docker exec`-ing into the nv-ingest container. @@ -225,7 +228,7 @@ pip install . ### Step 3: Ingesting Documents -You can submit jobs programmatically in Python or via the nv-ingest-cli tool. +You can submit jobs programmatically in Python or via the [NV-Ingest CLI](nv-ingest_cli.md). In the below examples, we are doing text, chart, table, and image extraction: diff --git a/client/README.md b/client/README.md index adfb6627..0e24921a 100644 --- a/client/README.md +++ b/client/README.md @@ -8,6 +8,10 @@ SPDX-License-Identifier: Apache-2.0 NV-Ingest-Client is a tool designed for efficient ingestion and processing of large datasets. It provides both a Python API and a command-line interface to cater to various ingestion needs. +> [!Note] +> NV-Ingest is also known as NVIDIA Ingest and NeMo Retriever Extraction. + + ## Table of Contents 1. [Installation](#installation) diff --git a/docs/docs/index.md b/docs/docs/index.md index e9b21fb9..648be762 100644 --- a/docs/docs/index.md +++ b/docs/docs/index.md @@ -3,7 +3,7 @@ hide: - navigation --- -**NV-Ingest** is a scalable, performance-oriented document content and metadata extraction microservice. NV-Ingest uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.. You can access NV-Ingest as a free community resource or learn more about getting an enterprise license for improved expert-level support at the [NV-Ingest homepage](https://www.nvidia.com). 
+NeMo Retriever Extraction (NV-Ingest) is a scalable, performance-oriented document content and metadata extraction microservice. NV-Ingest uses specialized NVIDIA NIM microservices to find, contextualize, and extract text, tables, charts and images for use in downstream generative applications.. You can access NV-Ingest as a free community resource or learn more about getting an enterprise license for improved expert-level support at the [NV-Ingest homepage](https://www.nvidia.com).
diff --git a/docs/docs/user-guide/content-metadata.md b/docs/docs/user-guide/content-metadata.md
index f4f4f702..3b2f45e8 100644
--- a/docs/docs/user-guide/content-metadata.md
+++ b/docs/docs/user-guide/content-metadata.md
@@ -1,51 +1,92 @@
-# Content Metadata
-
-**Definitions**:
-
-Source: The knowledge base file from which content and metadata is extracted
-
-Content: Data extracted from a source: Text or Image
-
-Metadata: Descriptive data which can be associated with Sources, Content(Image or Text); metadata can be extracted from Source/Content, or generated using models, heuristics, etc
-
-| | Field | Description | Method |
-| ----- | :---- | :---- | :---- |
-| Content | Content | Content extracted from Source | Extracted |
-| Source Metadata | Source Name | Name of source | Extracted |
-| | Source ID | ID of source | Extracted |
-| | Source location | URL, URI, pointer to storage location | N/A |
-| | Source Type | PDF, HTML, Docx, TXT, PPTx | Extracted |
-| | Collection ID | Collection in which the source is contained | N/A |
-| | Date Created | Date source was created | Extracted | |
-| | Last Modified | Date source was last modified | Extracted | |
-| | Summary | Summarization of Source Doc (Not Yet Implemented) | Generated | Pending Research |
-| | Partition ID | Offset of this data fragment within a larger set of fragments | Generated |
-| | Access Level | Dictates RBAC | N/A |
-| Content Metadata (applicable to all content types) | Type | Text, Image, Structured, Table, Chart | Generated |
-| | Description | Text Description of the content object (Image/Table) | Generated |
-| | Page \# | Page \# where content is contained in source | Extracted |
-| | Hierarchy | Location/order of content within the source document | Extracted |
-| | Subtype | For structured data subtypes \- table, chart, etc.. | | |
-| Text Metadata | Text Type | Header, body, etc | Extracted |
-| | Summary | Abbreviated Summary of content (Not Yet Implemented) | Generated | Pending Research |
-| | Keywords | Keywords, Named Entities, or other phrases | Extracted | N |
-| | Language | | Generated | N |
-| Image Metadata | Image Type | Structured, Natural,Hybrid, etc | Generated (Classifier) | Y(needs to be developed) |
-| | Structured Image Type | Bar Chart, Pie Chart, etc | Generated (Classifier) | Y(needs to be developed) |
-| | Caption | Any caption or subheader associated with Image | Extracted |
-| | Text | Extracted text from a structured chart | Extracted | Pending Research |
-| | Image location | Location (x,y) of chart within an image | Extracted | |
-| | Image location max dimensions | Max dimensions (x\_max,y\_max) of location (x,y) | Extracted | |
-| | uploaded\_image\_uri | Mirrors source\_metadata.source\_location | | |
-| Table Metadata (tables within documents) | Table format | Structured (dataframe / lists of rows and columns), or serialized as markdown, html, latex, simple (cells separated just as spaces) | Extracted |
-| | Table content | Extracted text content, formatted according to table\_metadata.table\_format. Important: Tables should not be chunked | Extracted | |
-| | Table location | Bounding box of the table | Extracted | |
-| | Table location max dimensions | Max dimensions (x\_max,y\_max) of bounding box of the table | Extracted | |
-| | Caption | Detected captions for the table/chart | Extracted | |
-| | Title | TODO | Extracted | |
-| | Subtitle | TODO | Extracted | |
-| | Axis | TODO | Extracted | |
-| | uploaded\_image\_uri | Mirrors source\_metadata.source\_location | Generated | |
+# Source and Content Metadata Reference for NV-Ingest
+
+This documentation contains the reference for the content metadata.
+The definitions used in this documentation are the following:
+
+- **Source** — The knowledge base file from which content and metadata is extracted.
+- **Content** — Data extracted from a source, such as text or an image.
+
+Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods.
+
+
+## Source Metadata
+
+The following is the metadata for sources.
+
+| Field | Description | Method |
+|----------|----------------------------------------|----------|
+| Source Name | The name of the source. | Extracted |
+| Source ID | The ID of the source. | Extracted |
+| Source location | The URL, URI, or pointer to the storage location of the source. | — |
+| Source Type | The file type of the source, such as pdf, docx, pptx, or txt. | Extracted |
+| Collection ID | The ID of the collection in which the source is contained. | — |
+| Date Created | The date the source was created. | Extracted |
+| Last Modified | The date the source was last modified. | Extracted |
+| Partition ID | The offset of this data fragment within a larger set of fragments. | Generated |
+| Access Level | The role-based access control for the source. | — |
+| Summary | A summary of the source. (Not yet implemented.) | Generated |
+
+
+## Content Metadata
+
+The following is the metadata for content.
+These fields apply to all content types including text, images, and tables.
+
+| Field | Description | Method |
+|----------|----------------------------------------|----------|
+| Type | The type of the content. Text, Image, Structured, Table, or Chart. | Generated |
+| Subtype | The type of the content for structured data types, such as table or chart. | — |
+| Content | Content extracted from the source. | Extracted |
+| Description | A text description of the content object. | Generated |
+| Page \# | The page \# of the content in the source. | Extracted |
+| Hierarchy | The location or order of the content within the source. | Extracted |
+
+
+## Text Metadata
+
+The following is the metadata for text.
+
+| Field | Description | Method |
+|----------|----------------------------------------|----------|
+| Text Type | The type of the text, such as header or body. | Extracted |
+| Keywords | Keywords, Named Entities, or other phrases. | Extracted |
+| Language | The language of the content. | Generated |
+| Summary | An abbreviated summary of the content. (Not yet implemented.) | Generated |
+
+
+## Image Metadata
+
+The following is the metadata for images.
+
+| Field | Description | Method |
+|----------|----------------------------------------|----------|
+| Image Type | The type of the image, such as structured, natural, hybrid, and others. | Generated (Classifier) |
+| Structured Image Type | The type of the content for structured data types, such as bar chart, pie chart, and others. | Generated (Classifier) |
+| Caption | Any caption or subheading associated with Image | Extracted |
+| Text | Extracted text from a structured chart | Extracted | Pending Research |
+| Image location | Location (x,y) of chart within an image | Extracted |
+| Image location max dimensions | Max dimensions (x\_max,y\_max) of location (x,y) | Extracted |
+| uploaded\_image\_uri | Mirrors source\_metadata.source\_location | — |
+
+
+## Table Metadata
+
+The following is the metadata for tables within documents.
+
+!!! warning
+    Tables should not be chunked
+
+| Field | Description | Method |
+|----------|----------------------------------------|----------|
+| Table format | Structured (dataframe / lists of rows and columns), or serialized as markdown, html, latex, simple (cells separated as spaces). | Extracted |
+| Table content | Extracted text content, formatted according to table\_metadata.table\_format. | Extracted |
+| Table location | The bounding box of the table. | Extracted |
+| Table location max dimensions | The max dimensions (x\_max,y\_max) of the bounding box of the table. | Extracted |
+| Caption | The caption for the table or chart. | Extracted |
+| Title | The title of the table. | Extracted |
+| Subtitle | The subtitle of the table. | Extracted |
+| Axis | Axis information for the table. | Extracted |
+| uploaded\_image\_uri | A mirror of source\_metadata.source\_location. | Generated |