Commit
Update Table of Contents, update metadata page, add name note, fix a few names and typos (#495)
Showing 29 changed files with 207 additions and 141 deletions.
@@ -0,0 +1,102 @@
# Source and Content Metadata Reference for NV-Ingest

This documentation contains the reference for the source and content metadata.
The definitions used in this documentation are the following:

- **Source**: The knowledge base file from which content and metadata are extracted.
- **Content**: Data extracted from a source, such as text or an image.

Metadata can be extracted from a source or content, or generated by using models, heuristics, or other methods.

## Source Metadata

The following is the metadata for sources.

| Field | Description | Method |
|----------|----------------------------------------|----------|
| Source Name | The name of the source. | Extracted |
| Source ID | The ID of the source. | Extracted |
| Source location | The URL, URI, or pointer to the storage location of the source. | — |
| Source Type | The file type of the source, such as pdf, docx, pptx, or txt. | Extracted |
| Collection ID | The ID of the collection in which the source is contained. | — |
| Date Created | The date the source was created. | Extracted |
| Last Modified | The date the source was last modified. | Extracted |
| Partition ID | The offset of this data fragment within a larger set of fragments. | Generated |
| Access Level | The role-based access control for the source. | — |
| Summary | A summary of the source. (Not yet implemented.) | Generated |

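As a rough illustration of how the source fields above could be carried in code, the following is a minimal sketch and not the NV-Ingest schema. The class name `SourceMetadata`, the snake_case field names, and the string representation of dates are assumptions made for this example. Only `source_location` is a name that this reference itself uses (as `source_metadata.source_location`).

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SourceMetadata:
    """Hypothetical container that mirrors the Source Metadata table."""

    source_name: str                       # Extracted
    source_id: str                         # Extracted
    source_type: str                       # Extracted, such as "pdf" or "docx"
    source_location: Optional[str] = None  # URL, URI, or storage pointer
    collection_id: Optional[str] = None
    date_created: Optional[str] = None     # Extracted; string form is assumed
    last_modified: Optional[str] = None    # Extracted; string form is assumed
    partition_id: Optional[int] = None     # Generated; offset within fragments
    access_level: Optional[str] = None
    summary: Optional[str] = None          # Generated; not yet implemented


example = SourceMetadata(
    source_name="multimodal_test.pdf",
    source_id="multimodal_test.pdf",
    source_type="pdf",
    partition_id=0,
)
record = asdict(example)  # plain dict, convenient for JSON serialization
```

Fields without a listed method in the table default to `None` here, which keeps the three extracted identifiers as the only required values.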
## Content Metadata

The following is the metadata for content.
These fields apply to all content types, including text, images, and tables.

| Field | Description | Method |
|----------|----------------------------------------|----------|
| Type | The type of the content: Text, Image, Structured, Table, or Chart. | Generated |
| Subtype | The subtype of structured content, such as table or chart. | — |
| Content | Content extracted from the source. | Extracted |
| Description | A text description of the content object. | Generated |
| Page \# | The page number of the content in the source. | Extracted |
| Hierarchy | The location or order of the content within the source. | Extracted |

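The relationship between Type and Subtype can be sketched as follows. This is an illustrative example only: the enum `ContentType`, its lowercase string values, and the helper `check_subtype` are hypothetical names, and the rule that a subtype is meaningful only for structured content is an assumption read from the Subtype description above.

```python
from enum import Enum
from typing import Optional


class ContentType(str, Enum):
    """Content types named in the table; the string values are assumed."""

    TEXT = "text"
    IMAGE = "image"
    STRUCTURED = "structured"
    TABLE = "table"
    CHART = "chart"


def check_subtype(content_type: ContentType, subtype: Optional[str]) -> bool:
    """Return True when the subtype is consistent with the content type.

    Assumption: only structured content carries a subtype.
    """
    if subtype is None:
        return True
    return content_type is ContentType.STRUCTURED
```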
## Text Metadata

The following is the metadata for text.

| Field | Description | Method |
|----------|----------------------------------------|----------|
| Text Type | The type of the text, such as header or body. | Extracted |
| Keywords | Keywords, named entities, or other phrases. | Extracted |
| Language | The language of the content. | Generated |
| Summary | An abbreviated summary of the content. (Not yet implemented.) | Generated |

## Image Metadata

The following is the metadata for images.

| Field | Description | Method |
|----------|----------------------------------------|----------|
| Image Type | The type of the image, such as structured, natural, or hybrid. | Generated (Classifier) |
| Structured Image Type | The type of the content for structured data types, such as bar chart or pie chart. | Generated (Classifier) |
| Caption | Any caption or subheading associated with the image. | Extracted |
| Text | The text extracted from a structured chart. (Pending research.) | Extracted |
| Image location | The location (x,y) of the chart within an image. | Extracted |
| Image location max dimensions | The max dimensions (x\_max,y\_max) of the location (x,y). | Extracted |
| uploaded\_image\_uri | A mirror of source\_metadata.source\_location. | — |

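One way to read the two location fields together is that the max dimensions give the extent of the coordinate space in which the location is expressed. Under that assumption, which this reference does not state explicitly, a location can be rescaled to the range 0 to 1 so that it is comparable across images of different sizes. The function name `normalize_location` is hypothetical.

```python
from typing import Tuple


def normalize_location(
    location: Tuple[float, float],
    max_dimensions: Tuple[float, float],
) -> Tuple[float, float]:
    """Scale an (x, y) location by (x_max, y_max) into the range 0 to 1.

    Assumption: max_dimensions is the extent of the coordinate space.
    """
    x, y = location
    x_max, y_max = max_dimensions
    if x_max <= 0 or y_max <= 0:
        raise ValueError("max dimensions must be positive")
    return (x / x_max, y / y_max)
```

For example, `normalize_location((50, 100), (200, 400))` returns `(0.25, 0.25)`.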
## Table Metadata

The following is the metadata for tables within documents.

!!! warning

    Tables should not be chunked.

| Field | Description | Method |
|----------|----------------------------------------|----------|
| Table format | Structured (dataframe or lists of rows and columns), or serialized as markdown, html, latex, or simple (cells separated by spaces). | Extracted |
| Table content | Extracted text content, formatted according to table\_metadata.table\_format. | Extracted |
| Table location | The bounding box of the table. | Extracted |
| Table location max dimensions | The max dimensions (x\_max,y\_max) of the bounding box of the table. | Extracted |
| Caption | The caption for the table or chart. | Extracted |
| Title | The title of the table. | Extracted |
| Subtitle | The subtitle of the table. | Extracted |
| Axis | Axis information for the table. | Extracted |
| uploaded\_image\_uri | A mirror of source\_metadata.source\_location. | Generated |

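To make the link between Table format and Table content concrete, the following sketch serializes a structured table (a list of rows) according to a format name. It is not the NV-Ingest implementation: the function `serialize_table` is hypothetical, it covers only the markdown and simple formats, and treating the first row as the header is an assumption.

```python
from typing import List


def serialize_table(rows: List[List[str]], table_format: str) -> str:
    """Serialize rows in the requested format; the first row is the header."""
    if not rows:
        return ""
    if table_format == "simple":
        # Cells separated by spaces, one row per line.
        return "\n".join(" ".join(row) for row in rows)
    if table_format == "markdown":
        header, *body = rows
        lines = ["| " + " | ".join(header) + " |"]
        lines.append("|" + "|".join(" --- " for _ in header) + "|")
        lines.extend("| " + " | ".join(row) + " |" for row in body)
        return "\n".join(lines)
    raise ValueError(f"unsupported table_format: {table_format}")


rows = [["Field", "Method"], ["Title", "Extracted"]]
simple_text = serialize_table(rows, "simple")
markdown_text = serialize_table(rows, "markdown")
```

Because the whole table is serialized in one call, the result stays a single unit, which is consistent with the warning that tables should not be chunked.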
<!--
2025-01-23 NKM: Commenting out this section
I can find only the first (text) file, and it is empty
I can't find the other 2 files (images, charts and tables) at all
If we get the files, we can add this back
## Example Text Extracts for multimodal_test.pdf:
1. [text](example_processed_docs/text/multimodal_test.pdf.metadata.json)
2. [images](example_processed_docs/image/multimodal_test.pdf.metadata.json)
3. [charts and tables](example_processed_docs/structured/multimodal_test.pdf.metadata.json)
-->
@@ -1,4 +1,4 @@
-# Contributing to NVIDIA-Ingest
+# Contributing to NV-Ingest

-External contributions to NVIDIA-Ingest will be welcome soon, and they are greatly appreciated!
-For more information, refer to [Contributing to NVIDIA-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md).
+External contributions to NV-Ingest will be welcome soon, and they are greatly appreciated!
+For more information, refer to [Contributing to NV-Ingest](https://github.com/NVIDIA/nv-ingest/blob/main/CONTRIBUTING.md).
2 changes: 1 addition & 1 deletion
.../user-guide/developer-guide/deployment.md → docs/docs/user-guide/deployment.md
@@ -1,4 +1,4 @@
-# NV-Ingest Deployment
+# Deploy NV-Ingest

 ## Launch NVIDIA Microservice(s)
2 changes: 1 addition & 1 deletion
...ide/developer-guide/environment-config.md → docs/docs/user-guide/environment-config.md
2 changes: 1 addition & 1 deletion
...r-guide/developer-guide/kubernetes-dev.md → docs/docs/user-guide/kubernetes-dev.md