Save table prediction in cells format #2892

plutasnyy · 2024-04-15T16:59:38Z

This pull request makes it possible to return predictions from the table transformer in a raw cell representation. This will later be used to save predictions in a cells format for simpler metrics calculation.

This PR has to be merged after Unstructured-IO/unstructured-inference#335.

unstructured/partition/common.py

badGarnet · 2024-04-17T13:36:32Z

unstructured/partition/pdf_image/ocr.py

@@ -8,9 +8,11 @@
 # unstructured.documents.elements.Image
 from PIL import Image as PILImage
 from PIL import ImageSequence
+from unstructured_inference.models.tables import cells_to_html


In unstructured, we use a decorator on functions to mark that they require additional extras to run, and unstructured_inference is one of those extras. As a result, we normally don't import from unstructured_inference at module level; instead, we import it only where it is needed and decorate the function with @requires_dependencies("unstructured_inference"). You can find examples in this file.

thank you for the explanation!
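To illustrate the pattern described in the comment above, here is a minimal sketch. The `requires_dependencies` defined here is a simplified stand-in written for this example, not the implementation shipped in unstructured, and `dump_cells`, `needs_missing_extra`, and the package name `surely_not_installed_pkg_xyz` are hypothetical. The standard-library `json` module plays the role of the optional extra so the sketch runs anywhere.

```python
import functools
import importlib.util


def requires_dependencies(*dependencies):
    """Simplified stand-in for the decorator named in the review comment.

    Assumption: this is NOT the code from unstructured; it only shows the
    idea of checking for an optional package when the function is called.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Look up each top-level package without importing it.
            missing = [d for d in dependencies if importlib.util.find_spec(d) is None]
            if missing:
                raise ImportError(f"Missing optional dependencies: {', '.join(missing)}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies("json")  # stdlib module standing in for an optional extra
def dump_cells(cells):
    # The import lives inside the function, not at module level.
    import json

    return json.dumps(cells)


@requires_dependencies("surely_not_installed_pkg_xyz")  # hypothetical missing extra
def needs_missing_extra():
    import surely_not_installed_pkg_xyz  # never reached when the package is absent

    return surely_not_installed_pkg_xyz
```

With this layout, importing the module never fails because of a missing extra; the error only surfaces when a decorated function is actually called.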

unstructured/partition/pdf_image/ocr.py

badGarnet · 2024-04-17T13:38:47Z

test_unstructured/metrics/test_table_formats.py

+
+def test_simple_table_cell_parsing_from_table_transformer_when_missing_input():
+    table_transformer_cell = {"row_nums": [], "column_nums": [], "cell text": "text"}
+    with pytest.raises(ValueError):


Let's also match the expected error message pattern here to make this test more explicit.

unstructured/metrics/table/table_formats.py

badGarnet · 2024-04-19T14:55:37Z

test_unstructured/metrics/test_table_formats.py

+
+def test_simple_table_cell_parsing_from_table_transformer_when_missing_row_nums():
+    cell = {"row_nums": [], "column_nums": [1], "cell text": "text"}
+    with pytest.raises(ValueError) as exception_info:


Suggested change

with pytest.raises(ValueError) as exception_info:

with pytest.raises(ValueError, match=f'Cell {str(cell)} has missing values under "row_nums" key'):
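One caveat with the suggested `match=` argument: pytest treats it as a regular expression, and the string form of a dict contains characters such as `[` and `]` that are regex metacharacters. The sketch below uses only the `re` module to show the effect. The error message text is an assumption reconstructed from the suggestion above, not copied from the library.

```python
import re

cell = {"row_nums": [], "column_nums": [1], "cell text": "text"}

# Assumed error message, built from the wording in the suggested change.
message = f'Cell {cell} has missing values under "row_nums" key'

# Pattern exactly as suggested: the dict's repr is used as a raw regex.
raw_pattern = f'Cell {str(cell)} has missing values under "row_nums" key'

# Same pattern with the dict's repr escaped so it is matched literally.
escaped_pattern = f'Cell {re.escape(str(cell))} has missing values under "row_nums" key'

raw_matches = re.search(raw_pattern, message) is not None
escaped_matches = re.search(escaped_pattern, message) is not None
```

Here the unescaped pattern does not match its own message, because `[], 'column_nums': [1]` is parsed as a single character class, while the escaped pattern does match. Wrapping the interpolated part in `re.escape` is one way to keep such an assertion reliable.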

test_unstructured/metrics/test_table_formats.py

badGarnet · 2024-04-19T14:57:59Z

test_unstructured/metrics/test_table_formats.py

+def test_simple_table_cell_parsing_from_table_transformer_when_expected_input():
+    table_transformer_cell = {"row_nums": [3, 2, 1], "column_nums": [6, 7], "cell text": "text"}


Let's use parametrize and add two more test cases:

- row_nums has only one value

- column_nums has only one value
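A parametrized version of the test with the two extra cases might look like the sketch below. `parse_cell` is a hypothetical helper written for this example; it mirrors the conversion discussed in this PR (x and y are the minimum column and row indices, w and h are the span sizes) and is not the function under test in the repository.

```python
import pytest


def parse_cell(cell):
    """Hypothetical helper: convert a table transformer cell to x/y/w/h form."""
    rows, cols = cell["row_nums"], cell["column_nums"]
    return {
        "x": min(cols),
        "y": min(rows),
        "w": len(cols),
        "h": len(rows),
        "content": cell.get("cell text", ""),
    }


CASES = [
    # Original case: the cell spans three rows and two columns.
    (
        {"row_nums": [3, 2, 1], "column_nums": [6, 7], "cell text": "text"},
        {"x": 6, "y": 1, "w": 2, "h": 3, "content": "text"},
    ),
    # row_nums has only one value.
    (
        {"row_nums": [2], "column_nums": [6, 7], "cell text": "text"},
        {"x": 6, "y": 2, "w": 2, "h": 1, "content": "text"},
    ),
    # column_nums has only one value.
    (
        {"row_nums": [3, 2, 1], "column_nums": [7], "cell text": "text"},
        {"x": 7, "y": 1, "w": 1, "h": 3, "content": "text"},
    ),
]


@pytest.mark.parametrize(("cell", "expected"), CASES)
def test_parse_cell_when_expected_input(cell, expected):
    assert parse_cell(cell) == expected
```

Keeping the cases in one list means each new input shape is a single added tuple rather than a new test function.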

unstructured/metrics/table/table_formats.py

badGarnet · 2024-04-19T15:00:29Z

unstructured/metrics/table/table_formats.py

+        width = len(column_nums)
+        height = len(row_nums)
+
+        return cls(x=x, y=y, w=width, h=height, content=tatr_table_cell.get("cell text", ""))


We can combine lines 36-42 into one line.

Suggested change

width = len(column_nums)

height = len(row_nums)

return cls(x=x, y=y, w=width, h=height, content=tatr_table_cell.get("cell text", ""))

width = len(column_nums)

height = len(row_nums)

return cls(x=min(column_nums), y=min(row_nums), w=len(column_nums), h=len(row_nums), content=tatr_table_cell.get("cell text", ""))

Great idea, done
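Put together, the simplified conversion from this thread could be sketched as follows. The class name `SimpleTableCell` comes from a commit message in this PR and the field names from the diff; the method name `from_table_transformer_cell` and the wording of the "column_nums" error are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SimpleTableCell:
    x: int  # index of the leftmost column the cell occupies
    y: int  # index of the topmost row the cell occupies
    w: int  # number of columns spanned
    h: int  # number of rows spanned
    content: str = ""

    @classmethod
    def from_table_transformer_cell(cls, tatr_table_cell):
        # Method name is assumed; the source only shows the method body.
        row_nums = tatr_table_cell.get("row_nums", [])
        column_nums = tatr_table_cell.get("column_nums", [])
        if not row_nums:
            raise ValueError(
                f'Cell {tatr_table_cell} has missing values under "row_nums" key'
            )
        if not column_nums:
            # Message assumed by analogy with the "row_nums" one.
            raise ValueError(
                f'Cell {tatr_table_cell} has missing values under "column_nums" key'
            )
        return cls(
            x=min(column_nums),
            y=min(row_nums),
            w=len(column_nums),
            h=len(row_nums),
            content=tatr_table_cell.get("cell text", ""),
        )
```

For example, a cell with `row_nums` of `[3, 2, 1]` and `column_nums` of `[6, 7]` becomes a cell anchored at x=6, y=1 that is 2 columns wide and 3 rows high.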

badGarnet · 2024-04-19T15:00:54Z

unstructured/metrics/table/table_formats.py

+    x: int
+    y: int
+    w: int
+    h: int


Let's use full names here to make the code read more clearly.

I used 'w' and 'h' intentionally to match the deckard format as it is currently used in mini-holistic. That way, for example, the same code can be used during evaluation.

badGarnet

please fix changelog; other than that it looks good

This pull request adds metrics that are calculated from table_as_cells instead of text_as_html. This change is required for comprehensive metrics calculation, because previously every predicted colspan or rowspan was counted as an incorrect prediction, even when it was correct. This change has to be merged after #2892, which introduces the table_as_cells field.

Save table prediction in cells format

0a47187

plutasnyy requested a review from badGarnet April 15, 2024 16:59

plutasnyy self-assigned this Apr 15, 2024

Update changelog

c73714b

plutasnyy temporarily deployed to ci April 15, 2024 17:08 — with GitHub Actions Inactive

plutasnyy mentioned this pull request Apr 17, 2024

Add calculation of table related metrics based on table_as_cells #2898

Merged

badGarnet reviewed Apr 17, 2024

unstructured/partition/common.py (outdated)

badGarnet reviewed Apr 17, 2024

unstructured/partition/pdf_image/ocr.py (outdated)

badGarnet reviewed Apr 17, 2024

unstructured/metrics/table/table_formats.py (outdated)

plutasnyy added 2 commits April 18, 2024 12:35

Add unstructured_infernece dependency decorator, remove more_itertools dependency

8b71376

Merge remote-tracking branch 'origin/main' into feat/table-as-cells

038dd60

plutasnyy temporarily deployed to ci April 18, 2024 11:06 — with GitHub Actions Inactive

badGarnet reviewed Apr 19, 2024

test_unstructured/metrics/test_table_formats.py (outdated)

badGarnet reviewed Apr 19, 2024

unstructured/metrics/table/table_formats.py (outdated)

badGarnet reviewed Apr 19, 2024

plutasnyy added 4 commits April 22, 2024 12:50

Merge remote-tracking branch 'origin/main' into feat/table-as-cells

b78714b

Simplify SimpleTableCell, add more unit tests

add0304

Rever variable names

6b372db

Bump version

3db1a00

plutasnyy temporarily deployed to ci April 22, 2024 11:21 — with GitHub Actions Inactive

plutasnyy added 2 commits April 22, 2024 16:11

Update CHANGELOG.md

91acbb8

Update CHANGELOG.md

4877f79

plutasnyy temporarily deployed to ci April 22, 2024 17:30 — with GitHub Actions Inactive

plutasnyy added 2 commits April 23, 2024 12:00

Merge remote-tracking branch 'origin/main' into feat/table-as-cells

9ac010a

Python 3.9 support

d47865a

plutasnyy temporarily deployed to ci April 23, 2024 10:17 — with GitHub Actions Inactive

Fix matching pattern

1b5beb1

plutasnyy temporarily deployed to ci April 23, 2024 11:22 — with GitHub Actions Inactive

plutasnyy marked this pull request as ready for review April 23, 2024 11:42

plutasnyy requested a review from badGarnet April 23, 2024 11:42

Add consolidation strategy

20aae6b

plutasnyy temporarily deployed to ci April 23, 2024 12:08 — with GitHub Actions Inactive

badGarnet reviewed Apr 24, 2024

Update CHANGELOG.md

56a97f2

badGarnet approved these changes Apr 24, 2024

plutasnyy temporarily deployed to ci April 24, 2024 15:53 — with GitHub Actions Inactive

plutasnyy force-pushed the feat/table-as-cells branch from 3da2937 to 56a97f2 Compare April 25, 2024 10:35

Merge remote-tracking branch 'origin/main' into feat/table-as-cells

045dfa7

plutasnyy temporarily deployed to ci April 25, 2024 10:38 — with GitHub Actions Inactive

Fix tests

6f9cbd3

plutasnyy temporarily deployed to ci April 25, 2024 10:40 — with GitHub Actions Inactive

plutasnyy added this pull request to the merge queue Apr 25, 2024

Merged via the queue into main with commit df1f7bc Apr 25, 2024
42 checks passed

plutasnyy deleted the feat/table-as-cells branch April 25, 2024 11:54
