-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Save table prediction in cells format #2892
Conversation
@@ -8,9 +8,11 @@ | |||
# unstructured.documents.elements.Image | |||
from PIL import Image as PILImage | |||
from PIL import ImageSequence | |||
from unstructured_inference.models.tables import cells_to_html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in unstructured
we use a decorator on functions to note if this function requires additional extras to run. And unstructured_inference
is one of those extras. As a result we don't import from unstructured_inference
at module level normally but import only where it is needed and decorate the function with @requires_dependencies("unstructured_inference")
. You can find examples in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for the explanation!
|
||
def test_simple_table_cell_parsing_from_table_transformer_when_missing_input(): | ||
table_transformer_cell = {"row_nums": [], "column_nums": [], "cell text": "text"} | ||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add a expected error message pattern matching here as well to make this test more explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
|
||
def test_simple_table_cell_parsing_from_table_transformer_when_missing_row_nums(): | ||
cell = {"row_nums": [], "column_nums": [1], "cell text": "text"} | ||
with pytest.raises(ValueError) as exception_info: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with pytest.raises(ValueError) as exception_info: | |
with pytest.raises(ValueError, match=f'Cell {str(cell)} has missing values under "row_nums" key'): |
def test_simple_table_cell_parsing_from_table_transformer_when_expected_input(): | ||
table_transformer_cell = {"row_nums": [3, 2, 1], "column_nums": [6, 7], "cell text": "text"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets use parametrize and add two more test cases:
- row nums has only 1 value
- column nums has only 1 value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
width = len(column_nums) | ||
height = len(row_nums) | ||
|
||
return cls(x=x, y=y, w=width, h=height, content=tatr_table_cell.get("cell text", "")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can combine 36-42 as one line
width = len(column_nums) | |
height = len(row_nums) | |
return cls(x=x, y=y, w=width, h=height, content=tatr_table_cell.get("cell text", "")) | |
width = len(column_nums) | |
height = len(row_nums) | |
return cls(x=min(column_nums), y=min(row_nums), w=len(column_nums), h=len(row_nums), content=tatr_table_cell.get("cell text", "")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, done
x: int | ||
y: int | ||
w: int | ||
h: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets use full names here to make the code read more clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have used 'w' and 'h' intentially to match deckard format as is currently used in mini-holistic. Then, for example, same code can be used during evaluation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix changelog; other than that it looks good
3da2937
to
56a97f2
Compare
This pull request add metrics that are calculated based on table_as_cells instead of text_as_html. This change is required for comprehensive metrics calculation, as previously every colspan or rowspan predicted was considered to be an incorrect predicted (even if it was correct prediction) This change has to be merged after #2892 which introduces table_as_cells field
This pull request allows to return predictions in raw cell representation from table transformer. It will be later used to save prediction in a cells format for simpler metrics calculation.
This PR has to be merged, after Unstructured-IO/unstructured-inference#335