Fix html_as_text appearing in every element metadata #319

Merged on Feb 7, 2024 (10 commits)

mpolomdeepsense

This PR fixes the issue described here: Unstructured-IO/unstructured#2463

With this change, text_as_html is only present in the metadata of elements whose content is an HTML string (that is, it contains HTML tags).

Example output for a non-HTML element:

{
    "element_id": "4a44dc15364204a80fe80e9039455cc1",
    "metadata": {
      "coordinates": {
        "layout_height": 3301,
        "layout_width": 2550,
        "points": [
          [170, 13],
          [170, 140],
          [427, 140],
          [427, 13]
        ],
        "system": "PixelSpace"
      },
      "file_directory": "/home/ubuntu/Documents",
      "filename": "purchasing-payment-policy-10.pdf",
      "filetype": "application/pdf",
      "languages": ["eng"],
      "last_modified": "2024-02-02T11:49:38",
      "page_number": 1,
      "parent_id": "e3b0c44298fc1c149afbf4c8996fb924"
    },
    "text": "10",
    "type": "UncategorizedText"
  }

Example output for an HTML element:

{
    "element_id": "398766f59dd6b37bd38b6d612159cd3e",
    "metadata": {
      "coordinates": {
        "layout_height": 3301,
        "layout_width": 2550,
        "points": [
          [433, 2180],
          [433, 2181],
          [2290, 2181],
          [2290, 2180]
        ],
        "system": "PixelSpace"
      },
      "file_directory": "/home/ubuntu/Documents",
      "filename": "purchasing-payment-policy-10.pdf",
      "filetype": "application/pdf",
      "languages": ["eng"],
      "last_modified": "2024-02-02T11:49:38",
      "page_number": 1,
      "text_as_html": "<table><tbody><tr><td></td><td> Subject Matter Expert / Department</td><td> Contract Review Responsibility</td><td></td></tr><tbody></table>"
    },
    "text": "Subject Matter Expert / Department Contract Review Responsibility",
    "type": "Table"
  }
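
The behavior shown in the two outputs above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `looks_like_html` and `build_metadata`, and the tag-matching regex, are assumptions made for the example.

```python
import re
from typing import Any, Dict, Optional

# Hypothetical heuristic: a string counts as HTML if it contains at least
# one opening tag such as <table> or <td class="x">.
_HTML_TAG = re.compile(r"<[a-zA-Z][^>]*>")


def looks_like_html(text: Optional[str]) -> bool:
    """Return True when the string contains something that looks like an HTML tag."""
    return bool(text) and _HTML_TAG.search(text) is not None


def build_metadata(base: Dict[str, Any], text_as_html: Optional[str]) -> Dict[str, Any]:
    """Copy the base metadata and add text_as_html only for HTML strings."""
    metadata = dict(base)
    if looks_like_html(text_as_html):
        metadata["text_as_html"] = text_as_html
    return metadata
```

Under these assumptions, a plain value such as "10" leaves the metadata without a text_as_html key, while a `<table>...</table>` string is kept.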

@yuming-long yuming-long left a comment

LGTM

@@ -31,6 +31,7 @@ class LayoutElement(TextRegion):
     prob: Optional[float] = None
     image_path: Optional[str] = None
     parent: Optional[LayoutElement] = None
+    text_as_html: Optional[str] = None
@yuming-long yuming-long

Do we need this here?

@mpolomdeepsense mpolomdeepsense Feb 5, 2024

It was complaining about the missing text_as_html attribute when I added types to the format_table_elements method. Since we use that attribute, I think it should be declared on a class. I was also wondering whether it belongs in this class or in the parent TextRegion class.
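
For illustration, the option of declaring the attribute on the class might look like the sketch below. The two classes are simplified stand-ins written as dataclasses, not the real definitions; only the four field names come from the diff above, and `html_or_none` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextRegion:
    # Simplified stand-in for the parent class; the real one has more fields.
    text: Optional[str] = None


@dataclass
class LayoutElement(TextRegion):
    prob: Optional[float] = None
    image_path: Optional[str] = None
    parent: Optional["LayoutElement"] = None
    # Declared with a None default so typed code can read it on any element.
    text_as_html: Optional[str] = None


def html_or_none(element: LayoutElement) -> Optional[str]:
    """Typed access: no hasattr check is needed once the field is declared."""
    return element.text_as_html
```

The trade-off is that every element then carries the attribute (as None), so code that serializes metadata has to skip None values to avoid emitting the key everywhere.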

@yuming-long yuming-long Feb 5, 2024

I am not following this. Is this coming from your test, text_as_html_attributes = [fe.text_as_html for fe in formatted_elements]? If so, maybe use if hasattr(fe, "text_as_html").

@mpolomdeepsense mpolomdeepsense

No, it was coming from the models.chipper.UnstructuredChipperModel.format_table_elements method after I added types to the function definition. If there is a reason not to include it in the class, it can stay as it was and I will just use hasattr as you proposed. I can see that it is used that way in the unstructured repo, so maybe I should just stick with that. What do you think?

@mpolomdeepsense mpolomdeepsense

I changed it to use hasattr.
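
A minimal sketch of the hasattr approach, applied to the list comprehension quoted earlier in the thread. The function name `collect_text_as_html` is hypothetical, and the elements are `SimpleNamespace` stand-ins rather than real layout elements.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def collect_text_as_html(elements: Iterable[Any]) -> List[str]:
    """Collect text_as_html only from elements that actually define it."""
    return [fe.text_as_html for fe in elements if hasattr(fe, "text_as_html")]


# Stand-in elements: one plain text element, one table with HTML.
formatted_elements = [
    SimpleNamespace(text="10"),
    SimpleNamespace(text="Subject Matter Expert", text_as_html="<table></table>"),
]
text_as_html_attributes = collect_text_as_html(formatted_elements)
```

An alternative with the same effect is `getattr(fe, "text_as_html", None)` followed by a None check, which avoids looking the attribute up twice.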

@yuming-long yuming-long merged commit 8653c59 into main Feb 7, 2024
5 of 7 checks passed
@yuming-long yuming-long deleted the marek/fix/text_as_html-metadata branch February 7, 2024 14:21