Fix: embedded text not getting merged with inferred elements #2679

Merged

@christinestraub christinestraub commented Mar 21, 2024

This PR is the second part of the fix for "embedded text not getting merged with inferred elements"; the first part is in Unstructured-IO/unstructured-inference#331.

Summary

  • replace Rectangle.is_in() with Rectangle.is_almost_subregion_of() when removing pdfminer (embedded) elements that were merged with inferred elements
  • use env_config EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD introduced in the first part when removing pdfminer (embedded) elements that were merged with inferred elements
  • bump unstructured-inference to 0.7.25
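
To illustrate why the switch matters, here is a minimal sketch of the difference between a strict containment test and a tolerant "almost subregion" test. The `Box` class, its method bodies, and the 0.75 threshold are assumptions made for illustration; they are not the `Rectangle` implementation in unstructured-inference, which may compute the overlap differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box; not the library's Rectangle class."""

    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    def is_in(self, other: "Box") -> bool:
        # Strict containment: every edge must lie inside `other`.
        return (
            self.x1 >= other.x1
            and self.y1 >= other.y1
            and self.x2 <= other.x2
            and self.y2 <= other.y2
        )

    def is_almost_subregion_of(self, other: "Box", threshold: float) -> bool:
        # Tolerant containment: a large enough share of this box's area
        # must overlap `other`.
        if self.area == 0:
            return False
        overlap_w = min(self.x2, other.x2) - max(self.x1, other.x1)
        overlap_h = min(self.y2, other.y2) - max(self.y1, other.y1)
        overlap = max(0.0, overlap_w) * max(0.0, overlap_h)
        return overlap / self.area >= threshold


inferred_table = Box(100, 100, 500, 300)
embedded_cell = Box(98, 120, 200, 140)  # extends 2 units past the left edge

strict = embedded_cell.is_in(inferred_table)
# 0.75 is an illustrative value, not necessarily the library default.
tolerant = embedded_cell.is_almost_subregion_of(inferred_table, threshold=0.75)
print(strict, tolerant)
```

In this sketch the embedded box fails the strict test because of a two-unit overhang, yet about 98% of its area lies inside the inferred region, so the tolerant test accepts it. An embedded element in that position would be left behind as a duplicate under a strict check and removed under a tolerant one.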

Testing

PDF: pwc-financial-statements-p114.pdf

$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .

from unstructured.partition.pdf import partition_pdf

# Partition the test PDF with the hi_res strategy and table structure inference.
elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

# Print the text of the first detected table to check that embedded text was merged in.
table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)

ajjimeno and others added 2 commits March 21, 2024 11:48
…_of()` when removing pdfminer (embedded) elements merged with inferred elements
christinestraub added a commit to Unstructured-IO/unstructured-inference that referenced this pull request Mar 22, 2024
This PR is the first part of fixing "embedded text not getting merged
with inferred elements" and works together with the unstructured PR -
Unstructured-IO/unstructured#2679.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when filling in an inferred element with embedded text
- add env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
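
As a rough sketch of the idea behind these two changes, the snippet below reads a threshold from the environment and uses it to decide which embedded words are aggregated into an inferred region. The helper names (`subregion_threshold`, `overlap_fraction`, `aggregate_embedded_text`), the tuple box format, and the 0.75 fallback are all hypothetical; the actual `env_config` handling and aggregation logic in unstructured-inference may differ.

```python
import os

# Boxes are (x1, y1, x2, y2) tuples in this sketch.


def subregion_threshold(default=0.75):
    """Read the threshold from the environment; the default is a placeholder."""
    raw = os.environ.get("EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD")
    return float(raw) if raw is not None else default


def overlap_fraction(inner, outer):
    """Share of `inner`'s area that lies inside `outer` (0.0 to 1.0)."""
    width = max(0.0, min(inner[2], outer[2]) - max(inner[0], outer[0]))
    height = max(0.0, min(inner[3], outer[3]) - max(inner[1], outer[1]))
    inner_area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return (width * height) / inner_area if inner_area > 0 else 0.0


def aggregate_embedded_text(region, words, threshold):
    """Join the text of embedded words that are almost inside `region`."""
    return " ".join(
        text for box, text in words if overlap_fraction(box, region) >= threshold
    )


region = (100, 100, 500, 300)  # inferred element
words = [
    ((98, 120, 200, 140), "Revenue"),    # slight overhang on the left
    ((210, 120, 300, 140), "1,234"),     # fully inside
    ((600, 120, 700, 140), "footnote"),  # outside the region
]
print(aggregate_embedded_text(region, words, subregion_threshold()))
```

Reading the value from the environment keeps the tolerance tunable without a code change, which is the apparent motivation for exposing it as a config setting.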

### Note
The ingest test won't pass until we merge the unstructured PR -
Unstructured-IO/unstructured#2679.
…t test fixtures update (#2688)

This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: christinestraub <[email protected]>
@christinestraub christinestraub added this pull request to the merge queue Mar 23, 2024
Merged via the queue into main with commit 08fafc5 Mar 23, 2024
46 checks passed
@christinestraub christinestraub deleted the fix/embedded-text-not-getting-merged-with-inferred-elements branch March 23, 2024 04:37
kaaloo pushed a commit to inclusif/unstructured that referenced this pull request Apr 8, 2024
…tured-IO#2679)

This PR is the second part of fixing "embedded text not getting merged
with inferred elements", the first part is done in
Unstructured-IO/unstructured-inference#331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](Unstructured-IO/unstructured-inference#331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25

### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>