Fix: embedded text not getting merged with inferred elements #331

christinestraub (Contributor) commented Mar 21, 2024

This PR is the first part of the fix for "embedded text not getting merged with inferred elements". It works together with the companion unstructured PR, Unstructured-IO/unstructured#2679.

Summary

  • replace Rectangle.is_in() with Rectangle.is_almost_subregion_of() when filling in an inferred element with embedded text
  • add env_config EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD
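
As a rough illustration of why the relaxed check matters, here is a minimal, self-contained sketch. The `Box` class, its method bodies, the `subregion_threshold` helper, and the `0.99` fallback are assumptions made for this example; they mirror the method and config names in the summary but are not the library's actual code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned rectangle; a stand-in, not the library's Rectangle."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(self.x2 - self.x1, 0.0) * max(self.y2 - self.y1, 0.0)

    def is_in(self, other: "Box") -> bool:
        # Strict containment: every edge of self must lie inside other.
        return (
            self.x1 >= other.x1 and self.y1 >= other.y1
            and self.x2 <= other.x2 and self.y2 <= other.y2
        )

    def intersection_area(self, other: "Box") -> float:
        w = min(self.x2, other.x2) - max(self.x1, other.x1)
        h = min(self.y2, other.y2) - max(self.y1, other.y1)
        return max(w, 0.0) * max(h, 0.0)

    def is_almost_subregion_of(self, other: "Box", threshold: float) -> bool:
        # Relaxed containment (assumed definition): the share of self covered
        # by other must exceed the threshold, and self must not be larger.
        if self.area == 0:
            return False
        covered = self.intersection_area(other) / self.area
        return covered > threshold and self.area <= other.area


def subregion_threshold(default: float = 0.99) -> float:
    # Reads the env config named in the summary; the default is an assumption.
    return float(os.environ.get("EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD", default))


inferred = Box(100, 100, 500, 300)   # region predicted by the layout model
embedded = Box(99, 120, 500, 140)    # embedded text line, 1 unit past the left edge

strict = embedded.is_in(inferred)
relaxed = embedded.is_almost_subregion_of(inferred, subregion_threshold())
print(strict, relaxed)
```

With strict containment, a text line that overhangs the inferred region by a single unit is rejected, so its text is never filled in. The overlap-ratio check tolerates that kind of small misalignment while still rejecting boxes that mostly lie outside the region.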

Note

The ingest test won't pass until we merge the unstructured PR - Unstructured-IO/unstructured#2679.

cragwolfe (Contributor) commented:

Should this be a release commit?

@christinestraub christinestraub merged commit 4a2fd95 into main Mar 22, 2024
5 of 7 checks passed
@christinestraub christinestraub deleted the fix/embedded-text-not-getting-merged-with-inferred-elements branch March 22, 2024 20:22
github-merge-queue bot pushed a commit to Unstructured-IO/unstructured that referenced this pull request Mar 23, 2024
This PR is the second part of fixing "embedded text not getting merged
with inferred elements". The first part is done in
Unstructured-IO/unstructured-inference#331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()`
when removing pdfminer (embedded) elements that were merged with
inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`
introduced in the [first
part](Unstructured-IO/unstructured-inference#331)
when removing pdfminer (embedded) elements that were merged with
inferred elements
- bump `unstructured-inference` to 0.7.25
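
The removal step described above could look roughly like the sketch below. All names here (`BBox`, `almost_inside`, `drop_merged`) and the `0.99` threshold are hypothetical and chosen for illustration; the real code operates on element objects rather than bare tuples.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(b: BBox) -> float:
    return max(b[2] - b[0], 0.0) * max(b[3] - b[1], 0.0)


def overlap_area(a: BBox, b: BBox) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def almost_inside(inner: BBox, outer: BBox, threshold: float) -> bool:
    # Assumed definition: most of `inner` is covered by `outer`,
    # and `inner` is not the larger box.
    inner_area = area(inner)
    if inner_area == 0:
        return False
    covered = overlap_area(inner, outer) / inner_area
    return covered > threshold and inner_area <= area(outer)


def drop_merged(embedded: List[BBox], inferred: List[BBox], threshold: float) -> List[BBox]:
    # Keep only embedded boxes that no inferred box (almost) covers; the
    # covered ones are treated as already merged into an inferred element.
    return [
        e for e in embedded
        if not any(almost_inside(e, i, threshold) for i in inferred)
    ]


inferred_boxes = [(100.0, 100.0, 500.0, 300.0)]
embedded_boxes = [
    (99.0, 120.0, 500.0, 140.0),   # slightly overhangs the inferred region
    (100.0, 400.0, 300.0, 420.0),  # entirely outside it
]
remaining = drop_merged(embedded_boxes, inferred_boxes, threshold=0.99)
print(remaining)
```

Using the same relaxed predicate for both filling and removal keeps the two steps consistent: text that was merged into an inferred element is not also emitted as a separate embedded element.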

### Testing
PDF:
[pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements git@github.com:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
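from unstructured.partition.pdf import partition_pdf

# Partition the test PDF with the layout-model ("hi_res") strategy.
elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)

# Print the first detected table to check that its embedded text was merged in.
table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)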
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
kaaloo pushed a commit to inclusif/unstructured that referenced this pull request Apr 8, 2024
…tured-IO#2679)
