Feat/chipper repetitions #314
…/unstructured-inference into feat/chipper-repetitions
Tested the PR locally with the soldering PDF and found that, compared to main (file 1), we got:
Co-authored-by: Yao You <[email protected]>
@badGarnet Just in case it is relevant: when comparing the main branch and this one, "chipper" points to "chipperv2" in the main branch and to "chipperv3" in this branch. You have probably noticed this already, but I am mentioning it just in case.
I confirmed what Yao found: for the file soldering-iron-manual.pdf, on page 3, there are no repeated elements on the main branch, but there are on this branch. This is concerning because it is one example of the PR making the problem worse. It may be a one-off, and this PR may fix more cases than it breaks, but it suggests we need to look at the statistics to determine whether this makes things better or worse.
@qued @badGarnet The repetition has been fixed. I had tried to optimize the condition that switches from the initial beam search size of 1 to a beam search size of 3. I have reverted it to the original implementation of NGramRepetitonStoppingCriteria.
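For readers unfamiliar with the idea, here is a minimal sketch of n-gram repetition detection on a sequence of generated token ids. It is not the implementation of NGramRepetitonStoppingCriteria in this repository; the function names `count_trailing_repeats` and `should_stop`, and the parameters `ngram_size` and `max_repeats`, are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def count_trailing_repeats(token_ids: Sequence[int], ngram_size: int) -> int:
    """Count how many times the final n-gram occurs back-to-back at the end."""
    if ngram_size <= 0 or len(token_ids) < ngram_size:
        return 0
    tail = tuple(token_ids[-ngram_size:])
    count = 0
    end = len(token_ids)
    # Walk backwards in steps of ngram_size while the window equals the tail.
    while end - ngram_size >= 0 and tuple(token_ids[end - ngram_size:end]) == tail:
        count += 1
        end -= ngram_size
    return count


def should_stop(token_ids: Sequence[int], ngram_size: int = 2, max_repeats: int = 3) -> bool:
    """Hypothetical stopping rule: stop once the tail n-gram repeats too often."""
    return count_trailing_repeats(token_ids, ngram_size) >= max_repeats


# Token ids 7 and 8 stand in for a pair of tokens generated over and over.
generated = [1, 2, 7, 8, 7, 8, 7, 8]
print(count_trailing_repeats(generated, 2), should_stop(generated))
```

In a real decoder, a check like this could either end generation or trigger a fallback such as a wider beam; which of the two applies here is not something this sketch asserts.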
Closing, as I don't think this is actively being worked on. We can re-open if needed.
In some cases Chipper repeats elements. This PR adds mechanisms to detect these repetitions during decoding and to filter out repetitions that cannot be identified during decoding.
Repetition detection:
Additional filtering:
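As an illustration of filtering repetitions after decoding, the sketch below drops an element when it has the same type and the same whitespace- and case-normalized text as the element immediately before it. This is an assumption about what such a filter could look like, not the filtering implemented in this PR; the function name `drop_consecutive_repeats` and the dictionary keys `type` and `text` are hypothetical.

```python
from typing import Dict, List


def drop_consecutive_repeats(elements: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep an element only if it differs from the element right before it."""
    kept: List[Dict[str, str]] = []
    previous_key = None
    for element in elements:
        # Normalize whitespace and case so trivially different copies match.
        text = " ".join(element.get("text", "").split()).lower()
        key = (element.get("type"), text)
        if key != previous_key:
            kept.append(element)
        previous_key = key
    return kept


elements = [
    {"type": "Title", "text": "Safety"},
    {"type": "NarrativeText", "text": "Unplug the iron."},
    {"type": "NarrativeText", "text": "Unplug  the iron."},
    {"type": "NarrativeText", "text": "unplug the iron."},
    {"type": "Title", "text": "Safety"},
]
filtered = drop_consecutive_repeats(elements)
print([e["type"] for e in filtered])
```

Only consecutive duplicates are removed, so text that legitimately recurs later in a document (for example, a repeated heading) is kept.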
Here are two example images. The first is processed as a table because it was modelled that way in the ground truth; the problem is that the table is never finished, and Chipper generates an unlimited number of "" token pairs. The second has some repetitions that may be caused by the images in the document. With the proposed PR, these repetitions disappear.
The PDF document made Chipper generate repetitions, but only under Linux.
Example code for the images.
Example code for the PDF file:
RAND_RRA2977-1.pdf
A previous PR had issues with unstructured when running some of the mini holistic documents. This was caused by a problem with bounding boxes and with nested elements (e.g. List and List-items), which raised an exception when processing the mini holistic documents below. The example code below allows testing it. You may need to use the main branch of unstructured, since a bug related to layout elements was recently identified there and a different exception might otherwise occur. The code prints the unstructured version, so it is possible to check which version is being used. The output of the processing is stored in the out.un.json file.
soldering-iron-manual.pdf
TX-penal-code-T8-36-1-3.pdf