-
-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed lists inside tables when include_tables=True #534
Conversation
@mikhainin It turns out your fix doesn't work well for nested elements in tables (see tests). Could you please have a look at it and see if you find a solution? |
@adbar, From what I can see, this test-case is failing: htmlstring = html.fromstring(
"""<html>
<body><article>
<table>
<tbody>
<tr>
<td>
<small>text<br></small>
<h4>more_text</h4>
</td>
<td><a href='link'>linktext</a></td>
</tr>
</tbody>
</table>
</article></body>
</html>"""
)
processed = extract(
htmlstring, no_fallback=True, output_format='xml', config=DEFAULT_CONFIG, include_links=True
)
result = processed.replace('\n', '').replace(' ', '')
assert """<table><row><cell>text<head>more_text</head></cell></row></table>""" in result New version (the current change) produces: <doccategories=""tags=""fingerprint="576b5da16c181e08"><main>
<table>
<row>
<cell>text
<head>more_text</head>
</cell>
<cell>
<p>linktext</p>
</cell>
</row>
</table>
</main><comments/></doc> The current master (expected behaviour I suppose): <doccategories=""tags=""fingerprint="d7ffdfa76c785e1c"><main>
<table>
<row>
<cell>text
<head>more_text</head>
</cell>
</row>
</table>
</main><comments/></doc> Could you explain why |
Yes, you're perfectly right, we need to change this test then. Do you want to do it? By the way, you could also add a test for the particular problem you're solving. Your PR slightly decreases precision on the benchmark, it could have something to do with undesirable content getting added by |
I can give a try with it but I would need some guidance: I'm familiar with this library for less than a week :) |
Of course, I'll look for potential ways to fix it and give you the necessary info. |
The code is convoluted and badly needs to be simplified, however HTML documents are not as regular as they should on the web so that's why a lot of safeguards have been implemented at different levels in the course of time. |
Thanks - that's helpful! I updated the implementation it pass our tests and Trafilatura's. Could you take a look once again, please? |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #534 +/- ##
==========================================
- Coverage 97.58% 97.49% -0.09%
==========================================
Files 23 23
Lines 3389 3394 +5
==========================================
+ Hits 3307 3309 +2
- Misses 82 85 +3 ☔ View full report in Codecov by Sentry. |
The walrus operator is not available before Python 3.8 but that's a minor issue. |
Removed the operator |
The tests pass but on the benchmark the precision is lower. It could be that tables with too many nested elements contain undesirable text. |
46a986e
to
954897d
Compare
I rebased on the latest master and run
Master:
If you tell me the right way to run the benchmark, please? |
I still see a small difference, you need to re-install the package from the branch to see it: It's not a big deal, in the worst case your PR would be used when |
Yeah in |
I would then change the condition to |
hows the code preserve the indentation of the list items? |
A possible fix for #531