List element inside a table is lost #531
Comments
I was able to fix it this way:

diff --git a/trafilatura/core.py b/trafilatura/core.py
index 63699a4..1970c25 100644
--- a/trafilatura/core.py
+++ b/trafilatura/core.py
@@ -397,7 +397,7 @@ def handle_table(table_elem, potential_tags, options):
                 # add child element to processed_element
                 if processed_subchild is not None:
                     subchildelem = SubElement(newchildelem, processed_subchild.tag)
-                    subchildelem.text, subchildelem.tail = processed_subchild.text, processed_subchild.tail
+                    subchildelem.text, subchildelem.tail = ''.join(processed_subchild.itertext()), processed_subchild.tail
                 child.tag = 'done'
         # add to tree
         if newchildelem.text or len(newchildelem) > 0:

But I am not sure if this is the correct solution.
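To illustrate why the patched line recovers the lost text, here is a minimal sketch using the standard library's `xml.etree.ElementTree` (trafilatura itself uses lxml, whose elements offer the same `.text` and `.itertext()` interface). The `cell`, `list`, and `item` markup and the timestamp values are made up for the example and are not taken from the reported page.

```python
import xml.etree.ElementTree as ET

# Hypothetical table cell that contains a list (not the markup from the issue).
cell = ET.fromstring(
    "<cell><list><item>10:00</item><item>12:30</item></list></cell>"
)
list_elem = cell.find("list")

# .text only holds the text that appears before the first child element,
# so the content of the nested <item> elements is not included.
direct_text = list_elem.text

# itertext() walks the element and all of its descendants in document order,
# so joining its output collects the nested text as well.
full_text = "".join(list_elem.itertext())

print(repr(direct_text))  # None
print(repr(full_text))    # '10:0012:30'
```

Note that joining with an empty string runs the items together without a separator, which may be one reason to doubt that this is the final form of the fix.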
@mikhainin Thank you for reporting the bug and proposing a solution. Could you please draft a PR with it? If the tests pass, I will integrate it.
Sure, I just filed #534.
Note: the issue is now fixed if the recall option is on.
Try it on the Spotify privacy policy: https://www.spotify.com/in-en/legal/privacy-policy/
source.html
This code returns only the days (first column), but important information (the timestamps) is missing.
However, if I supply include_tables=False, I can see the timestamps.
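The original snippet and the attached `source.html` are not preserved in this thread, so the following is only a sketch of how the comparison described above could be reproduced. The helper name `compare_table_modes` is hypothetical; `include_tables` and `favor_recall` are keyword arguments of `trafilatura.extract`, and it is an assumption here that `favor_recall` is the "recall option" mentioned in the comments.

```python
def compare_table_modes(html: str) -> dict:
    """Extract the same HTML with different table-related settings.

    Hypothetical helper for comparing outputs side by side; it is not
    part of trafilatura.
    """
    import trafilatura  # third-party package, imported lazily

    return {
        # Default behaviour: tables are included in the output.
        "default": trafilatura.extract(html),
        # Tables are skipped, as in the workaround described in the report.
        "no_tables": trafilatura.extract(html, include_tables=False),
        # Assumed to correspond to the "recall option" from the comments.
        "favor_recall": trafilatura.extract(html, favor_recall=True),
    }
```

Calling the helper on the contents of an affected page and printing each entry should show whether the list text inside table cells survives in a given mode.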