Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XPath3.1: mimic handling of multiple root element nodes #2351

Open
wants to merge 41 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 18 commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
8e1f170
html_tools/fix: Add forest_transplanting to handle invalid DOM
Constantin1489 May 2, 2024
1f776ff
requirements/fix: Upgrade and pin elementpath to support fragment option
Constantin1489 May 2, 2024
bf5c2c7
html_tools/fix:
Constantin1489 May 2, 2024
9f0cb35
html_tools/fix: Another option
Constantin1489 May 2, 2024
879d0b2
html_tools/fix:
Constantin1489 May 7, 2024
ed2aaf4
tests/test_xpath_selector_unit/test: Add test.
Constantin1489 May 7, 2024
dd8b4fe
html_tools/docs: Remove comments
Constantin1489 May 7, 2024
fbd5512
tests/test_xpath_selector_unit/fix: Typo
Constantin1489 May 7, 2024
20195e7
tests/test_xpath_selector_unit/test: Fix test and add more small test…
Constantin1489 May 7, 2024
220f484
tests/test_xpath_selector_unit/test: Check error occurs.
Constantin1489 May 7, 2024
e84b9f1
tests/test_xpath_selector_unit/test: Fix
Constantin1489 May 7, 2024
60777e4
tests/test_xpath_selector_unit/test: Add more unintuitive tests
Constantin1489 May 7, 2024
e325e02
tests/test_xpath_selector_unit/test: Trigger test again
Constantin1489 May 7, 2024
6a2e1cf
tests/test_xpath_selector_unit/fix: Trigger test again. why it doesn'…
Constantin1489 May 7, 2024
55b2c6c
tests/test_xpath_selector_unit/test: Oops fix test name
Constantin1489 May 7, 2024
93a9585
tests/test_xpath_selector_unit/test: Failed successfully
Constantin1489 May 7, 2024
e6b13c9
tests/test_xpath_selector_unit/test: Add count test
Constantin1489 May 7, 2024
2e3e781
tests/test_xpath_selector_unit/chore: Trigger CICD
Constantin1489 May 7, 2024
c295c5e
tests/test_xpath_selector_unit/test: Add same behavior for xpath 1
Constantin1489 May 7, 2024
5acd31f
tests/test_xpath_selector_unit/test: Fix misc
Constantin1489 May 7, 2024
de7b66b
tests/test_xpath_selector_unit/test: Fix answer
Constantin1489 May 7, 2024
66a7dae
html_tools/docs: Fix old comment
Constantin1489 May 7, 2024
4d266ca
tests/test_xpath_selector_unit/feat: Do forest_transplanting by default
Constantin1489 May 9, 2024
ebf7fd4
tests/test_xpath_selector_unit/test: Fix tests
Constantin1489 May 9, 2024
26e4a58
tests/test_xpath_selector_unit/test: Add context node related tests
Constantin1489 May 9, 2024
dbf4e87
requirements/chore: Change minimum version of elementpath
Constantin1489 May 16, 2024
7cd764f
html_tools/fix: Improve speed for function calls
Constantin1489 May 17, 2024
48a5aa2
Merge branch 'dgtlmoon:master' into transplanting
Constantin1489 May 22, 2024
3619877
Revert "html_tools/docs: Fix old comment"
Constantin1489 May 26, 2024
c79d88e
Merge branch 'dgtlmoon:master' into transplanting
Constantin1489 Aug 1, 2024
827f81a
Update html_tools.py description
Constantin1489 Aug 1, 2024
23c6471
Merge branch 'dgtlmoon:master' into transplanting
Constantin1489 Sep 10, 2024
e6ac285
Revert "tests/test_xpath_selector_unit/test: Fix tests"
Constantin1489 Sep 10, 2024
0a0f281
Revert "tests/test_xpath_selector_unit/feat: Do forest_transplanting …
Constantin1489 Sep 10, 2024
3223820
Reapply "html_tools/docs: Fix old comment"
Constantin1489 Sep 10, 2024
93950c0
Update html_tools.py to trigger test
Constantin1489 Sep 10, 2024
0e66cb0
Update html_tools.py document for trigger test
Constantin1489 Sep 10, 2024
889fdbb
Update html_tools.py comment to trigger test
Constantin1489 Sep 10, 2024
4043e9a
html_tools/feat: Add logger for forest transplanting.
Constantin1489 Sep 13, 2024
1743ca0
Merge branch 'dgtlmoon:master' into transplanting
Constantin1489 Sep 13, 2024
912470f
html_tools/docs: Add string to trigger test
Constantin1489 Sep 13, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 34 additions & 1 deletion changedetectionio/html_tools.py
Original file line number Diff line number Diff line change
Expand Up @@ -110,6 +110,38 @@ def elementpath_tostring(obj):

return str(obj)

def forest_transplanting(root):
"""
libxml2 violates DOM rules. it means there can be multiple root element
nodes. So I choose just transplating them to a new root by default.
See also, https://gitlab.gnome.org/GNOME/libxml2/-/issues/716
This will emulate xpath1 of html of libxml2 like '/html[2]/*'.
To make this function work, 'fragment=True' in elementpath.select is required.
"""
from lxml import etree
from itertools import chain
root_siblings_preceding = [ s for s in root.itersiblings(preceding=True)]
root_siblings = [s for s in root.itersiblings()]

Is_fragment=False
# If element node exsits in root element node's sibilings, it is fragment.
for node in chain(root_siblings_preceding, root_siblings):
if not hasattr(node.tag, '__name__'):
Is_fragment=True
# early exit. because the root is already root element.
# So, two root element nodes are detected. DOM violation.
break

if Is_fragment:
new_root = etree.Element("new_root")
root_siblings_preceding.reverse()
for node in chain(root_siblings_preceding, [root], root_siblings):
new_root.append(node)
return new_root, True

return root, False
Constantin1489 marked this conversation as resolved.
Show resolved Hide resolved


# Return str Utf-8 of matched rules
def xpath_filter(xpath_filter, html_content, append_pretty_line_formatting=False, is_rss=False):
from lxml import etree, html
Expand All @@ -123,9 +155,10 @@ def xpath_filter(xpath_filter, html_content, append_pretty_line_formatting=False
parser = etree.XMLParser(strip_cdata=False)

tree = html.fromstring(bytes(html_content, encoding='utf-8'), parser=parser)
tree, is_fragment = forest_transplanting(tree)
html_block = ""

r = elementpath.select(tree, xpath_filter.strip(), namespaces={'re': 'http://exslt.org/regular-expressions'}, parser=XPath3Parser)
r = elementpath.select(tree, xpath_filter.strip(), namespaces={'re': 'http://exslt.org/regular-expressions'}, parser=XPath3Parser, fragment=is_fragment)
#@note: //title/text() wont work where <title>CDATA..

if type(r) != list:
Expand Down
58 changes: 58 additions & 0 deletions changedetectionio/tests/test_xpath_selector_unit.py
Original file line number Diff line number Diff line change
Expand Up @@ -201,3 +201,61 @@ def test_trips(html_content, xpath, answer):
html_content = html_tools.xpath_filter(xpath, html_content, append_pretty_line_formatting=True)
assert type(html_content) == str
assert answer in html_content

DOM_violation_two_html_root_element = """<!DOCTYPE html>
<html>
<body>
<h1>Hello world</h1>
<p>First paragraph.</p>
</body>
</html>
<html>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second html root element.

<body>
<h1>Hello world</h1>
<p>Browsers parse this part by fixing it but lxml doesn't and returns two root element node</p>
<p>Therefore, if the path is /html/body/p[1], lxml(libxml2) returns two element nodes not one.</p>
</body>
</html>"""
@pytest.mark.parametrize("html_content", [DOM_violation_two_html_root_element])
@pytest.mark.parametrize("xpath, answer", [
("/html/body/p[1]", "First paragraph."),
("/html/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the critical point. why do I choose one element in the browser inspect window, but lxml returns two? Because there are two html tag elements and two body tag elements.

("count(/html/body/p[1])", "2"),
("count(/html)", "2"),
("count(//html)", "2"),
("count(//body)", "2"),
("count(/html/body)", "2"),
("//html/body/p[1]", "First paragraph."),
("//html/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
("//body/p[1]", "First paragraph."),
("//body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
("/html[2]/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
("//html[2]/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
])
def test_broken_DOM_01(html_content, xpath, answer):
# In normal situation, DOM's root element node is only one. So when DOM violation happens, Exception occurs.
with pytest.raises(Exception):
Copy link
Contributor Author

@Constantin1489 Constantin1489 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally add this test to reproduce the problem.
And, in the future, libxml2 may implement "html5"(https://gitlab.gnome.org/GNOME/libxml2/-/issues/211). As I posted the issue, this problem will be gone, and this test will fail. The day, please remove these tests.

from lxml import etree, html
import elementpath
from elementpath.xpath3 import XPath3Parser
parser = etree.HTMLParser()
tree = html.fromstring(bytes(html_content, encoding='utf-8'), parser=parser)
# just example xpath
# Error will occur.
r = elementpath.select(tree, xpath.strip(), namespaces={'re': 'http://exslt.org/regular-expressions'}, parser=XPath3Parser)

html_content = html_tools.xpath_filter(xpath, html_content, append_pretty_line_formatting=True)
assert type(html_content) == str
assert answer in html_content

@pytest.mark.parametrize("html_content", [DOM_violation_two_html_root_element])
@pytest.mark.parametrize("xpath, answer", [
("/html[2]/body/p[1]", "First paragraph."),
("//html[2]/body/p[1]", "First paragraph."),
])
def test_Broken_DOM_02(html_content, xpath, answer):
# In normal situation, DOM's root element node is only one. So when DOM violation happens, Exception occurs.
html_content = html_tools.xpath_filter(xpath, html_content, append_pretty_line_formatting=True)
assert type(html_content) == str
# Check the answer is *not in* the html_content
assert answer not in html_content
2 changes: 1 addition & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ beautifulsoup4
lxml >=4.8.0,<6

# XPath 2.0-3.1 support - 4.2.0 broke something?
elementpath==4.1.5
elementpath==4.4.0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is time to upgrade?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is time to upgrade?

Sure, if the tests pass it's OK

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change was required to fix this PR?

Copy link
Contributor Author

@Constantin1489 Constantin1489 May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR(#2351) uses fragment=True option, >=4.1.5 won't work. and 4.2.0 has another problem. So minimum is 4.2.1


selenium~=4.14.0

Expand Down