XPath3.1: mimic handling of multiple root element nodes #2351

Constantin1489 · 2024-05-07T15:46:17Z

Some web servers serve broken HTML.
lxml and libxml2 repair it, which is great (we have been happy with that for decades).

The error I want to fix occurs at the point where elementpath describes the DOM structure. It happens because, under some conditions, lxml (via libxml2) returns multiple root element nodes when the HTML parser is used. (This may be a trace of the browser wars. I don't remember the article, but there were four kinds of HTML parsing rules because of the four major browsers.)

See also, https://gitlab.gnome.org/GNOME/libxml2/-/issues/716

So this PR mimics that handling of multiple root element nodes.
The included test illustrates the point.
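
To make the idea concrete, here is a minimal sketch of the wrapping step, not the code in this PR. It uses the standard library's xml.etree.ElementTree instead of lxml and elementpath, and the names FOREST, parse_forest, and the wrapper tag new_tree are hypothetical. A strict XML parser rejects two top-level elements outright, while placing them under one synthetic root gives a single tree that can be queried.

```python
import xml.etree.ElementTree as ET

# Illustrative input: two sibling top-level elements, similar in shape to the
# broken pages discussed here. A strict XML parser allows only one.
FOREST = (
    "<html><body><p>First</p></body></html>"
    "<html><body><p>Second</p></body></html>"
)


def parse_forest(source: str, wrapper_tag: str = "new_tree") -> ET.Element:
    """Parse markup with several top-level elements under one synthetic root.

    Assumes `source` has no XML declaration or DOCTYPE, because neither can
    appear inside an element.
    """
    return ET.fromstring(f"<{wrapper_tag}>{source}</{wrapper_tag}>")


try:
    ET.fromstring(FOREST)
    parsed_without_wrapper = True
except ET.ParseError:
    # Raised because content follows the first document element.
    parsed_without_wrapper = False

tree = parse_forest(FOREST)
root_tags = [child.tag for child in tree]
print(parsed_without_wrapper, root_tags)
```

With a wrapper like this, everything sits one level deeper than in the original markup, so absolute paths written against the original document need to be evaluated with that in mind.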

fixes #2318

Constantin1489 · 2024-05-07T15:49:09Z

requirements.txt

@@ -55,7 +55,7 @@ beautifulsoup4
 lxml >=4.8.0,<6

 # XPath 2.0-3.1 support - 4.2.0 broke something?
-elementpath==4.1.5
+elementpath==4.4.0


Is it time to upgrade?

dgtlmoon · 2024-05-08

> Is it time to upgrade?

Sure, if the tests pass it's OK.

dgtlmoon · 2024-05-15

Was this change required to fix this PR?

Constantin1489 · 2024-05-16

This PR (#2351) uses the fragment=True option, so 4.1.5 and earlier won't work, and 4.2.0 has another problem. So the minimum is 4.2.1.
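
The version constraint described here can be written as a small check. This is only a sketch: supports_fragment_fix, parse_version, and MINIMUM_ELEMENTPATH are hypothetical names, and the parsing assumes plain dotted numeric versions such as 4.2.1 with no pre-release suffixes.

```python
# Lowest elementpath release that works for this PR, as stated in the thread.
MINIMUM_ELEMENTPATH = (4, 2, 1)


def parse_version(text: str) -> tuple:
    """Turn '4.2.1' into (4, 2, 1). Assumes numeric, dot-separated parts only."""
    return tuple(int(part) for part in text.split("."))


def supports_fragment_fix(version_text: str) -> bool:
    """Return True when the version is at or above the stated minimum."""
    return parse_version(version_text) >= MINIMUM_ELEMENTPATH


for candidate in ("4.1.5", "4.2.0", "4.2.1", "4.4.0"):
    print(candidate, supports_fragment_fix(candidate))
```

Comparing integer tuples avoids the string-comparison trap where "4.10.0" would sort before "4.2.1".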

changedetectionio/html_tools.py

Constantin1489 · 2024-05-07T17:02:23Z

changedetectionio/tests/test_xpath_selector_unit.py

+                          ])
+def test_broken_DOM_01(html_content, xpath, answer):
+    # In normal situation, DOM's root element node is only one. So when DOM violation happens, Exception occurs.
+    with pytest.raises(Exception):


I intentionally added this test to reproduce the problem.
In the future, libxml2 may implement HTML5 parsing (https://gitlab.gnome.org/GNOME/libxml2/-/issues/211). As I noted in the issue I posted, this problem would then go away and this test would fail. When that day comes, please remove these tests.

Constantin1489 · 2024-05-07T17:12:40Z

changedetectionio/tests/test_xpath_selector_unit.py

+@pytest.mark.parametrize("html_content", [DOM_violation_two_html_root_element])
+@pytest.mark.parametrize("xpath, answer", [
+    ("/html/body/p[1]", "First paragraph."),
+    ("/html/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),


This is the critical point. Why do I select one element in the browser's inspector window, while lxml returns two? Because there are two html elements and two body elements.
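
The double match can be reproduced in miniature. The sketch below is an illustration with the standard library's xml.etree.ElementTree, not lxml, and both the markup and the forest wrapper element are made up for the example. Because each html element has its own body, p[1] is the first paragraph of each body, so one path selects two nodes.

```python
import xml.etree.ElementTree as ET

# Two html elements side by side. The outer <forest> element exists only so
# that ElementTree, which needs a single root, can hold both of them.
wrapped = ET.fromstring(
    "<forest>"
    "<html><body><p>First paragraph.</p><p>Another one.</p></body></html>"
    "<html><body><p>Paragraph in the second root.</p></body></html>"
    "</forest>"
)

# p[1] is evaluated per parent, so every body contributes its first <p>.
matches = [p.text for p in wrapped.findall("./html/body/p[1]")]
print(matches)
```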

Constantin1489 · 2024-05-07T17:39:54Z

changedetectionio/tests/test_xpath_selector_unit.py

+    <p>First paragraph.</p>
+  </body>
+</html>
+<html>


The second html root element.

dgtlmoon · 2024-05-28T09:14:25Z

So this is nearly always caused by a missing <html> opening tag, right?

Constantin1489 · 2024-05-28T11:52:12Z

@dgtlmoon As I posted in https://gitlab.gnome.org/GNOME/libxml2/-/issues/716,
the minimal examples are

<!DOCTYPE HTML>
<html></html>
<link href="/example/uri" rel="stylesheet" type="text/css" />

OR

<!DOCTYPE HTML>
<html></html>
Some string

OR

<!DOCTYPE HTML>
<html></html>
<Some/>

In these cases, libxml2 and lxml return two html root element nodes.

cat <<EOF | xmllint --html -
<!DOCTYPE HTML>
<html></html>
Some string
EOF

or

cat <<EOF | xmllint --html -
<!DOCTYPE HTML>
<html></html>
<Some/>
EOF
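
All three minimal cases have something after the closing html tag. As a rough illustration of that shared trait, the sketch below lists the top-level nodes of a source string using the standard library's html.parser. This is a stand-in under stated assumptions: html.parser is a plain tokenizer that performs no repair, so it does not reproduce the tree libxml2 builds, and TopLevelLister, top_level_nodes, and the depth tracking are hypothetical and deliberately naive.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TopLevelLister(HTMLParser):
    """Record every element or non-blank text that appears at depth 0."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.top_level = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.top_level.append(tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <Some/> opens and closes in one step.
        if self.depth == 0:
            self.top_level.append(tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            # Text may arrive in several chunks; record it only once per run.
            if not self.top_level or self.top_level[-1] != "#text":
                self.top_level.append("#text")


def top_level_nodes(source: str) -> list:
    parser = TopLevelLister()
    parser.feed(source)
    parser.close()
    return parser.top_level


case_link = '<!DOCTYPE HTML>\n<html></html>\n<link href="/example/uri" rel="stylesheet" type="text/css" />\n'
case_text = "<!DOCTYPE HTML>\n<html></html>\nSome string\n"
case_elem = "<!DOCTYPE HTML>\n<html></html>\n<Some/>\n"

for case in (case_link, case_text, case_elem):
    print(top_level_nodes(case))
```

More than one entry in the returned list means the source has content outside the first html element, which is the situation described in this comment.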

dgtlmoon · 2024-06-25T11:32:30Z

Please, could you update this with the latest master?

dgtlmoon · 2024-07-09T08:08:19Z

https://gitlab.gnome.org/GNOME/libxml2/-/issues/211 yeah, that's super interesting

"The HTML parser in libxml2 was written 20+ years ago. It does not implement HTML5. Maybe it will some day, maybe it won't. Don't use libxml2 to parse HTML for anything serious. If you maintain a downstream project that uses libxml2's HTML parser, please forward this message to your users."

dgtlmoon · 2024-07-09T08:08:45Z

So basically this PR is making HTML5 work with libxml2 in a round-about way

Constantin1489 · 2024-08-01T09:04:59Z

> So basically this PR is making HTML5 work with libxml2 in a round-about way

That is not what this PR does. There is no HTML5 support in libxml2 yet. The reason for the multiple root elements is that the HTML parser doesn't implement the DOM.

This PR allows a non-well-formed DOM tree to be parsed in a similar way.

> https://gitlab.gnome.org/GNOME/libxml2/-/issues/211 yeah, that's super interesting

I believe the issue I submitted will be fixed along with it.

EDIT: add similar

Constantin1489 · 2024-08-01T09:36:17Z

The previous test run failed because of an unrelated issue.

Constantin1489 · 2024-08-01T10:03:20Z

I'm not an expert; this is just my opinion.

If libxml2's HTML parser implements HTML5, the benefits are a more predictable HTML DOM and better security in general.

It is easy to foresee that XPath users will need to change their expressions slightly. I can't list the cases exhaustively,
but at least XML elements inside the HTML DOM, namespaces, and tag names will be affected.

Previously, HTML tag names were lowercase, but HTML5 may contain XML elements, and an XML element name may be upper- or lowercase (it follows XML rules). For example, for https://developer.mozilla.org/en-US/docs/Web/SVG/Element/textPath, compare $x(".//*[local-name() = 'textPath']") with $x(".//*[local-name() = 'textpath']"). So the claim that HTML tags are lowercase does not apply to XML elements in the HTML DOM.
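
The case-sensitivity point can be checked outside a browser. The sketch below uses the standard library's xml.etree.ElementTree (Python 3.8 or later for the {*} namespace wildcard) on a small, made-up SVG fragment. ElementTree has no local-name() function, so {*}name is used here as a rough stand-in for matching on the local name in any namespace.

```python
import xml.etree.ElementTree as ET

# Made-up SVG fragment; textPath is spelled in camelCase, as XML rules require
# the name to be matched exactly.
SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<text><textPath href="#p">along a path</textPath></text>'
    "</svg>"
)

root = ET.fromstring(SVG)

# {*} matches any namespace, so only the local name decides the result.
camel_case = root.findall(".//{*}textPath")
lower_case = root.findall(".//{*}textpath")
print(len(camel_case), len(lower_case))
```

Only the camelCase spelling matches, because XML element names are compared case-sensitively.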

Also, some browsers don't support namespaces in XPath 1.0 (I'm not sure how to phrase this correctly): "//*:svg" and Clark notation (or anything similar to it) don't work in browsers. So if HTML5 is parsed with XPath, users will write "//*:svg" for convenience, and the browser won't understand it.

dgtlmoon · 2024-09-02T11:38:34Z

I'm wondering if there's a way to turn this on only when necessary? Like doing some check first?

Or does it already do that?

Constantin1489 · 2024-09-10T17:54:26Z

> As an idea, what about having this enabled by default as a config option?

The second PR (the one you asked for with the question above) showed the side effects.

> I'm wondering if there's a way to turn this on only when necessary? Like doing some check first?

The first PR I submitted, which is the current state after the reverts, already turned this on only when it was needed. The sole reason the error is raised is that the root element node has another root element node as a sibling. I have reverted to the first PR.

Constantin1489 · 2024-09-10T19:28:48Z

The point where the failure (unrelated to this PR) occurred is assert b'1,890.45' or b'1890.45' in res.data.
This behaves the same as

if b'1,890.45' or b'1890.45' in b"some string":
    print("pass")

which always passes, because the non-empty bytes literal b'1,890.45' is truthy on its own. So

assert b'1,890.45' in res.data or b'1890.45' in res.data would be better. It corresponds to

b'1,890.45' in b"some string" or b'1890.45' in b"some string"
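
The precedence issue is easy to verify in isolation. In the sketch below, buggy_check and fixed_check are hypothetical helper names wrapping the two expressions from this comment. Since in binds more tightly than or, the buggy form is evaluated as b'1,890.45' or (b'1890.45' in data), and a non-empty bytes object is always truthy.

```python
def buggy_check(data: bytes) -> bool:
    # Parsed as: b'1,890.45' or (b'1890.45' in data).
    # The left operand is a non-empty bytes object, so this is always truthy.
    return bool(b'1,890.45' or b'1890.45' in data)


def fixed_check(data: bytes) -> bool:
    # Each candidate gets its own membership test.
    return b'1,890.45' in data or b'1890.45' in data


unrelated = b"some string"
with_price = b"price: 1890.45 EUR"

print(buggy_check(unrelated), fixed_check(unrelated))
print(buggy_check(with_price), fixed_check(with_price))
```

An equivalent spelling of the fixed form is any(candidate in data for candidate in (b'1,890.45', b'1890.45')), which scales better if more formats need to be accepted.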

EDIT: edit example

dgtlmoon · 2024-09-11T08:59:05Z

Amazing, thank you so much -> #2623

dgtlmoon · 2024-09-13T17:33:33Z

Please update with master. I added some changes that move any lxml-type work into its own subprocesses to help with memory leaks.

Constantin1489 added 12 commits May 2, 2024 20:41

html_tools/fix: Add forest_transplanting to handle invalid DOM

8e1f170

requirements/fix: Upgrade and pin elementpath to support fragment option

1f776ff

html_tools/fix:

bf5c2c7

html_tools/fix: Another option

9f0cb35

html_tools/fix:

879d0b2

tests/test_xpath_selector_unit/test: Add test.

ed2aaf4

html_tools/docs: Remove comments

dd8b4fe

tests/test_xpath_selector_unit/fix: Typo

fbd5512

tests/test_xpath_selector_unit/test: Fix test and add more small test…

20195e7

…s for fragment

tests/test_xpath_selector_unit/test: Check error occurs.

220f484

tests/test_xpath_selector_unit/test: Fix

e84b9f1

tests/test_xpath_selector_unit/test: Add more unintuitive tests

60777e4

Constantin1489 changed the title ~~mimic several root element nodes handling~~ mimic multiple root element nodes handling May 7, 2024

Constantin1489 changed the title ~~mimic multiple root element nodes handling~~ XPATH3.1: mimic multiple root element nodes handling May 7, 2024

Constantin1489 added 2 commits May 8, 2024 01:04

tests/test_xpath_selector_unit/test: Trigger test again

e325e02

tests/test_xpath_selector_unit/fix: Trigger test again. why it doesn'…

6a2e1cf

…t work like my repo

Constantin1489 marked this pull request as draft May 7, 2024 16:33

Constantin1489 added 2 commits May 8, 2024 01:36

tests/test_xpath_selector_unit/test: Oops fix test name

55b2c6c

tests/test_xpath_selector_unit/test: Failed successfully

93a9585

Constantin1489 changed the title ~~XPATH3.1: mimic multiple root element nodes handling~~ XPATH3.1: mimic handling of multiple root element nodes May 7, 2024

Constantin1489 marked this pull request as ready for review May 7, 2024 16:59

Constantin1489 added 2 commits May 8, 2024 02:16

tests/test_xpath_selector_unit/test: Add count test

e6b13c9

tests/test_xpath_selector_unit/chore: Trigger CICD

2e3e781

Constantin1489 added 2 commits May 8, 2024 02:50

tests/test_xpath_selector_unit/test: Add same behavior for xpath 1

c295c5e

tests/test_xpath_selector_unit/test: Fix misc

5acd31f

Revert "html_tools/docs: Fix old comment"

3619877

This reverts commit 66a7dae.

Constantin1489 changed the title ~~XPATH3.1: mimic handling of multiple root element nodes~~ mimic handling of multiple root element nodes May 26, 2024

Constantin1489 changed the title ~~mimic handling of multiple root element nodes~~ XPath3.1: mimic handling of multiple root element nodes May 26, 2024

Constantin1489 marked this pull request as draft June 13, 2024 15:37

Merge branch 'dgtlmoon:master' into transplanting

c79d88e

Constantin1489 marked this pull request as ready for review August 1, 2024 09:08

Update html_tools.py description

827f81a

add precise description

Constantin1489 added 4 commits September 11, 2024 00:10

Merge branch 'dgtlmoon:master' into transplanting

23c6471

Revert "tests/test_xpath_selector_unit/test: Fix tests"

e6ac285

This reverts commit ebf7fd4.

Revert "tests/test_xpath_selector_unit/feat: Do forest_transplanting …

0a0f281

…by default" This reverts commit 4d266ca.

Reapply "html_tools/docs: Fix old comment"

3223820

This reverts commit 3619877.

Constantin1489 added 2 commits September 11, 2024 03:23

Update html_tools.py to trigger test

93950c0

just blanks

Update html_tools.py document for trigger test

0e66cb0

Update html_tools.py comment to trigger test

889fdbb

html_tools/feat: Add logger for forest transplanting.

4043e9a

Constantin1489 added 2 commits September 14, 2024 03:10

Merge branch 'dgtlmoon:master' into transplanting

1743ca0

html_tools/docs: Add string to trigger test

912470f
