Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XPath3.1: mimic handling of multiple root element nodes #2351

Open
wants to merge 41 commits into
base: master
Choose a base branch
from

Conversation

Constantin1489
Copy link
Contributor

@Constantin1489 Constantin1489 commented May 7, 2024

Obviously, some web server provides broken html.
The lxml and libxml2 fix it. It's good and indeed great!!! (We have been happy for decades!)

But, at the point, the error I want to solve occurs, the elementpath describes the DOM structure. it's because with some conditions, lxml or libxml2 returns multiple root element nodes when using html parser. (This could be a trace? of the browser wars. I don't remember the article but there were four kinds of html parser rules because of four major browsers.)

See also, https://gitlab.gnome.org/GNOME/libxml2/-/issues/716

So I mimicked it.
The test I included describes the point.

fixes #2318

@Constantin1489 Constantin1489 changed the title mimic several root element nodes handling mimic multiple root element nodes handling May 7, 2024
@Constantin1489 Constantin1489 changed the title mimic multiple root element nodes handling XPATH3.1: mimic multiple root element nodes handling May 7, 2024
requirements.txt Outdated
@@ -55,7 +55,7 @@ beautifulsoup4
lxml >=4.8.0,<6

# XPath 2.0-3.1 support - 4.2.0 broke something?
elementpath==4.1.5
elementpath==4.4.0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is time to upgrade?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is time to upgrade?

Sure, if the tests pass it's OK

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change was required to fix this PR?

Copy link
Contributor Author

@Constantin1489 Constantin1489 May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR(#2351) uses fragment=True option, >=4.1.5 won't work. and 4.2.0 has another problem. So minimum is 4.2.1

@Constantin1489 Constantin1489 marked this pull request as draft May 7, 2024 16:33
@Constantin1489 Constantin1489 changed the title XPATH3.1: mimic multiple root element nodes handling XPATH3.1: mimic handling of multiple root element nodes May 7, 2024
@Constantin1489 Constantin1489 marked this pull request as ready for review May 7, 2024 16:59
])
def test_broken_DOM_01(html_content, xpath, answer):
# In normal situation, DOM's root element node is only one. So when DOM violation happens, Exception occurs.
with pytest.raises(Exception):
Copy link
Contributor Author

@Constantin1489 Constantin1489 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally add this test to reproduce the problem.
And, in the future, libxml2 may implement "html5"(https://gitlab.gnome.org/GNOME/libxml2/-/issues/211). As I posted the issue, this problem will be gone, and this test will fail. The day, please remove these tests.

@pytest.mark.parametrize("html_content", [DOM_violation_two_html_root_element])
@pytest.mark.parametrize("xpath, answer", [
("/html/body/p[1]", "First paragraph."),
("/html/body/p[1]", "Browsers parse this part by fixing it but lxml doesn't and returns two root element node"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the critical point. why do I choose one element in the browser inspect window, but lxml returns two? Because there are two html tag elements and two body tag elements.

<p>First paragraph.</p>
</body>
</html>
<html>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second html root element.

@Constantin1489 Constantin1489 changed the title XPATH3.1: mimic handling of multiple root element nodes mimic handling of multiple root element nodes May 26, 2024
@Constantin1489 Constantin1489 changed the title mimic handling of multiple root element nodes XPath3.1: mimic handling of multiple root element nodes May 26, 2024
@dgtlmoon
Copy link
Owner

So this is nearly always caused by a missing <html open tag right?

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented May 28, 2024

@dgtlmoon As I posted to https://gitlab.gnome.org/GNOME/libxml2/-/issues/716,
minimal codes are

<!DOCTYPE HTML>
<html></html>
<link href="/example/uri" rel="stylesheet" type="text/css" />

OR

<!DOCTYPE HTML>
<html></html>
Some string

OR

<!DOCTYPE HTML>
<html></html>
<Some/>

In this case, libxml2, and lxml returns two html root element nodes.

cat <<EOF | xmllint --html - --output
<!DOCTYPE HTML>
<html></html>
Some string
EOF

or

cat <<EOF | xmllint --html - --output
<!DOCTYPE HTML>
<html></html>
<Some/>     
EOF

@Constantin1489 Constantin1489 marked this pull request as draft June 13, 2024 15:37
@dgtlmoon
Copy link
Owner

please, could you update this with latest master ?

@dgtlmoon
Copy link
Owner

dgtlmoon commented Jul 9, 2024

https://gitlab.gnome.org/GNOME/libxml2/-/issues/211 yeah thats super interesting

"The HTML parser in libxml2 was written 20+ years ago. It does not implement HTML5. Maybe it will some day, maybe it won't. Don't use libxml2 to parse HTML for anything serious. If you maintain a downstream project that uses libxml2's HTML parser, please forward this message to your users."

@dgtlmoon
Copy link
Owner

dgtlmoon commented Jul 9, 2024

So basically this PR is making HTML5 work with libxml2 in a round-about way

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented Aug 1, 2024

So basically this PR is making HTML5 work with libxml2 in a round-about way

That is not what I do. There is no HTML5 for libxml2 yet. The reason for multiple root elements is that the html parser doesn't implement DOM.

This PR allows parsing a non-well-form DOM tree similar.

https://gitlab.gnome.org/GNOME/libxml2/-/issues/211 yeah thats super interesting

I believe the issue I submitted will be fixed with it.

EDIT: add similar

@Constantin1489 Constantin1489 marked this pull request as ready for review August 1, 2024 09:08
add precise description
@Constantin1489
Copy link
Contributor Author

Previous test failed with an unrelated issue.

@Constantin1489
Copy link
Contributor Author

I'm not an expert. It's just my opinion.

If the html parser of libxml2 implements the html5, the benefit is some sort of predictability of HTML DOM, and security in general.

It's quite easy to expect xpath user will slightly need to change one's expression. It's difficult for me to say exhaustively.
But at least, XML elements inside HTML DOM or namespaces and tag name.

Previously html tag name was lowercase, but HTML5 may have XML elements inside, and the xml element may have be upper or lowercase (and also follows xml rule). (e.g: https://developer.mozilla.org/en-US/docs/Web/SVG/Element/textPath $x(".//*[local-name() = 'textPath']")vs $x(".//*[local-name() = 'textpath']")). So, if anybody says html tag is lowercase, it is just not about the xml element in the HTML DOM.

Also, some browsers don't support the namespace for xpath1.(I don't know how to express this sentence correctly) "//*:svg" or the Clark notation(and Clark notation similar) doesn't work in browsers. So, if html5 is parsed with xpath, for convenience, users will use "//*:svg" and the browser doesn't understand it.

@dgtlmoon
Copy link
Owner

dgtlmoon commented Sep 2, 2024

Im wondering if theres a way to only turn this on only when necessary? like do some check first?

or does it already do that?

@Constantin1489
Copy link
Contributor Author

As an idea, what about having this enabled by default as a config option?

The second PR(what you asked with this Q above) showed the side effects.

Im wondering if theres a way to only turn this on only when necessary? like do some check first?

The current reverted PR (first PR) that I submitted was exactly only turned on when it needed to. The sole reason it spits an error is that the root element node has another root element node as a sibling. I revert to the first PR.

@Constantin1489
Copy link
Contributor Author

Constantin1489 commented Sep 10, 2024

the point where fails(unrelated with this PR) occurred is assert b'1,890.45' or b'1890.45' in res.data
this is the same with

if b'1,890.45' or b'1890.45' in b"some string":
    print("pass")

So

assert b'1,890.45' in res.data or b'1890.45' in res.data would be better.

b'1,890.45' in b"some string" or b'1890.45' in b"some string"

EDIT: edit example

@dgtlmoon
Copy link
Owner

Amazing, thank you so much -> #2623

@dgtlmoon
Copy link
Owner

please update with master, i added some changes to move any LXML type stuff into its own subprocesses to help with memory leaks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'str' object has no attribute '__name__' error on some xpath filters
2 participants