'str' object has no attribute '__name__' error on some XPath filters
#2318
Comments
tried latest elementpath |
This comment was marked as resolved.
The error comes from elementpath. I tried different versions, with the same outcome. |
Are you saying you can't reproduce the issue? |
You then need to compare the HTML both in the Chrome JS-rendered version and using |
Hi, |
I found the solution, but I need time to verify it. |
I took a look at this just to try and brush up on my pdb skills. The issue here is that lxml believes the HTML from that site is invalid. There's an issue with elementpath.select() assuming it's working on a non-empty tree and not handling the empty case correctly (this is where the exception is coming from). I think one improvement changedetection.io could make here is to check parser.error_log for errors, maybe only for empty trees, as I'm not sure how noisy that error_log is or how often it's non-empty. |
@ezalenski try with Also, please take a look at my test in the PR. |
I encountered the same issue. I'm solving it temporarily using XPath1.0 by prepending |
Hi @amirt01, I would be thankful if you could provide the example URL! |
Certainly @Constantin1489! I use changedetection.io to monitor company job sites like those hosted on Lever. I ran into this issue when filtering for the posting names: Here is an arbitrary example using Kinsta: |
I also came across this issue; it's reproducible on my machine. The CSS/JSONPath/JQ/XPath Filters setting is something like I'm solving it temporarily using XPath1.0 by prepending |
@leiless, would you run the code with the URL modified?
|
@Constantin1489, there is the
$ curl -fsSL $URL | xmllint --html - --debug 2> /dev/null | grep 'ELEMENT html' -C10
HTML DOCUMENT
encoding=utf-8
URL=-
standalone=true
DTD(html)
ELEMENT html
ATTRIBUTE xmlns
TEXT
content=http://www.w3.org/1999/xhtml
TEXT
content=
ELEMENT head
ELEMENT meta
ATTRIBUTE http-equiv
TEXT
content=Content-Type |
Would you please run the code without |
$ curl -fsSL $URL | xmllint --html - --debug 2> /dev/null | grep 'ELEMENT html'
ELEMENT html |
Yes, that is the problem I solved with the PR. |
Sometimes I get this:
But usually it's:
Weird, isn't it? |
@leiless can you include the URL? |
I'm not an expert; this is just my explanation, and it may be wrong in places. This is what the XPath 1.0 spec says (https://www.w3.org/TR/1999/REC-xpath-19991116/#root-node):
This description is similar to how the DOM or XDM looks. The point is that the element node for the document element is a child of the root node (root node != root element node). When we send a document to xmllint, we can expect the fixed HTML document to have one html element (the root element node). The DOM is also important here (https://www.w3.org/2008/08/cleantheweb/libxml).
So when we think of "no-more-fixable" HTML (my own term; it means complete HTML, which is also my term), it will have the same structure as the DOM. When the HTML 4 spec says "the html element is optional", it means something like: "I know you will open the HTML document in a browser, so I will make lemon juice out of it," and the browser creates an html element to fix the document. XPath 1.0, 2.0, 3.0, and 3.1 all expect only one root element node.
But the method I chose is just to create So why does this happen? (Again, this is just my explanation. It may be wrong in places, but I'm easing my pain with it.) That is why XPath 2.0-3.1 works when only one html root element exists. Internally, the nodes are XML nodes. So, by the way, why does libxml2 have an HTML 4.0 non-verifying parser? I think it is what it is. Also, I already reported this issue, so you don't have to. |
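To illustrate the "only one root element node" point, here is a minimal sketch using only the Python standard library. It is not what lxml or elementpath do internally (lxml's HTML parser recovers instead of rejecting); it only shows that a strict XML parser refuses a document with a second top-level element. The function name and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_single_root(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup as one document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # A second top-level element is reported as
        # "junk after document element" by the underlying expat parser.
        return False
    return True


# Hypothetical sample documents, not taken from the sites in this thread.
single_root = "<html><body><p>ok</p></body></html>"
two_roots = "<html><body><p>ok</p></body></html><div>late</div>"

print(parses_as_single_root(single_root))  # True
print(parses_as_single_root(two_roots))    # False
```

An HTML parser in recovery mode has to decide what to do with that extra element, and the result of that decision is what the XPath engine then sees.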
https://www.pdrcfw.com/OurNews.aspx XPath Filters: |
Yeah, definitely it's because the HTML source code is malformed (code after the final |
@leiless I also tried again today. Now it shows "Your access has been identified as an attack and logged". So this may be the reason you sometimes get only one html root element. |
So the non-blocked version (the HTML source code of #2318 (comment)) will show additional root elements as siblings of the root element. |
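For anyone who wants to check whether a page has markup after its closing html tag before blaming the filter, here is a rough sketch with the standard library's html.parser. It is only a diagnostic under my own assumptions, not the fix from the PR; the class name, function name, and sample strings are hypothetical.

```python
from html.parser import HTMLParser


class TrailingMarkupDetector(HTMLParser):
    """Records every start tag that appears after the first </html>."""

    def __init__(self):
        super().__init__()
        self.html_closed = False
        self.tags_after_close = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if self.html_closed:
            self.tags_after_close.append(tag)

    def handle_endtag(self, tag):
        if tag == "html":
            self.html_closed = True


def find_trailing_markup(source: str) -> list:
    """Return the start tags found after the closing html tag, in order."""
    detector = TrailingMarkupDetector()
    detector.feed(source)
    detector.close()
    return detector.tags_after_close


# Hypothetical samples, not the real page source.
clean = "<html><body><p>ok</p></body></html>"
malformed = "<html><body></body></html><script>x()</script><div>late</div>"

print(find_trailing_markup(clean))      # []
print(find_trailing_markup(malformed))  # ['script', 'div']
```

A non-empty result suggests the source is the kind of malformed document discussed here, where a recovering parser may end up with siblings next to the root element.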
@Constantin1489 Great analysis! I wonder whether this bug will be fixed in the [next] release? |
I submitted my PR. However, the maintainer needs time to make sure the PR is the right solution. |
https://www.pdrcfw.com/OurNews.aspx has correct but then...
|
All versions?
Using this shared watch https://changedetection.io/share/QtZ-94DW41sa, I get the 'str' object has no attribute '__name__' error. I tried different lxml library versions, but that made no difference.
https://www.depinte.be/werken and
//div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[1]/div[1]
seems to come from here
changedetection.io/changedetectionio/html_tools.py
Line 128 in e110b3e
Likely it is elementpath related.