Fix for wggesucht crawler to only consider the desired listings #516
Conversation
Amazing - thanks for the initiative. Great to have new developers on board :) The type checker and the linter have a couple of comments - can you look at those?
Definitely, let me check them out! Thanks for all your amazing work btw :)
Ok, can you help me out with those? I'm not really a Python dev and struggle to figure it out. The linter exits with exit code 28, which (after reading this SO post) signals to me that warning, refactor and convention messages were issued. How and where though? The changes are so minimal 😅 The type checker fails because it doesn't know where 'attrs' comes from, but that was already in use on the element, so I wonder why it fails now. Also, how would I define a type for that? And finally, the test fails because it asserts a len of 20 for the wg-gesucht listings. I consistently get 21 though (or 27 with the additional listings), so I think they just decided to add one to the regular ones, which would require changing the test logic. Should I do that?
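For what it's worth, pylint's exit status is a bit mask (per its documentation): 1 = fatal, 2 = error, 4 = warning, 8 = refactor, 16 = convention, 32 = usage error. Decoding 28 confirms the reading above - a quick sketch:

```python
# Pylint encodes the issued message categories as ORed bits in its exit status.
PYLINT_BITS = {
    1: "fatal",
    2: "error",
    4: "warning",
    8: "refactor",
    16: "convention",
    32: "usage error",
}

def decode_pylint_status(code: int) -> list:
    """Return the message categories encoded in a pylint exit status."""
    return [name for bit, name in PYLINT_BITS.items() if code & bit]

print(decode_pylint_status(28))  # → ['warning', 'refactor', 'convention']
```

So 28 = 4 + 8 + 16: at least one warning, refactor and convention message each, which pylint prints to stdout alongside the message location.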
flathunter/crawler/wggesucht.py
Outdated
if not isinstance(element, Tag):
    return False
if "id" not in element.attrs:
    return False
return element.attrs["id"].startswith('liste-')
if "id" not in element.parent.attrs:
Problem here is that 'element.parent' may be 'None' / null, so 'pyright' complains that you can't rely on 'element.parent.attrs' existing.
You need a check here like `if element.parent is not None and "id" not in element.parent.attrs`
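A minimal sketch of the None-safe version, using a stub in place of bs4's Tag (the stub class is purely for illustration; the attribute names match bs4's, and the exact semantics of the real filter may differ):

```python
from typing import Optional, Union

class Tag:
    """Minimal stand-in for bs4.element.Tag, for illustration only."""
    def __init__(self, attrs=None, parent: "Optional[Tag]" = None):
        self.attrs = attrs or {}
        self.parent = parent

def liste_attribute_filter(element: Union[Tag, str]) -> bool:
    """Sketch: accept only Tags whose 'id' starts with 'liste-' and whose
    parent exists and carries an 'id' attribute."""
    if not isinstance(element, Tag):
        return False
    if "id" not in element.attrs:
        return False
    # Guard: element.parent may be None, so check it before touching .attrs.
    if element.parent is None or "id" not in element.parent.attrs:
        return False
    return element.attrs["id"].startswith("liste-")
```

With the explicit `is None` branch, pyright can narrow `element.parent` to `Tag` on the remaining path and stops complaining.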
flathunter/crawler/wggesucht.py
Outdated
@@ -148,12 +148,14 @@ def parse_expose_element_to_details(row: Tag, crawler: str) -> Optional[Dict]:

 def liste_attribute_filter(element: Union[Tag, str]) -> bool:
-    """Return true for elements whose 'id' attribute starts with 'liste-'"""
+    """Return true for elements whose 'id' attribute starts with 'liste-' and are contained in a parent with an 'id' attribute of 'main_column'"""
The linter complains here that the line is too long. It needs to wrap at 100 characters.
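Since triple-quoted strings may span multiple lines, simply breaking the docstring is enough to satisfy the limit. A sketch (the body here is just a placeholder):

```python
def liste_attribute_filter(element):
    """Return true for elements whose 'id' attribute starts with 'liste-'
    and are contained in a parent with an 'id' attribute of 'main_column'.
    """
    ...
```

Every source line now stays well under 100 characters while the docstring content is unchanged.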
Thanks for the tips! I also just noticed that my fix breaks the wg-gesucht search on another page, because they use a different layout depending on whether you look for rooms or flats. When looking for flats they introduce another wrapper element instead of sitting directly under the 'main_column' div. Uggh. Let me figure this out and do it right before you merge.
Ok, so I reversed the logic: it now excludes results in the 'premium_user_extra_list' container and no longer affects other pages. I also changed the code to satisfy the linter. Still, the test is going to fail, because the search for 'flats' returns 20 ads whereas the search for 'rooms' returns 21 ads. I noticed that the search for 'flats' is generally broken at the moment (not only due to my change) because of their differing, maybe new, layout. I think this should be handled in another PR though, as it is contextually different from this PR's goal.
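The reversed, exclusion-based check could look roughly like this - again with a stub standing in for bs4's Tag, so this is only a sketch of the idea, not the merged code:

```python
class Tag:
    """Minimal stand-in for bs4.element.Tag, for illustration only."""
    def __init__(self, attrs=None, parent=None):
        self.attrs = attrs or {}
        self.parent = parent

def is_premium_extra(element) -> bool:
    """Walk up the tree and report whether any ancestor is the
    'premium_user_extra_list' container."""
    node = element.parent
    while node is not None:
        if node.attrs.get("id") == "premium_user_extra_list":
            return True
        node = node.parent
    return False

def liste_attribute_filter(element) -> bool:
    """Accept 'liste-' elements unless they sit in the premium container."""
    if not isinstance(element, Tag) or "id" not in element.attrs:
        return False
    return element.attrs["id"].startswith("liste-") and not is_premium_extra(element)
```

Excluding one known container, instead of requiring a specific parent, is what keeps the filter working on the flats layout with its extra wrapper element.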
Okay, I've had a look at this, as I was curious. There's a flaw in your logic. Yes, the premium ads you're successfully filtering out sit in the 'premium_user_extra_list' container. As the test grabs the website from a fixed file, it might differ from the current layout used by WG-Gesucht, which would explain why it worked everywhere else but not in the test. Nevertheless, it should pass.
Right, I didn't have a close enough look at the HTML file; you're absolutely right. I've tracked it down to the Python HTML parser, which doesn't seem to like a self-closing link tag: it ends up parsed as the (incorrect) parent tag and doesn't have a class attribute. Your code was fully correct; with lxml as the parser it works perfectly. That's another C-based dependency, which we don't really want (edit: it's a dependency anyway, so it doesn't matter), but the inbuilt parser doesn't seem to be configurable. Perhaps worth switching BeautifulSoup's parser from 'html.parser' to 'lxml'?
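The divergence starts at the tokenizer/tree-builder boundary: Python's stdlib tokenizer reports an XHTML-style self-closed tag through a dedicated callback, and it is then up to the tree builder (BeautifulSoup's, in this case) to decide whether such a tag may actually stay childless. A small stdlib-only sketch of the tokenizer side (the markup is made up for illustration):

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Records which handler fires for each tag, to show how the stdlib
    front-end reports a self-closed tag like <link/>."""
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
    def handle_endtag(self, tag):
        self.events.append(("end", tag))
    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))

p = TagLogger()
p.feed('<div id="main_column"><link rel="x"/><div id="liste-1"></div></div>')
print(p.events)
```

The `<link/>` is reported via `handle_startendtag`, so the information that it was self-closed is available; whether the resulting tree still nests following elements under it is a tree-builder decision, which is where 'html.parser' and lxml can disagree.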
Man, thanks for taking the time to get to the root of this. I was seriously going bonkers, because this was supposed to be a quick fix and I just couldn't understand where I went wrong. So this has me feeling a bit relieved that it comes from a parsing issue somewhere internally in BS, rather than me just not getting it. Do you have an idea on how to resolve this (one that would work with the current parsing setup)?
We could remove the link tags from the test fixture, as they're irrelevant to it anyway. But since the html parser fails at such a simple task, I'd argue in favour of just switching to lxml altogether (302a223). It's already a dependency.
Sounds good to me, I'll reopen this then once the switch is made.
Merged the change to lxml in #519. Hope that helps!
Great, I'll look into it in the coming days.
The new parser did wonders and it works now 👍 Thanks for merging already.
Ooops! I didn't actually mean to press that. Can you re-open it so I can review?
Actually, I don't think I can, but I can make a new PR?
It's fine. I made #521 to revert my revert :) I'll review and merge that when it passes the tests. Thanks for re-opening the PR!
Haha, ok great. I also ran all the tests locally, so it should be fine :)
I've started my flathunting journey today and noticed that on wg-gesucht.de some (very old) listings were presented to me multiple times, even though they were not even included in what I was searching for. It turns out wg-gesucht includes additional results in a div they label "premium_user_extra_list".
Those additional results are currently not filtered out, because only the elements themselves are checked for an id starting with "liste-". That wouldn't be a big deal if they were included once in the beginning and then blocked on subsequent refreshes thanks to the id_maintainer. Unfortunately, the additional listings are somewhat random on each page refresh, presenting you at times with listings that are months old.
This PR changes the filter to only consider the desired search results by looking at the parent element's id and making sure it's 'main_column' and not 'premium_user_extra_list', which contains the additional results.
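The deduplication the description relies on can be sketched like this. This is a toy stand-in: flathunter's actual id_maintainer persists seen ids between runs, while this set-based version only illustrates why randomly rotating extras defeat it.

```python
class SeenIds:
    """Toy model of id-based deduplication: remember ids already
    processed and skip them on later refreshes."""
    def __init__(self):
        self._seen = set()

    def is_new(self, expose_id: int) -> bool:
        if expose_id in self._seen:
            return False
        self._seen.add(expose_id)
        return True

ids = SeenIds()
assert ids.is_new(101) is True   # first refresh: listing is processed
assert ids.is_new(101) is False  # second refresh: filtered as already seen
```

Deduplication only suppresses repeats of ids it has already seen, so a container that injects different old listings on every refresh keeps producing "new" ids and keeps slipping through.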
On the page:
In HTML: