Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for wggesucht crawler to only consider the desired listings #516

Merged
merged 5 commits into from
Jan 24, 2024
Merged

Fix for wggesucht crawler to only consider the desired listings #516

merged 5 commits into from
Jan 24, 2024

Conversation

alvarnydev
Copy link

I've started my flathunting journey today and noticed that on wg-gesucht.de some (very old) listings were presented to me various times even though they were not even included in what I was searching for. Turns out wg-gesucht includes additional results in a div they label "premium_user_extra_list".
Those additional results are currently not filtered out because only the elements themselves are checked for the correct id of "liste-", which wouldn't be a big deal if they are included once in the beginning and then blocked on subsequent refreshs thanks to the to id_maintainer. But unfortunately the additional listings are somewhat random on each page refresh, presenting you at times with listings that are months old.

This PR changes the filter to only consider the desired search results by looking at the parent elements id and making sure it's 'main_column' and not 'premium_user_extra_list', which had the additional results.

On the page:
Screenshot 2024-01-17 at 17 03 07

In HTML:
Screenshot 2024-01-17 at 17 05 34

@codders
Copy link

codders commented Jan 17, 2024

Amazing - thanks for the initiative. Great to have new developers on board :)

The type checker and the linter have a couple of comments - can you look at those?

@alvarnydev
Copy link
Author

Definitely, let me check em out! Thanks for all your amazing work btw :)

@alvarnydev
Copy link
Author

alvarnydev commented Jan 17, 2024

Ok can you help me out with those? I'm not really a python dev and struggle to figure it out.

The linter exits with exit code 28 which to me (after reading this SO post) signals to me that warning, refactor and convention messages were issued. How and where though? The changes are so minimal 😅

The type checker fails because it doesn't know where 'attrs' comes from, but that was already in use on the element so that has me wondering why it fails now. Also how would I define a type for that?

And finally, the test fails because it asserts a len of 20 for the wg-gesucht listings. I constantly get 21 though (or 27 with the additional listings) so I think they just decided to add one to the regular ones which would have us need to change the test logic. Should I do that?

if not isinstance(element, Tag):
return False
if "id" not in element.attrs:
return False
return element.attrs["id"].startswith('liste-')
if "id" not in element.parent.attrs:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem here is that 'element.parent' may be 'None' / null, so 'pyright' complains that you can't rely on 'element.parent.attrs' existing.

You need a check here with ìf element.parent is not None and "id" not in element.parent.attrs`

@@ -148,12 +148,14 @@ def parse_expose_element_to_details(row: Tag, crawler: str) -> Optional[Dict]:


def liste_attribute_filter(element: Union[Tag, str]) -> bool:
"""Return true for elements whose 'id' attribute starts with 'liste-'"""
"""Return true for elements whose 'id' attribute starts with 'liste-' and are contained in a parent with an 'id' attribute of 'main_column'"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linter complains here that the line is too long. It needs to wrap at 100 characters.

@alvarnydev
Copy link
Author

alvarnydev commented Jan 17, 2024

Thanks for the tips! I also just noticed that my fix breaks the wg-gesucht search on another page because they use a different layout depending on whether you look for rooms or flats. And when looking for flats they introduce another wrapper element instead of being directly under the 'main_column' div. Uggh. Let me figure this out and do it right, before you merge.

Rooms:
rooms

Flats:
flats

@alvarnydev
Copy link
Author

Ok so I reversed the logic so that it now excludes results in the 'premium_user_extra_list' container so that it now does not affect other pages. Also I changed the code to adhere to the linter.

Still, the test is gonna fail because the search for 'flats' returns 20 ads whereas the search for 'rooms' returns 21 ads.

I noticed that the search for 'flats' is generally broken at the moment (not only due to my change) because of their differing, maybe new layout. I think this should be handled in another PR though, as this is contextually different from this PRs goal.

@alvarnydev
Copy link
Author

alvarnydev commented Jan 20, 2024

To be honest I can't figure out what's wrong with my code. I don't understand why me checking for the 'class' attribute on the parent results in a len(entries) of 0 in the tests when it clearly exists (see screenshot below). It's also a mystery to me how the change works for me in my live search but does not pass the test. Anyway, I'm closing this for now as I can't figure it out.

Screenshot 2024-01-20 at 10 46 19

Would be nice if someone else managed to incorporate the fix in a way that can pass the tests.

@alvarnydev alvarnydev closed this Jan 20, 2024
@ngdio
Copy link

ngdio commented Jan 21, 2024

Okay, I've had a look at this, as I was curious. There's a flaw in your logic. Yes, the premium ads you're successfully out have a class attribute on the parent, but the regular ads in the failed test do not. Since you're checking for the class attribute, they're collectively filtered out, even though they're fine and should go through. I don't think there's a need to make that check, is there?

As the test grabs the website from a fixed file, it might differ from the current layout used by WG-Gesucht, so that would explain why it worked everywhere else than in the test. Nevertheless, it should pass.

@alvarnydev
Copy link
Author

alvarnydev commented Jan 21, 2024

Thanks for taking the time to look into this and you're absolutely correct that there must be a flaw somewhere in my logic, but I don't see where. The regular ads do have a parent with a class attribute, every one. That's why I don't get it and eventually moved on. Take a look at the screenshots (the first showing the structure of the fixture, the second the structure of the TEST_URL), they're all under the div with the id main_column which has lots of styling classes.

wg-gesucht-spotahome.html

Screenshot 2024-01-21 at 19 28 02

TEST_URL

Screenshot 2024-01-21 at 19 32 15

I do think that the test for the existence of the class attribute attribute is necessary as I want to access it when filtering (also I'm 90% the typing tests would fail otherwise).

@ngdio
Copy link

ngdio commented Jan 21, 2024

Right, I didn't have a close enough look at the HTML file, you're absolutely right. I've tracked it down to the Python HTML parser, which doesn't seem to like a self-closing link tag, which is parsed as the incorrect parent tag and doesn't have a class attribute. Your code was fully correct, as with lxml as a parser it works perfectly. That's another C-based dependency which we don't really want (edit: it's a dependency anyway so it doesn't matter), but the inbuilt parser doesn't seem to be configurable. Perhaps worth switching BeautifulSoup's parser from html.parser to lxml in all modules just for the sake of reliability, the Python one seems quite bad with simple mistakes like this one.

@alvarnydev
Copy link
Author

alvarnydev commented Jan 22, 2024

Man, thanks for taking the time to get to the root of this. I was seriously going bonkers because this was supposed to be a quick fix and I just couldn't understand where I went wrong. So this has me feeling a bit relieved that it comes from a parsing issue somewhere internally in BS, rather than me just not getting it.

Do you have an idea on how to resolve this (that would work for the current parsing setup?

@ngdio
Copy link

ngdio commented Jan 22, 2024

Could remove the link tags from the test fixture as they're irrelevant to it anyway. But since the html parser fails at such a simple task I'd argue in favour of just switching to lxml altogether (302a223). It's already a dependency.

@alvarnydev
Copy link
Author

Sounds good to me, I'll reopen this then once the switch is made

@codders
Copy link

codders commented Jan 22, 2024

Merged the changed to lxml in #519. Hope that helps!

@alvarnydev
Copy link
Author

Great, I'll look into it the coming days

@alvarnydev alvarnydev reopened this Jan 24, 2024
@alvarnydev
Copy link
Author

The new parser did wonders and works now 👍 Thanks for merging already

@codders
Copy link

codders commented Jan 24, 2024

Ooops! didn't actually mean to press that. Can you re-open it so I can review?

@alvarnydev
Copy link
Author

Actually I don't think I can but I can make a new PR?

@codders
Copy link

codders commented Jan 24, 2024

It's fine. I made #521 to revert my revert :) I'll review and merge that when it passes the tests. Thanks for re-opening the PR!

@alvarnydev
Copy link
Author

Haha ok great, I also ran all the tests locally so should be fine :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants