Fix for wggesucht crawler to only consider the desired listings #516

alvarnydev · 2024-01-17T16:12:19Z

I've started my flathunting journey today and noticed that on wg-gesucht.de some (very old) listings were presented to me various times even though they were not even included in what I was searching for. Turns out wg-gesucht includes additional results in a div they label "premium_user_extra_list".
Those additional results are currently not filtered out because only the elements themselves are checked for the correct id of "liste-", which wouldn't be a big deal if they are included once in the beginning and then blocked on subsequent refreshs thanks to the to id_maintainer. But unfortunately the additional listings are somewhat random on each page refresh, presenting you at times with listings that are months old.

This PR changes the filter to only consider the desired search results by looking at the parent elements id and making sure it's 'main_column' and not 'premium_user_extra_list', which had the additional results.

On the page:

In HTML:

codders · 2024-01-17T16:16:09Z

Amazing - thanks for the initiative. Great to have new developers on board :)

The type checker and the linter have a couple of comments - can you look at those?

alvarnydev · 2024-01-17T16:19:12Z

Definitely, let me check em out! Thanks for all your amazing work btw :)

alvarnydev · 2024-01-17T16:35:27Z

Ok can you help me out with those? I'm not really a python dev and struggle to figure it out.

The linter exits with exit code 28 which to me (after reading this SO post) signals to me that warning, refactor and convention messages were issued. How and where though? The changes are so minimal 😅

The type checker fails because it doesn't know where 'attrs' comes from, but that was already in use on the element so that has me wondering why it fails now. Also how would I define a type for that?

And finally, the test fails because it asserts a len of 20 for the wg-gesucht listings. I constantly get 21 though (or 27 with the additional listings) so I think they just decided to add one to the regular ones which would have us need to change the test logic. Should I do that?

codders · 2024-01-17T16:51:27Z

flathunter/crawler/wggesucht.py

    if not isinstance(element, Tag):
        return False
    if "id" not in element.attrs:
        return False
-    return element.attrs["id"].startswith('liste-')
+    if "id" not in element.parent.attrs:


Problem here is that 'element.parent' may be 'None' / null, so 'pyright' complains that you can't rely on 'element.parent.attrs' existing.

You need a check here with ìf element.parent is not None and "id" not in element.parent.attrs`

codders · 2024-01-17T16:52:19Z

flathunter/crawler/wggesucht.py

@@ -148,12 +148,14 @@ def parse_expose_element_to_details(row: Tag, crawler: str) -> Optional[Dict]:


 def liste_attribute_filter(element: Union[Tag, str]) -> bool:
-    """Return true for elements whose 'id' attribute starts with 'liste-'"""
+    """Return true for elements whose 'id' attribute starts with 'liste-' and are contained in a parent with an 'id' attribute of 'main_column'"""


The linter complains here that the line is too long. It needs to wrap at 100 characters.

alvarnydev · 2024-01-17T17:01:43Z

Thanks for the tips! I also just noticed that my fix breaks the wg-gesucht search on another page because they use a different layout depending on whether you look for rooms or flats. And when looking for flats they introduce another wrapper element instead of being directly under the 'main_column' div. Uggh. Let me figure this out and do it right, before you merge.

Rooms:

Flats:

alvarnydev · 2024-01-17T17:31:33Z

Ok so I reversed the logic so that it now excludes results in the 'premium_user_extra_list' container so that it now does not affect other pages. Also I changed the code to adhere to the linter.

Still, the test is gonna fail because the search for 'flats' returns 20 ads whereas the search for 'rooms' returns 21 ads.

I noticed that the search for 'flats' is generally broken at the moment (not only due to my change) because of their differing, maybe new layout. I think this should be handled in another PR though, as this is contextually different from this PRs goal.

alvarnydev · 2024-01-20T09:52:53Z

To be honest I can't figure out what's wrong with my code. I don't understand why me checking for the 'class' attribute on the parent results in a len(entries) of 0 in the tests when it clearly exists (see screenshot below). It's also a mystery to me how the change works for me in my live search but does not pass the test. Anyway, I'm closing this for now as I can't figure it out.

Would be nice if someone else managed to incorporate the fix in a way that can pass the tests.

ngdio · 2024-01-21T18:16:52Z

Okay, I've had a look at this, as I was curious. There's a flaw in your logic. Yes, the premium ads you're successfully out have a class attribute on the parent, but the regular ads in the failed test do not. Since you're checking for the class attribute, they're collectively filtered out, even though they're fine and should go through. I don't think there's a need to make that check, is there?

As the test grabs the website from a fixed file, it might differ from the current layout used by WG-Gesucht, so that would explain why it worked everywhere else than in the test. Nevertheless, it should pass.

alvarnydev · 2024-01-21T18:34:52Z

Thanks for taking the time to look into this and you're absolutely correct that there must be a flaw somewhere in my logic, but I don't see where. The regular ads do have a parent with a class attribute, every one. That's why I don't get it and eventually moved on. Take a look at the screenshots (the first showing the structure of the fixture, the second the structure of the TEST_URL), they're all under the div with the id main_column which has lots of styling classes.

wg-gesucht-spotahome.html

TEST_URL

I do think that the test for the existence of the class attribute attribute is necessary as I want to access it when filtering (also I'm 90% the typing tests would fail otherwise).

ngdio · 2024-01-21T22:28:50Z

Right, I didn't have a close enough look at the HTML file, you're absolutely right. I've tracked it down to the Python HTML parser, which doesn't seem to like a self-closing link tag, which is parsed as the incorrect parent tag and doesn't have a class attribute. Your code was fully correct, as with lxml as a parser it works perfectly. That's another C-based dependency which we don't really want (edit: it's a dependency anyway so it doesn't matter), but the inbuilt parser doesn't seem to be configurable. Perhaps worth switching BeautifulSoup's parser from html.parser to lxml in all modules just for the sake of reliability, the Python one seems quite bad with simple mistakes like this one.

alvarnydev · 2024-01-22T06:52:22Z

Man, thanks for taking the time to get to the root of this. I was seriously going bonkers because this was supposed to be a quick fix and I just couldn't understand where I went wrong. So this has me feeling a bit relieved that it comes from a parsing issue somewhere internally in BS, rather than me just not getting it.

Do you have an idea on how to resolve this (that would work for the current parsing setup?

ngdio · 2024-01-22T10:23:42Z

Could remove the link tags from the test fixture as they're irrelevant to it anyway. But since the html parser fails at such a simple task I'd argue in favour of just switching to lxml altogether (302a223). It's already a dependency.

alvarnydev · 2024-01-22T10:39:29Z

Sounds good to me, I'll reopen this then once the switch is made

codders · 2024-01-22T16:40:44Z

Merged the changed to lxml in #519. Hope that helps!

alvarnydev · 2024-01-23T09:37:38Z

Great, I'll look into it the coming days

alvarnydev · 2024-01-24T15:53:47Z

The new parser did wonders and works now 👍 Thanks for merging already

codders · 2024-01-24T15:54:18Z

Ooops! didn't actually mean to press that. Can you re-open it so I can review?

alvarnydev · 2024-01-24T15:55:01Z

Actually I don't think I can but I can make a new PR?

codders · 2024-01-24T15:57:25Z

It's fine. I made #521 to revert my revert :) I'll review and merge that when it passes the tests. Thanks for re-opening the PR!

alvarnydev · 2024-01-24T15:57:57Z

Haha ok great, I also ran all the tests locally so should be fine :)

Fix for wggesucht crawler to only consider appropriate listings

14164f1

alvarnydev had a problem deploying to test January 17, 2024 16:14 — with GitHub Actions Failure

alvarnydev had a problem deploying to test January 17, 2024 16:14 — with GitHub Actions Error

alvarnydev had a problem deploying to test January 17, 2024 16:14 — with GitHub Actions Failure

codders requested changes Jan 17, 2024

View reviewed changes

Yannik Sahl added 2 commits January 17, 2024 18:34

Inverse logic to exclude the premium list, type checks

ba71395

Merge branch 'flathunters:main' into main

3a8dec8

alvarnydev had a problem deploying to test January 18, 2024 11:56 — with GitHub Actions Failure

alvarnydev temporarily deployed to test January 18, 2024 11:56 — with GitHub Actions Inactive

alvarnydev had a problem deploying to test January 18, 2024 11:56 — with GitHub Actions Failure

alvarnydev requested a review from codders January 18, 2024 20:53

alvarnydev closed this Jan 20, 2024

alvarnydev reopened this Jan 20, 2024

alvarnydev closed this Jan 20, 2024

codders mentioned this pull request Jan 22, 2024

Switch html.parser for lxml in BeautifulSoup invocations #519

Merged

Merge branch 'flathunters:main' into main

df61d68

alvarnydev reopened this Jan 24, 2024

comment update to pass linter

c4e9b4e

codders merged commit 8c2f53e into flathunters:main Jan 24, 2024

codders mentioned this pull request Jan 24, 2024

Revert "Fix for wggesucht crawler to only consider the desired listings" #520

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for wggesucht crawler to only consider the desired listings #516

Fix for wggesucht crawler to only consider the desired listings #516

alvarnydev commented Jan 17, 2024

codders commented Jan 17, 2024

alvarnydev commented Jan 17, 2024

alvarnydev commented Jan 17, 2024 •

edited

Loading

codders Jan 17, 2024

codders Jan 17, 2024

alvarnydev commented Jan 17, 2024 •

edited

Loading

alvarnydev commented Jan 17, 2024

alvarnydev commented Jan 20, 2024 •

edited

Loading

ngdio commented Jan 21, 2024

alvarnydev commented Jan 21, 2024 •

edited

Loading

ngdio commented Jan 21, 2024 •

edited

Loading

alvarnydev commented Jan 22, 2024 •

edited

Loading

ngdio commented Jan 22, 2024

alvarnydev commented Jan 22, 2024

codders commented Jan 22, 2024

alvarnydev commented Jan 23, 2024

alvarnydev commented Jan 24, 2024

codders commented Jan 24, 2024

alvarnydev commented Jan 24, 2024

codders commented Jan 24, 2024

alvarnydev commented Jan 24, 2024

Fix for wggesucht crawler to only consider the desired listings #516

Fix for wggesucht crawler to only consider the desired listings #516

Conversation

alvarnydev commented Jan 17, 2024

codders commented Jan 17, 2024

alvarnydev commented Jan 17, 2024

alvarnydev commented Jan 17, 2024 • edited Loading

codders Jan 17, 2024

Choose a reason for hiding this comment

codders Jan 17, 2024

Choose a reason for hiding this comment

alvarnydev commented Jan 17, 2024 • edited Loading

alvarnydev commented Jan 17, 2024

alvarnydev commented Jan 20, 2024 • edited Loading

ngdio commented Jan 21, 2024

alvarnydev commented Jan 21, 2024 • edited Loading

ngdio commented Jan 21, 2024 • edited Loading

alvarnydev commented Jan 22, 2024 • edited Loading

ngdio commented Jan 22, 2024

alvarnydev commented Jan 22, 2024

codders commented Jan 22, 2024

alvarnydev commented Jan 23, 2024

alvarnydev commented Jan 24, 2024

codders commented Jan 24, 2024

alvarnydev commented Jan 24, 2024

codders commented Jan 24, 2024

alvarnydev commented Jan 24, 2024

alvarnydev commented Jan 17, 2024 •

edited

Loading

alvarnydev commented Jan 17, 2024 •

edited

Loading

alvarnydev commented Jan 20, 2024 •

edited

Loading

alvarnydev commented Jan 21, 2024 •

edited

Loading

ngdio commented Jan 21, 2024 •

edited

Loading

alvarnydev commented Jan 22, 2024 •

edited

Loading