findspam.py: Add IP addresses to websites, too (#3722) #3789

tripleee · 2020-04-02T17:47:24Z

This crudely adds the blacklisted IP addresses to the blacklisted_websites regex (which should be fine) and the watched ones to watched_keywords (which may be slightly more dubious). It should fix #3722

makyen · 2020-04-02T18:49:38Z

findspam.py

        try:
            del cls.rule_watched_keywords.compiled_regex
        except AttributeError:
            pass
        cls.rule_watched_keywords.sanity_check()
        cls.rule_blacklisted_websites.regex = r"(?i)({})".format(
-            "|".join(GlobalVars.blacklisted_websites))
+            "|".join(GlobalVars.blacklisted_websites +
+                [regex.escape(x) for x in GlobalVars.blacklisted_nses]))
        try:


Shouldn't this be blacklisted_cidrs, rather than blacklisted_nses?

Whoa, thanks for noticing!

makyen · 2020-04-02T19:25:20Z

What is the goal in adding these to the watchlist/blacklist keywords? In other words, is there something that isn't being detected (e.g. IP addresses being used in text, which isn't currently detected)? Is the goal to just add an additional detection/weight when the post uses the raw IP address, rather than a domain name in a link?

Doesn't the IP address already get detected by "bad IP for hostname in {}" if the raw IP address is used in the HTML (or at least within a link)? Is problem this change is trying to resolve that if the IP address is being used in text, but not within a link, then it isn't being detected by the IP address check?

This looks like it will always apply two detection reasons if the post uses a raw watched/blacklisted IP address within the HTML, or at least within a link. Is that really what we want to happen? I'm not saying I'm against doing that, as the raw IP address being used in a link is likely a very strong indicator of spam, I'm just checking if that's what we want to always be doing, given that we usually try to avoid double-detecting things. Arguably, it's detecting two different things: the use of the IP address as text and the URL resolving to an IP address which is blacklisted.

While the following is a concern, it's probably just something we deal with, if it ever comes up:
Adding these directly to the watched/blacklisted keywords means it will be difficult/compute intensive for us to create an exclusion for some matches of the IP, if there ends up being some need to exclude some use of the IP address text. In other words, if we need to create an exclusion due to some strange FP match, then a naive implementation would result in the exclusion being tested for on every single watch/blacklist match, rather than just on the matches which match that IP, or we'll have to pre-process the list to add the exclusion). Such a need is, of course, hypothetical.

tripleee · 2020-04-02T19:28:52Z

Indeed, there are some posts with IP addresses which were not detected, as outlined in bug report #3722. I have probably not listed every time that happened; I know there was one just yesterday or the day before again, which caused me to revisit this bug and finally try to do something about it.

I don't think this will double the reasons anywhere; on the contrary, it plugs a hole where posts were evading the check when they used only an IP address in the link.

tripleee · 2020-04-02T19:36:23Z

Looking at recent hits for http://1 I notice that the Indonesian gambling spammer uses IP addresses fairly often, though none of these seem to be good examples. The one which was missed the other day was theirs IIRC, but probably had an IP address starting with a different number.

tripleee · 2020-04-03T08:19:41Z

The Circle CI failure is a recurring problem, it seems to fail spuriously in an unrelated chatcommands test but only part of the time. (My other pending PR failed at first, and then succeeded after rebase; here, we see the opposite.)

stale · 2020-05-05T11:07:47Z

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

user12986714 · 2020-05-28T01:32:25Z

So the issue looks like that raw ip addresses in posts are not currently subject to ip or asn list checks. The nature of those two problems appears to be the same. Take ip watchlist for example. As we can see, in findspam.py, in function watched_ip_for_url_hostname(), it calls ip_for_url_host():

return ip_for_url_host(s, site, GlobalVars.watched_cidrs)

Then in ip_for_url_host(), it calls dns_query():

        a = dns_query(hostname, 'a')
        if a is not None:
            # Do something

The problem is, if hostname is a raw ip, dns_query() will return None (cannot resolve).

    try:
        starttime = datetime.utcnow()
        answer = dns.resolver.query(label, qtype)
    except dns.exception.DNSException as exc:
        # Some logging
        return None

Hence all raw ip won't be inspected by that rule.
The solution proposed in this PR, appending ip black- /watchlist to keyword black- /watchlist, would solve the ip check problems, but not the asn ones, and is probably not very extensible (such as to cidr).
A way to fix this is to detect raw ip address in dns_query() (with a regex like [1-2]?[0-9]?[0-9]\.[1-2]?[0-9]?[0-9]\.[1-2]?[0-9]?[0-9]\.[1-2]?[0-9]?[0-9], and if detected, construct an answer with that ip, rather than treating it as a domain and hence fail to resolve.

makyen reviewed Apr 2, 2020

View reviewed changes

findspam.py: Add IP addresses to websites, too (Charcoal-SE#3722)

298966f

tripleee force-pushed the extend-ip-check branch from b66c54b to 298966f Compare April 3, 2020 07:35

tripleee mentioned this pull request Apr 3, 2020

Spurious failures in CI #3793

Closed

stale bot added the status: stale label May 4, 2020

stale bot closed this May 5, 2020

tripleee removed the status: stale label May 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

findspam.py: Add IP addresses to websites, too (#3722) #3789

findspam.py: Add IP addresses to websites, too (#3722) #3789

tripleee commented Apr 2, 2020 •

edited

Loading

makyen Apr 2, 2020 •

edited

Loading

tripleee Apr 2, 2020

makyen commented Apr 2, 2020

tripleee commented Apr 2, 2020 •

edited

Loading

tripleee commented Apr 2, 2020

tripleee commented Apr 3, 2020

stale bot commented May 5, 2020

user12986714 commented May 28, 2020

findspam.py: Add IP addresses to websites, too (#3722) #3789

findspam.py: Add IP addresses to websites, too (#3722) #3789

Conversation

tripleee commented Apr 2, 2020 • edited Loading

makyen Apr 2, 2020 • edited Loading

Choose a reason for hiding this comment

tripleee Apr 2, 2020

Choose a reason for hiding this comment

makyen commented Apr 2, 2020

tripleee commented Apr 2, 2020 • edited Loading

tripleee commented Apr 2, 2020

tripleee commented Apr 3, 2020

stale bot commented May 5, 2020

user12986714 commented May 28, 2020

tripleee commented Apr 2, 2020 •

edited

Loading

makyen Apr 2, 2020 •

edited

Loading

tripleee commented Apr 2, 2020 •

edited

Loading