Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findspam.py: Add IP addresses to websites, too (#3722) #3789

Closed
wants to merge 1 commit into from

Conversation

tripleee
Copy link
Member

@tripleee tripleee commented Apr 2, 2020

This crudely adds the blacklisted IP addresses to the blacklisted_websites regex (which should be fine) and the watched ones to watched_keywords (which may be slightly more dubious). It should fix #3722

try:
del cls.rule_watched_keywords.compiled_regex
except AttributeError:
pass
cls.rule_watched_keywords.sanity_check()
cls.rule_blacklisted_websites.regex = r"(?i)({})".format(
"|".join(GlobalVars.blacklisted_websites))
"|".join(GlobalVars.blacklisted_websites +
[regex.escape(x) for x in GlobalVars.blacklisted_nses]))
try:
Copy link
Contributor

@makyen makyen Apr 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be blacklisted_cidrs, rather than blacklisted_nses?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoa, thanks for noticing!

@makyen
Copy link
Contributor

makyen commented Apr 2, 2020

What is the goal in adding these to the watchlist/blacklist keywords? In other words, is there something that isn't being detected (e.g. IP addresses being used in text, which isn't currently detected)? Is the goal to just add an additional detection/weight when the post uses the raw IP address, rather than a domain name in a link?

Doesn't the IP address already get detected by "bad IP for hostname in {}" if the raw IP address is used in the HTML (or at least within a link)? Is problem this change is trying to resolve that if the IP address is being used in text, but not within a link, then it isn't being detected by the IP address check?

This looks like it will always apply two detection reasons if the post uses a raw watched/blacklisted IP address within the HTML, or at least within a link. Is that really what we want to happen? I'm not saying I'm against doing that, as the raw IP address being used in a link is likely a very strong indicator of spam, I'm just checking if that's what we want to always be doing, given that we usually try to avoid double-detecting things. Arguably, it's detecting two different things: the use of the IP address as text and the URL resolving to an IP address which is blacklisted.

While the following is a concern, it's probably just something we deal with, if it ever comes up:
Adding these directly to the watched/blacklisted keywords means it will be difficult/compute intensive for us to create an exclusion for some matches of the IP, if there ends up being some need to exclude some use of the IP address text. In other words, if we need to create an exclusion due to some strange FP match, then a naive implementation would result in the exclusion being tested for on every single watch/blacklist match, rather than just on the matches which match that IP, or we'll have to pre-process the list to add the exclusion). Such a need is, of course, hypothetical.

@tripleee
Copy link
Member Author

tripleee commented Apr 2, 2020

Indeed, there are some posts with IP addresses which were not detected, as outlined in bug report #3722. I have probably not listed every time that happened; I know there was one just yesterday or the day before again, which caused me to revisit this bug and finally try to do something about it.

I don't think this will double the reasons anywhere; on the contrary, it plugs a hole where posts were evading the check when they used only an IP address in the link.

@tripleee
Copy link
Member Author

tripleee commented Apr 2, 2020

Looking at recent hits for http://1 I notice that the Indonesian gambling spammer uses IP addresses fairly often, though none of these seem to be good examples. The one which was missed the other day was theirs IIRC, but probably had an IP address starting with a different number.

@tripleee
Copy link
Member Author

tripleee commented Apr 3, 2020

The Circle CI failure is a recurring problem, it seems to fail spuriously in an unrelated chatcommands test but only part of the time. (My other pending PR failed at first, and then succeeded after rebase; here, we see the opposite.)

@tripleee tripleee mentioned this pull request Apr 3, 2020
@stale stale bot added the status: stale label May 4, 2020
@stale
Copy link

stale bot commented May 5, 2020

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

@stale stale bot closed this May 5, 2020
@user12986714
Copy link
Contributor

So the issue looks like that raw ip addresses in posts are not currently subject to ip or asn list checks. The nature of those two problems appears to be the same. Take ip watchlist for example. As we can see, in findspam.py, in function watched_ip_for_url_hostname(), it calls ip_for_url_host():

return ip_for_url_host(s, site, GlobalVars.watched_cidrs)

Then in ip_for_url_host(), it calls dns_query():

        a = dns_query(hostname, 'a')
        if a is not None:
            # Do something

The problem is, if hostname is a raw ip, dns_query() will return None (cannot resolve).

    try:
        starttime = datetime.utcnow()
        answer = dns.resolver.query(label, qtype)
    except dns.exception.DNSException as exc:
        # Some logging
        return None

Hence all raw ip won't be inspected by that rule.
The solution proposed in this PR, appending ip black- /watchlist to keyword black- /watchlist, would solve the ip check problems, but not the asn ones, and is probably not very extensible (such as to cidr).
A way to fix this is to detect raw ip address in dns_query() (with a regex like [1-2]?[0-9]?[0-9]\.[1-2]?[0-9]?[0-9]\.[1-2]?[0-9]?[0-9]\.[1-2]?[0-9]?[0-9], and if detected, construct an answer with that ip, rather than treating it as a domain and hence fail to resolve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Watched/blacklisted IP should trigger on IP-only URLs
3 participants