-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
findspam.py: Add IP addresses to websites, too (#3722) #3789
Conversation
try: | ||
del cls.rule_watched_keywords.compiled_regex | ||
except AttributeError: | ||
pass | ||
cls.rule_watched_keywords.sanity_check() | ||
cls.rule_blacklisted_websites.regex = r"(?i)({})".format( | ||
"|".join(GlobalVars.blacklisted_websites)) | ||
"|".join(GlobalVars.blacklisted_websites + | ||
[regex.escape(x) for x in GlobalVars.blacklisted_nses])) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be blacklisted_cidrs
, rather than blacklisted_nses
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoa, thanks for noticing!
What is the goal in adding these to the watchlist/blacklist keywords? In other words, is there something that isn't being detected (e.g. IP addresses being used in text, which isn't currently detected)? Is the goal to just add an additional detection/weight when the post uses the raw IP address, rather than a domain name in a link? Doesn't the IP address already get detected by "bad IP for hostname in {}" if the raw IP address is used in the HTML (or at least within a link)? Is problem this change is trying to resolve that if the IP address is being used in text, but not within a link, then it isn't being detected by the IP address check? This looks like it will always apply two detection reasons if the post uses a raw watched/blacklisted IP address within the HTML, or at least within a link. Is that really what we want to happen? I'm not saying I'm against doing that, as the raw IP address being used in a link is likely a very strong indicator of spam, I'm just checking if that's what we want to always be doing, given that we usually try to avoid double-detecting things. Arguably, it's detecting two different things: the use of the IP address as text and the URL resolving to an IP address which is blacklisted. While the following is a concern, it's probably just something we deal with, if it ever comes up: |
Indeed, there are some posts with IP addresses which were not detected, as outlined in bug report #3722. I have probably not listed every time that happened; I know there was one just yesterday or the day before again, which caused me to revisit this bug and finally try to do something about it. I don't think this will double the reasons anywhere; on the contrary, it plugs a hole where posts were evading the check when they used only an IP address in the link. |
Looking at recent hits for |
b66c54b
to
298966f
Compare
The Circle CI failure is a recurring problem, it seems to fail spuriously in an unrelated |
This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions. |
So the issue looks like that raw ip addresses in posts are not currently subject to ip or asn list checks. The nature of those two problems appears to be the same. Take ip watchlist for example. As we can see, in findspam.py, in function return ip_for_url_host(s, site, GlobalVars.watched_cidrs) Then in a = dns_query(hostname, 'a')
if a is not None:
# Do something The problem is, if try:
starttime = datetime.utcnow()
answer = dns.resolver.query(label, qtype)
except dns.exception.DNSException as exc:
# Some logging
return None Hence all raw ip won't be inspected by that rule. |
This crudely adds the blacklisted IP addresses to the blacklisted_websites regex (which should be fine) and the watched ones to watched_keywords (which may be slightly more dubious). It should fix #3722