findspam.py: refactor and unify link extraction (and perhaps overall post handling) #2500
Labels
- area: spamchecks (Detections or the process of testing posts. No space in the label because of Hacktoberfest.)
- status: confirmed (Confirmed as something that needs working on.)
- type: refactor
There are multiple overlapping and sometimes conflicting attempts to enumerate all the links in a post in `findspam.py`. See below for a sampling.

We should unify these, and ideally reduce the number of times we iterate over the message text looking for more or less the same information.
An object-oriented approach to the entire problem would seem like a natural but somewhat involved solution. Instead of scanning the raw text of the post over and over, use the `_Post` object we already have to store the links once, and then just use the object's methods to retrieve the links you want.

This should also make it easier to keep information about a post's features between different methods in `findspam.py` which look for distinct but related features in a post.

Just to give you an idea of the scope of the problem, here are a few of the methods which attempt to analyze links.
- `misleading_link` looks for an `<a href="...">...</a>` snippet using a fairly straightforward regex.
- `link_at_end` has a different regex which looks for a URL in the anchor text, with some simple whitelisting of a few domains.
- `non_english_link` has a similar structure, with a different regex, without whitelisting.
- `keyword_link` has a fairly simple regex for the actual link, and a messy regex over a few lines of strings for the keywords in the anchor text.
- `bad_link_text` has a different link regex, more similar to the `link_at_end`/`non_english_link` regex, and a bit over half a dozen lines of messy "bad keyword" regex.
- `bad_pattern_in_url` has a different version of the `<a href="...">...</a>` regex, and two lines of bad patterns. (Confession: This one is mine.)
- `post_links` is a helper method used by several other methods. It tries to enumerate links using the global `URL_REGEX`. This might be a good starting point for unifying the above methods, but it needs to be significantly more complex if it is to completely replace all the ad-hoc regexes in those other methods.
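
To make the "store the links once" idea a bit more concrete, here is a minimal sketch. It is not the existing `_Post` API: `Link`, `PostLinks`, `_AnchorCollector`, and both query methods are hypothetical names made up for illustration, and it swaps the per-method regexes for the standard library's `html.parser`. It assumes the post body is HTML and only covers `<a href>` links, not bare URLs in the text.

```python
import re
from dataclasses import dataclass
from functools import cached_property
from html.parser import HTMLParser

# Illustrative pattern only; not the project's URL_REGEX.
URL_LIKE_TEXT = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)


@dataclass(frozen=True)
class Link:
    href: str  # value of the href attribute
    text: str  # anchor text with surrounding whitespace stripped


class _AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a href> elements in one pass."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text seen inside the currently open <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self._href = href
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(Link(self._href, "".join(self._chunks).strip()))
            self._href = None


class PostLinks:
    """Hypothetical stand-in for link handling on a post object."""

    def __init__(self, body):
        self.body = body

    @cached_property
    def links(self):
        # Parsed on first access, then reused by every check.
        parser = _AnchorCollector()
        parser.feed(self.body)
        parser.close()
        return parser.links

    def links_with_url_text(self):
        """Links whose anchor text itself looks like a URL."""
        return [link for link in self.links if URL_LIKE_TEXT.match(link.text)]

    def links_matching_text(self, pattern):
        """Links whose anchor text matches a (case-insensitive) keyword regex."""
        rx = re.compile(pattern, re.IGNORECASE)
        return [link for link in self.links if rx.search(link.text)]
```

With something along these lines, a check in the style of `misleading_link` could start from `links_with_url_text()` and compare `href` against `text`, while keyword-style checks could pass their patterns to `links_matching_text()`, so the body is parsed only once per post. Whitelisting and the handling of bare URLs would still need to be layered on top.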