findspam.py: refactor and unify link extraction (and perhaps overall post handling) #2500

tripleee · 2018-10-29T10:03:09Z

There are multiple overlapping and sometimes conflicting attempts to enumerate all the links in a post in findspam.py. See below for a sampling.

We should unify these, and ideally reduce the number of times we iterate over the message text looking for more or less the same information.

An object-oriented approach to the entire problem would seem like a natural but somewhat involved solution. Instead of scanning the raw text of the post over and over, use the _Post object we already have to store the links once, and then just use the object's methods to retrieve the links you want.

This should also make it easier to keep information about a post's features between different methods in findspam.py which look for distinct but related features in a post.
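To make the idea concrete, here is a minimal sketch of a post object that scans its body for links once and caches the result. Everything in it is an assumption for illustration: `PostSketch`, `links`, `links_to`, and the simplified `_ANCHOR_RE` pattern are hypothetical names, not the existing Post class or the project's URL_REGEX.

```python
import re

# Hypothetical, deliberately simple anchor pattern (NOT the project's URL_REGEX).
_ANCHOR_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
                        re.IGNORECASE | re.DOTALL)


class PostSketch:
    """Illustrative stand-in for a post object; not the real Post class."""

    def __init__(self, body):
        self.body = body
        self._links = None  # filled in on first access

    @property
    def links(self):
        # Scan the body only once; later accesses reuse the cached list.
        if self._links is None:
            self._links = [(m.group(1), m.group(2))
                           for m in _ANCHOR_RE.finditer(self.body)]
        return self._links

    def links_to(self, domain):
        # Example of a retrieval method built on the cached list.
        return [(href, text) for href, text in self.links if domain in href]


post = PostSketch('<p>See <a href="https://example.com/x">here</a> and '
                  '<a href="https://example.org/y">there</a></p>')
print(post.links)
print(post.links_to("example.org"))
```

With something along these lines, each check would ask the object for links instead of running its own regex over the raw body.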

Just to give you an idea of the scope of the problem, here are a few of the methods which attempt to analyze links.

misleading_link looks for an <a href="...">...</a> snippet using a fairly straightforward regex.
link_at_end has a different regex which looks for a URL in the anchor text, with some simple whitelisting of a few domains.
non_english_link has a similar structure, with a different regex, without whitelisting.
keyword_link has a fairly simple regex for the actual link, and a messy regex over a few lines of strings for the keywords in the anchor text.
bad_link_text has yet another link regex, more similar to the link_at_end/non_english_link one, and a bit over half a dozen lines of messy "bad keyword" regex.
bad_pattern_in_url has a different version of the <a href=...>...</a> regex, and two lines of bad patterns. (Confession: This one is mine.)
post_links is a helper method used by several other methods. It tries to enumerate links using the global URL_REGEX. This might be a good starting point for unifying the above methods, but needs to be significantly more complex if it should completely replace all the ad-hoc regexes in those other methods.
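As a rough illustration of what a single shared extractor could look like, the sketch below collects every anchor's target and text in one pass, and the individual checks become small filters over that list. It uses the standard library's `html.parser` rather than a regex, which is a swapped-in technique and not what findspam.py does today. `Link`, `LinkCollector`, `extract_links`, `anchor_has_keyword`, and `link_hosts` are all hypothetical names.

```python
from collections import namedtuple
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical record type: one entry per <a> element in the post body.
Link = namedtuple("Link", "href text")


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs in a single pass over the HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # not None while we are inside an <a> element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(Link(self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(body):
    parser = LinkCollector()
    parser.feed(body)
    parser.close()
    return parser.links


# Illustrative checks that reuse the same list instead of re-scanning the body.
def anchor_has_keyword(links, keywords):
    return any(k in link.text.lower() for link in links for k in keywords)


def link_hosts(links):
    return {urlparse(link.href).hostname for link in links}


links = extract_links('<p><a href="https://example.com/a">Buy now</a> and '
                      '<a href="http://spam.example.net/">x <b>y</b></a></p>')
print(links)
print(anchor_has_keyword(links, ["buy"]), link_hosts(links))
```

Bare URLs outside of anchors are not covered here; those would still need something like the existing URL_REGEX scan, merged into the same list.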


stale · 2019-10-29T23:53:11Z

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

angussidney added the area: spamchecks and type: refactor labels Nov 1, 2018

stale bot added the status: stale label Oct 25, 2019

stale bot closed this as completed Oct 29, 2019

makyen added the status: confirmed label Mar 1, 2020

makyen reopened this Mar 1, 2020

stale bot removed the status: stale label Mar 1, 2020

user12986714 mentioned this issue May 16, 2020

Ignore code blocks for English-text-on-localized-site #3853

Closed
