findspam.py: refactor and unify link extraction (and perhaps overall post handling) #2500

Open
tripleee opened this issue Oct 29, 2018 · 1 comment
Labels
area: spamchecks · status: confirmed · type: refactor

Comments

@tripleee
Member

There are multiple overlapping and sometimes conflicting attempts to enumerate all the links in a post in findspam.py. See below for a sampling.

We should unify these, and ideally reduce the number of times we iterate over the message text looking for more or less the same information.

An object-oriented approach to the entire problem would seem like a natural but somewhat involved solution. Instead of scanning the raw text of the post over and over, use the _Post object we already have to store the links once, and then just use the object's methods to retrieve the links you want.

This should also make it easier to keep information about a post's features between different methods in findspam.py which look for distinct but related features in a post.
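For illustration, here is a rough sketch of what that caching could look like. The class name, regex, and method names below are made up for the example and do not match the real _Post implementation; the point is simply that the body gets scanned once and every check reads from the cached result.

```python
import re
from functools import cached_property

# Hypothetical single "extract everything" pattern; the real checks each
# use their own variants, which is exactly the duplication to get rid of.
ANCHOR_RE = re.compile(
    r'<a\s+href="(?P<url>[^"]+)"[^>]*>(?P<text>.*?)</a>',
    re.IGNORECASE | re.DOTALL)


class Post:
    """Minimal stand-in for the existing _Post object."""

    def __init__(self, body):
        self.body = body

    @cached_property
    def links(self):
        """Scan the rendered body once and cache (url, anchor text) pairs."""
        return [(m.group("url"), m.group("text").strip())
                for m in ANCHOR_RE.finditer(self.body)]

    def link_urls(self):
        return [url for url, _ in self.links]

    def anchor_texts(self):
        return [text for _, text in self.links]
```

Individual checks would then call `post.links` (or the convenience accessors) instead of re-running their own regexes over `post.body`.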

Just to give you an idea of the scope of the problem, here are a few of the methods which attempt to analyze links.

  • misleading_link looks for an <a href="...">...</a> snippet using a fairly straightforward regex.
  • link_at_end has a different regex which looks for a URL in the anchor text, with some simple whitelisting of a few domains.
  • non_english_link has a similar structure, with a different regex, without whitelisting.
  • keyword_link has a fairly simple regex for the actual link, and a messy regex over a few lines of strings for the keywords in the anchor text.
  • bad_link_text has a different link regex, more similar to the link_at_end/non_english_link ones, and a bit over half a dozen lines of messy "bad keyword" regex.
  • bad_pattern_in_url has a different version of the <a href=...>...</a> regex, and two lines of bad patterns. (Confession: This one is mine.)
  • post_links is a helper method used by several other methods. It tries to enumerate links using the global URL_REGEX. This might be a good starting point for unifying the above methods, but it would need to become significantly more complex if it is to completely replace all the ad-hoc regexes in those other methods (see the sketch below).
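As a rough idea of how one of these checks could consume the shared data, here is a sketch of a keyword-in-anchor-text rule written against the cached links from the earlier sketch. The keyword list and the (matched, reason) return shape are placeholders for the example, not the actual findspam.py conventions.

```python
import re

# Placeholder keyword pattern; the real checks keep much longer lists.
KEYWORD_RE = re.compile(r"\b(?:click here|read more|cheap|discount)\b",
                        re.IGNORECASE)


def bad_keyword_in_link_text(post):
    """Flag posts whose anchor text matches a spammy keyword.

    Works from post.links (parsed once, as in the sketch above) instead of
    running yet another <a href=...>...</a> regex over the raw body.
    """
    for url, text in post.links:
        match = KEYWORD_RE.search(text)
        if match:
            return True, "bad keyword {!r} in link text for {}".format(
                match.group(0), url)
    return False, ""
```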
@angussidney added the area: spamchecks and type: refactor labels Nov 1, 2018
@stale stale bot added the status: stale label Oct 25, 2019
@stale stale bot commented Oct 29, 2019

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

@stale stale bot closed this as completed Oct 29, 2019
@makyen added the status: confirmed label Mar 1, 2020
@makyen reopened this Mar 1, 2020
@stale stale bot removed the status: stale label Mar 1, 2020