Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should blacklisted_websites be anchored? #2796

Open
tripleee opened this issue Mar 27, 2019 · 4 comments
Open

Should blacklisted_websites be anchored? #2796

tripleee opened this issue Mar 27, 2019 · 4 comments
Labels
area: blacklists status: confirmed Confirmed as something that needs working on. type: feedback wanted "Closed as too opinion-based."

Comments

@tripleee
Copy link
Member

I thought this was a regression but it looks (from going back a couple of years in Git history) that the regex for blacklisted websites was in fact never anchored.

We have a false positive where ello.co matches trello.com and of course, fixing that in isolation would be trivial; but do we really have any cases where the absence of anchoring is actually a feature, or is this a bug which should have been fixed long ago?

@ArtOfCode-
Copy link
Member

Sounds buggish to me

@makyen
Copy link
Contributor

makyen commented Mar 27, 2019

I'd also looked into this, and found the same thing. My opinion is that these definitely should be anchored. Not anchoring them results in FP and is contrary to most people's expectations when moving something from the watchlist to the website blacklist. If a particular regex is intended to be matched unanchored, then it should be specifically constructed to do so, rather than all regexes in the list needing to be constructed with their own individual anchors.

My impression from previously looking at the regexes was that some few of the regexes rely on the fact that it's not anchored, but that the majority were written with the expectation that they are anchored and occasionally produce FP. Unfortunately, the only way I've seen to accurately determine if each individual regex is intended to be anchored or unanchored would be to run each regex through a MS search with and without the anchor which we choose and look at the number of TP and FP that are produced in each case.

Note: the anchor needs to be more complex than just \b, due to the validity of - in domain names and the assumption that a website regex should, by default, match up to the TLD, rather than only a subdomain (i.e. regexes wanting to match only a subdomain should be required to be specifically constructed to do so). The need for different anchors than are used in the watchlist/blacklist-keyword argues that we should have a watchlist that is specifically for websites. Doing so would make it easier to accurately determine the effectiveness of a watched domain.

@angussidney angussidney added area: blacklists type: feedback wanted "Closed as too opinion-based." labels Apr 6, 2019
@makyen
Copy link
Contributor

makyen commented Oct 9, 2019

Currently, for those website blacklist items which I see detect something they aren't intended to detect, I'm using bookends of:

\b(?<![^\W_]-)foobar\.com(?![.-][^\W_])\b

@stale stale bot added the status: stale label Nov 9, 2019
@stale
Copy link

stale bot commented Nov 10, 2019

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

@stale stale bot closed this as completed Nov 10, 2019
@makyen makyen added the status: confirmed Confirmed as something that needs working on. label Mar 1, 2020
@makyen makyen reopened this Mar 1, 2020
@stale stale bot removed the status: stale label Mar 1, 2020
user12986714 added a commit to user12986714/SmokeDetector that referenced this issue Jul 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: blacklists status: confirmed Confirmed as something that needs working on. type: feedback wanted "Closed as too opinion-based."
Development

No branches or pull requests

4 participants