Ignore code blocks for English-text-on-localized-site #3853

Closed
wants to merge 3 commits into from

user12986714
Contributor

This addresses issue #3844.

Changed the regex so that any code block is excluded when evaluating whether a post contains English text on a localized site.

Warning: the updated regex excludes every pattern of the form "sometext>". Spammers may be able to exploit this by appending a ">" to a blacklisted word to circumvent the blacklist.
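
To make the concern concrete, here is a minimal sketch. Everything in it is a hypothetical stand-in: the pattern `\S+>` only approximates "exclude anything shaped like sometext>", and the blacklist entry and the `is_flagged` helper are invented for illustration. None of it is the regex or the code from this PR.

```python
import re

# HYPOTHETICAL stand-in for the exclusion: drop any run of
# non-whitespace characters that ends in ">".
EXCLUDE_RE = re.compile(r"\S+>")

# HYPOTHETICAL blacklist entry, for illustration only.
BLACKLIST = {"buycheappills"}


def is_flagged(text: str) -> bool:
    """Return True if a blacklisted word survives the exclusion step."""
    cleaned = EXCLUDE_RE.sub("", text)
    return any(word in cleaned.lower() for word in BLACKLIST)


plain = is_flagged("visit buycheappills today")     # word is still seen
evaded = is_flagged("visit buycheappills> today")   # word is stripped with the ">"
```

Under these assumptions, `plain` is `True` and `evaded` is `False`: the trailing ">" causes the whole word to be removed before the check runs.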

Changed regex so any code block will be excluded when evaluating whether there is English text on localized site.
Updated regex so urls within html tags get excluded when evaluating whether there is English text on localized sites.
@makyen
Contributor

makyen commented May 16, 2020

Just FYI: the following regex will match valid HTML tags in HTML received from SE (and then pre-processed by SD):

/<((?:\/?(?:b|blockquote|code|del|dd|dl|dt|em|h[123]|i|kbd|li|p|s|sup|sub|strong|strike|ul|br|hr))|(?:\/(?:a|div|img|ol|pre|span))|(?:a|div|img|ol|pre|span)\b.*?)\s*\/?>/

Note that this is a JavaScript regex literal, so you'll want to drop the / at the start and end.

The above regex could be used to strip the HTML tags from the text prior to running the detection. You may need to look at the interaction of how we strip code blocks. There was an issue in how that was done for another detection. I'd need to double check...
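
A rough sketch of that idea in Python, assuming the detection receives the post body as a single HTML string. The helper name `strip_html_tags` is hypothetical and is not taken from the codebase. The pattern is the one quoted above with the JavaScript `/` delimiters dropped. This only removes the tags themselves, not the contents of code blocks, so it would still need to be combined with whatever code-block stripping is already in place.

```python
import re

# The regex from the comment above, without the JavaScript / delimiters.
# Python's re module accepts the escaped "\/" as a literal slash.
HTML_TAG_RE = re.compile(
    r"<((?:\/?(?:b|blockquote|code|del|dd|dl|dt|em|h[123]|i|kbd|li|p|s"
    r"|sup|sub|strong|strike|ul|br|hr))"
    r"|(?:\/(?:a|div|img|ol|pre|span))"
    r"|(?:a|div|img|ol|pre|span)\b.*?)\s*\/?>"
)


def strip_html_tags(html: str) -> str:
    """Replace each recognized HTML tag with a space (hypothetical helper).

    A space is used rather than an empty string so that words on either
    side of a tag are not joined together.
    """
    return HTML_TAG_RE.sub(" ", html)


sample = '<p>Hello <a href="https://example.com/x">world</a></p>'
words = strip_html_tags(sample).split()
```

Text that merely contains angle brackets, such as `1 < 2 > 0`, is left alone because the character after `<` has to begin one of the listed tag names.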

@user12986714
Contributor Author

There are some issues with my second commit. The regex won't work as intended and the line was too long. Decided to revert the second commit.

@user12986714
Contributor Author

user12986714 commented May 16, 2020

@makyen It looks like this regex won't work with negative lookahead. As you said, some magic may be needed in the stripping step, but a temporary solution may be to disable some phrases ending with '>'. (Commit 1c89 won't work with URLs: they are too long, so <a href="blah blah blah very long"> won't be disabled.)

This reverts commit 3f939d4.

Reverting 3f939d4 for regex issues
Reverting 3f939d4 as the regex added in this commit won't work as intended, and the line was too long. Hence this commit adds little value to the project.
@user12986714
Contributor Author

After checking the code base, I found that using that regex in stripping is not a trivial task for me without a major refactoring of the entire stripping/spam detection logic. It may become possible when resolving issue #2500, in which refactoring is proposed. (@makyen)

@makyen
Contributor

makyen commented May 18, 2020

Closed by request per chat.

@makyen makyen closed this May 18, 2020