Ignore code blocks for English-text-on-localized-site #3853

user12986714 · 2020-05-16T16:06:57Z

According to issue #3844

Changed the regex so that code blocks are excluded when evaluating whether a post on a localized site contains English text.

Warning: the updated regex excludes every pattern of the form "sometext>". Spammers may be able to exploit this by appending a ">" to a blacklisted word to circumvent the blacklist.
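To make the concern concrete, here is a minimal sketch of the bypass. It does not use the regex from this PR: `EXCLUDE_TAG_LIKE`, `BLACKLIST`, and the word `cheappills` are hypothetical stand-ins, assumed only to show what happens when anything ending in ">" is dropped before a blacklist check runs.

```python
import re

# Hypothetical stand-in for the exclusion described above: drop any run of
# non-whitespace characters that ends in ">". This is NOT the regex from the PR.
EXCLUDE_TAG_LIKE = re.compile(r"\S*>")

# Hypothetical blacklist entry, for illustration only.
BLACKLIST = re.compile(r"\bcheappills\b", re.IGNORECASE)


def visible_text(text: str) -> str:
    """Return the text that is left after the tag-like exclusion."""
    return EXCLUDE_TAG_LIKE.sub(" ", text)


def is_flagged(text: str) -> bool:
    """Run the blacklist only on what survives the exclusion."""
    return BLACKLIST.search(visible_text(text)) is not None


print(is_flagged("buy cheappills today"))   # True
print(is_flagged("buy cheappills> today"))  # False: the appended ">" hides the word
```

If the exclusion only applied to recognized HTML tags rather than to any text ending in ">", the second call would presumably still be flagged.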

makyen · 2020-05-16T16:32:57Z

Just FYI: the following regex will match valid HTML tags in HTML received from SE (and then pre-processed by SD):

/<((?:\/?(?:b|blockquote|code|del|dd|dl|dt|em|h[123]|i|kbd|li|p|s|sup|sub|strong|strike|ul|br|hr))|(?:\/(?:a|div|img|ol|pre|span))|(?:a|div|img|ol|pre|span)\b.*?)\s*\/?>/

Note that this is written as a JavaScript regex literal, so you'll want to drop the / at the start and end.

The above regex could be used to strip the HTML tags from the text prior to running the detection. You may need to look at the interaction of how we strip code blocks. There was an issue in how that was done for another detection. I'd need to double check...
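As a rough sketch of that idea in Python, the pattern above can be compiled with the `/` delimiters dropped and used to remove tags before any other check runs. The names `SE_HTML_TAG` and `strip_tags` are made up for this example and are not existing SD functions; where such a step would fit into the actual detection code is not shown here.

```python
import re

# The pattern from the comment above, with the JavaScript "/" delimiters
# dropped. It is split across lines only to keep each line short.
SE_HTML_TAG = re.compile(
    r"<((?:\/?(?:b|blockquote|code|del|dd|dl|dt|em|h[123]|i|kbd|li|p|s|sup|sub"
    r"|strong|strike|ul|br|hr))|(?:\/(?:a|div|img|ol|pre|span))"
    r"|(?:a|div|img|ol|pre|span)\b.*?)\s*\/?>"
)


def strip_tags(html: str) -> str:
    """Remove recognized HTML tags but keep the text between them."""
    return SE_HTML_TAG.sub("", html)


sample = '<p>Hello <a href="http://example.com">link</a> and <code>x = 1</code></p>'
print(strip_tags(sample))       # Hello link and x = 1
print(strip_tags("1 < 2 > 0"))  # 1 < 2 > 0 (not a tag, left untouched)
```

Note that this removes only the tags themselves: the contents of a code block are kept, so code blocks would still need to be stripped separately.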

user12986714 · 2020-05-16T16:34:45Z

There are some issues with my second commit: the regex won't work as intended, and the line was too long. I decided to revert it.

user12986714 · 2020-05-16T16:54:20Z

@makyen It looks like this regex won't work with a negative lookahead. As you said, some extra work may be needed in the stripping step, but a temporary solution may be to disable phrases that end with '>'. (Commit 1c89 won't work with URLs: they are too long, so <a href="blah blah blah very long"> won't be disabled.)

user12986714 · 2020-05-16T17:57:47Z

After checking the code base, I found that using that regex in the stripping step is not a trivial task for me: it would need a major refactoring of the entire stripping/spam detection logic. This may become possible when resolving issue #2500, in which a refactoring is proposed. (@makyen)

makyen · 2020-05-18T05:24:35Z

Closed by request per chat.

user12986714 added 2 commits May 16, 2020 12:03

Ignore code blocks for English-text-on-localized-site

1c89c73

Changed regex so any code block will be excluded when evaluating whether there is English text on localized site.

Update regex to exclude url when evaluating English

3f939d4

Updated regex so urls within html tags get excluded when evaluating whether there is English text on localized sites.

Revert "Update regex to exclude url when evaluating English"

793666c

This reverts commit 3f939d4. Reverting 3f939d4 for regex issues. Reverting 3f939d4 as the regex added in this commit won't work as intended, and the line was too long. Hence this commit adds little value to the project.

makyen closed this May 18, 2020
