Uses UTF8MB4 everywhere #8425

Open: 38 commits to be merged into base branch release-3.0

Commits
3a55544
Use InnoDB and dynamic rows for MySQL
Sesquipedalian Jul 4, 2024
9cce008
When installing, require utf8mb4 support for MySQL
Sesquipedalian Jul 4, 2024
d7c25ff
Create tables using utf8mb4 by default for MySQL
Sesquipedalian Jun 29, 2024
8a9edc8
Always use utf8mb4 input/output for MySQL
Sesquipedalian Jul 8, 2024
2957e71
Autodetects $this->character_set and $this->mb4 in DB APIs
Sesquipedalian Jun 29, 2024
4f21987
Rewrites the ConvertUtf8 upgrader step for mb4 support
Sesquipedalian Jul 6, 2024
c5e5ce9
Removes obsolete character set variables and related logic
Sesquipedalian Jul 4, 2024
5578ae9
Moves Utils::fixUtf8Mb4() to DatabaseApi::fix_mb4()
Sesquipedalian Jun 28, 2024
a5b106f
Transparently decodes entities for four-byte Unicode in Db\APIs\MySQL
Sesquipedalian Feb 2, 2025
5c86307
Merge branch 'release-3.0' into 3.0/utf8mb4
Sesquipedalian Feb 3, 2025
28b8f46
Handles non-existent databases in Db::detect_charset()
Sesquipedalian Feb 4, 2025
84f6dd7
Merge branch 'release-3.0' into 3.0/utf8mb4
Sesquipedalian Feb 4, 2025
a37b182
Merge branch 'release-3.0' into 3.0/utf8mb4
Sesquipedalian Feb 6, 2025
acf97e4
Handles non-existent databases in Db::detect_charset(), for real
Sesquipedalian Feb 6, 2025
cf44f8a
Fixes an incorrect variable name in upgrader's ConvertUtf8()
Sesquipedalian Feb 6, 2025
6135983
Don't try to convert charsets of columns that don't have them
Sesquipedalian Feb 6, 2025
7ddc08c
Utils::entityDecode() doesn't need this param anymore
Sesquipedalian Feb 6, 2025
668dae1
Adds missing quotes around default values in ConvertUtf8()
Sesquipedalian Feb 7, 2025
481c56b
Don't skip character 0xFF in source charset when converting to UTF-8
Sesquipedalian Feb 7, 2025
fa67ab2
Uses Db::change_column() in ConvertUtf8()
Sesquipedalian Feb 10, 2025
15bf23e
Merge branch 'release-3.0' into 3.0/utf8mb4
Sesquipedalian Feb 10, 2025
5d2966a
Convert to InnoDB before making other database changes
Sesquipedalian Feb 10, 2025
1f3153f
Merge branch 'release-3.0' into 3.0/utf8mb4
Sesquipedalian Feb 12, 2025
7cda44e
Temp fix for a weird error in the upgrader
Sesquipedalian Feb 16, 2025
4c91b12
Merge branch 'release-3.0' into 3.0/utf8mb4
Sesquipedalian Feb 16, 2025
98dab02
Only redirect to HTTPS if $_SERVER['REQUEST_URL'] starts with http://
Sesquipedalian Feb 16, 2025
747066a
Don't reconvert to utf8mb4 unnecessarily
Sesquipedalian Feb 16, 2025
8e5735a
Don't get stuck waiting for conversion to utf8mb4 to complete
Sesquipedalian Feb 17, 2025
974367a
Fixes user language lookup in upgrader
Sesquipedalian Feb 17, 2025
e4690d3
Populates calendar table's uid column in the SQL files
Sesquipedalian Feb 17, 2025
9fc4a2a
Don't get stuck waiting for ConvertToUtf8(), for real this time
Sesquipedalian Feb 17, 2025
df11e46
Fixes error when there are security warnings but not about files
Sesquipedalian Feb 17, 2025
dc702ae
Changes 'utf8mb3' to 'UTF-8' in param for mb_convert_encoding()
Sesquipedalian Feb 17, 2025
4eb2a2d
Don't clobber the privacy policy when upgrading to 3.0
Sesquipedalian Feb 17, 2025
acd7a9c
Don't set spurious previousCharacterSet during 3.0 → 3.0 upgrades
Sesquipedalian Feb 17, 2025
b7db1fe
Changes all utf8* variants to UTF-8 in param for mb_convert_encoding()
Sesquipedalian Feb 17, 2025
6952030
Don't perform byte-level conversion if already in UTF-8
Sesquipedalian Feb 17, 2025
14ea6ae
Be explicit about utf8mb3 in $charset_maps
Sesquipedalian Feb 18, 2025
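
Note on the last few commits: MySQL character set names such as utf8mb3 and utf8mb4 are MySQL-specific and are not encoding names for PHP's mb_convert_encoding(), which is presumably why those commits substitute 'UTF-8'. A minimal sketch of that kind of normalization follows. The helper name mysql_charset_to_mb_encoding() and the two-entry map are assumptions made for illustration; they are not taken from this PR's $charset_maps.

```php
<?php
// Hypothetical helper (not from the PR): translate a MySQL character set
// name into an encoding name accepted by PHP's mbstring functions.
function mysql_charset_to_mb_encoding(string $charset): ?string
{
    $charset = strtolower($charset);

    // utf8, utf8mb3 and utf8mb4 are all plain UTF-8 as far as mbstring cares.
    if (in_array($charset, ['utf8', 'utf8mb3', 'utf8mb4'], true)) {
        return 'UTF-8';
    }

    // Illustrative subset only; a real map would cover many more charsets.
    $map = [
        'latin1' => 'Windows-1252', // MySQL's latin1 is cp1252, not ISO-8859-1
        'latin2' => 'ISO-8859-2',
    ];

    // Unknown charsets yield null so the caller can decide what to do.
    return $map[$charset] ?? null;
}

// Convert legacy latin1 bytes ("café" with a single-byte é) to UTF-8.
$legacy = "caf\xE9";
$utf8 = mb_convert_encoding($legacy, 'UTF-8', mysql_charset_to_mb_encoding('latin1'));
```

Returning null for unknown names is a design choice of this sketch; it keeps a MySQL-only name from being passed straight through to mbstring.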
2 changes: 0 additions & 2 deletions Languages/en_US/General.php
@@ -12,8 +12,6 @@
//https://developers.google.com/recaptcha/docs/language
$txt['lang_recaptcha'] = 'en';

// Ensure you remember to use uppercase for character set strings.
$txt['lang_character_set'] = 'UTF-8';
// Character set right to left? 0 = ltr; 1 = rtl
$txt['lang_rtl'] = '0';

3 changes: 1 addition & 2 deletions Languages/en_US/Install.php
@@ -2,8 +2,7 @@

// Version: 3.0 Alpha 2; Install

// These should be the same as those in index.language.php.
$txt['lang_character_set'] = 'UTF-8';
// This should be the same as the one in General.php.
$txt['lang_rtl'] = '0';

$txt['install_step_welcome'] = 'Welcome';
2 changes: 1 addition & 1 deletion Sources/Actions/Admin/ACP.php
@@ -1059,7 +1059,7 @@ public static function saveSettings(array &$config_vars): void

// Fix the darn stupid cookiename! (more may not be allowed, but these for sure!)
if (isset($_POST['cookiename'])) {
$_POST['cookiename'] = preg_replace('~[,;\s\.$]+~' . (Utils::$context['utf8'] ? 'u' : ''), '', $_POST['cookiename']);
$_POST['cookiename'] = preg_replace('~[,;\s\.$]+~u', '', $_POST['cookiename']);
}

// Fix the forum's URL if necessary.
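
One consequence of hard-coding the u modifier, as this hunk and several later ones do: PCRE then requires the subject to be valid UTF-8, and preg_replace() returns null instead of a string when it is not. A small sketch of that behaviour, using a hypothetical wrapper named strip_cookie_chars() around the same pattern:

```php
<?php
// Hypothetical wrapper around the pattern used in saveSettings().
function strip_cookie_chars(string $name): ?string
{
    // With the u modifier, pattern and subject are both treated as UTF-8.
    return preg_replace('~[,;\s\.$]+~u', '', $name);
}

// Commas, semicolons, whitespace, dots and dollar signs are removed.
$clean = strip_cookie_chars('SMF Cookie;1.0');

// A lone 0xFF byte is never valid UTF-8, so the call fails and yields null.
$failed = strip_cookie_chars("bad\xFFname");
$is_utf8_error = preg_last_error() === PREG_BAD_UTF8_ERROR;
```

Callers that cannot guarantee valid UTF-8 input may want to check for null before using the result.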
22 changes: 15 additions & 7 deletions Sources/Actions/Admin/Languages.php
@@ -887,7 +887,11 @@ function ($val1, $val2) {
$replace_array = [];

foreach ($primary_settings as $setting => $type) {
$replace_array['~\$txt\[\'' . $setting . '\'\]\s*=\s*[^\r\n]+~'] = '$txt[\'' . $setting . '\'] = ' . ($type === 'bool' ? (!empty($_POST[$setting]) ? 'true' : 'false') : '\'' . ($setting = 'native_name' ? htmlentities(Utils::htmlspecialcharsDecode($_POST[$setting]), ENT_QUOTES, Utils::$context['character_set']) : preg_replace('~[^\w-]~i', '', $_POST[$setting])) . '\'') . ';';
if ($setting === 'lang_character_set') {
$replace_array['/\$txt\[\'' . $setting . '\'\]\s*=\s*[^\r\n]+\R/u'] = '';
} else {
$replace_array['~\$txt\[\'' . $setting . '\'\]\s*=\s*[^\r\n]+~u'] = '$txt[\'' . $setting . '\'] = ' . ($type === 'bool' ? (!empty($_POST[$setting]) ? '\'1\'' : '\'0\'') : '\'' . ($setting === 'native_name' ? htmlentities(Utils::htmlspecialcharsDecode($_POST[$setting]), ENT_QUOTES, 'UTF-8') : preg_replace('~[^\w-]~i', '', $_POST[$setting])) . '\'') . ';';
}
}

$current_data = preg_replace(array_keys($replace_array), array_values($replace_array), $current_data);
@@ -910,6 +914,10 @@ function ($val1, $val2) {
Utils::$context['primary_settings']['name'] = Utils::ucwords(strtr($lang_id, ['_' => ' ', '-utf8' => '']));

foreach ($primary_settings as $setting => $type) {
if ($setting === 'lang_character_set') {
continue;
}

Utils::$context['primary_settings'][$setting] = [
'label' => str_replace('lang_', '', $setting),
'value' => $type === 'bool' ? !empty(Lang::$txt[$setting]) : Lang::$txt[$setting],
@@ -1005,12 +1013,12 @@ function ($val1, $val2) {
// Also, remove any lines for uneditable variables like $forum_copyright from the working data.
$entries = [];

foreach (preg_split('~^(?=\$(?:' . implode('|', $string_types) . ')\[\'([^\n]+?)\'\])~m' . (Utils::$context['utf8'] ? 'u' : ''), preg_replace('~\s*\n(\$(?!(?:' . implode('|', $string_types) . '))[^\n]*)~', '', file_get_contents($current_file))) as $blob) {
foreach (preg_split('~^(?=\$(?:' . implode('|', $string_types) . ')\[\'([^\n]+?)\'\])~mu', preg_replace('~\s*\n(\$(?!(?:' . implode('|', $string_types) . '))[^\n]*)~', '', file_get_contents($current_file))) as $blob) {
// Comment lines at the end of the blob can make terrible messes
$blob = preg_replace('~(\n[ \t]*//[^\n]*)*$~' . (Utils::$context['utf8'] ? 'u' : ''), '', $blob);
$blob = preg_replace('~(\n[ \t]*//[^\n]*)*$~u', '', $blob);

// Extract the variable
if (preg_match('~^\$(' . implode('|', $string_types) . ')\[\'([^\n]+?)\'\](?:\[\'?([^\n]+?)\'?\])?\s?=\s?(.+);([ \t]*(?://[^\n]*)?)$~ms' . (Utils::$context['utf8'] ? 'u' : ''), strtr($blob, ["\r" => '']), $matches)) {
if (preg_match('~^\$(' . implode('|', $string_types) . ')\[\'([^\n]+?)\'\](?:\[\'?([^\n]+?)\'?\])?\s?=\s?(.+);([ \t]*(?://[^\n]*)?)$~msu', strtr($blob, ["\r" => '']), $matches)) {
// If no valid subkey was found, we need it to be explicitly null
$matches[3] = isset($matches[3]) && $matches[3] !== '' ? $matches[3] : null;

@@ -1093,7 +1101,7 @@ function ($val1, $val2) {
# Followed by a comma or the end of the string
(?=,|$)

/x' . (Utils::$context['utf8'] ? 'u' : ''), $entryValue['entry'], $matches);
/xu', $entryValue['entry'], $matches);

if (empty($m)) {
continue;
@@ -1325,7 +1333,7 @@ function ($val1, $val2) {
// Apply our changes.
foreach ($final_saves as $save) {
if (!empty($save['is_regex'])) {
$file_contents = preg_replace('~' . $save['find'] . '~' . (Utils::$context['utf8'] ? 'u' : ''), $save['replace'], $file_contents);
$file_contents = preg_replace('~' . $save['find'] . '~u', $save['replace'], $file_contents);
} else {
$file_contents = str_replace($save['find'], $save['replace'], $file_contents);
}
@@ -1567,7 +1575,7 @@ public static function list_getLanguages(): array
$languages[$lang['filename']] = [
'id' => $lang['filename'],
'count' => 0,
'char_set' => $txt['lang_character_set'],
'char_set' => 'UTF-8',
'default' => Lang::$default == $lang['filename'] || (Lang::$default == '' && $lang['filename'] == 'en_US'),
'locale' => $txt['lang_locale'],
'name' => $lang['name'],
7 changes: 1 addition & 6 deletions Sources/Actions/Admin/Maintenance.php
@@ -170,7 +170,7 @@ public function routine(): void
public function database(): void
{
// Show some conversion options?
Utils::$context['convert_entities'] = isset(Config::$modSettings['global_character_set']) && Config::$modSettings['global_character_set'] === 'UTF-8';
Utils::$context['convert_entities'] = true;

if (Config::$db_type == 'mysql') {
$colData = Db::$db->list_columns('{db_prefix}messages', true);
@@ -1024,11 +1024,6 @@ public function entitiesToUnicode(): void
{
User::$me->isAllowedTo('admin_forum');

// Check to see if UTF-8 is currently the default character set.
if (Config::$modSettings['global_character_set'] !== 'UTF-8') {
ErrorHandler::fatalLang('entity_convert_only_utf8');
}

// Some starting values.
Utils::$context['table'] = empty($_REQUEST['table']) ? 0 : (int) $_REQUEST['table'];
Utils::$context['start'] = empty($_REQUEST['start']) ? 0 : (int) $_REQUEST['start'];
10 changes: 0 additions & 10 deletions Sources/Actions/AttachmentDownload.php
@@ -23,7 +23,6 @@
use SMF\Config;
use SMF\Db\DatabaseApi as Db;
use SMF\IntegrationHook;
use SMF\Lang;
use SMF\User;
use SMF\Utils;

@@ -320,15 +319,6 @@ public function execute(): void
*/
protected function __construct()
{
// Some defaults that we need.
if (!isset(Utils::$context['character_set'])) {
Utils::$context['character_set'] = empty(Config::$modSettings['global_character_set']) ? (empty(Lang::$txt['lang_character_set']) ? 'ISO-8859-1' : Lang::$txt['lang_character_set']) : Config::$modSettings['global_character_set'];
}

if (!isset(Utils::$context['utf8'])) {
Utils::$context['utf8'] = Utils::$context['character_set'] === 'UTF-8';
}

// Which attachment was requested?
$this->id = $_REQUEST['attach'] = isset($_REQUEST['attach']) ? (int) $_REQUEST['attach'] : (int) (isset($_REQUEST['id']) ? (int) $_REQUEST['id'] : 0);

2 changes: 1 addition & 1 deletion Sources/Actions/AttachmentUpload.php
@@ -558,7 +558,7 @@ protected function sendResponse(): void
}

// Set the header.
header('content-type: application/json; charset=' . Utils::$context['character_set'] . '');
header('content-type: application/json; charset=UTF-8');

echo Utils::jsonEncode($this->_response ? $this->_response : []);

6 changes: 3 additions & 3 deletions Sources/Actions/Feed.php
@@ -535,11 +535,11 @@ public function emit(): void

$filename[] = $this->format;

$filename = preg_replace(Utils::$context['utf8'] ? '/[^\p{L}\p{M}\p{N}\-]+/u' : '/[\s_,.\/\\;:\'<>?|\[\]{}~!@#$%^&*()=+`]+/', '_', str_replace('"', '', Utils::htmlspecialcharsDecode(strip_tags(implode('-', $filename)))));
$filename = preg_replace('/[^\p{L}\p{M}\p{N}\-]+/u', '_', str_replace('"', '', Utils::htmlspecialcharsDecode(strip_tags(implode('-', $filename)))));

$file = [
'filename' => $filename . '.xml',
'mime_type' => self::MIME_TYPES[$this->format] . '; charset=' . (empty(Utils::$context['character_set']) ? 'UTF-8' : Utils::$context['character_set']),
'mime_type' => self::MIME_TYPES[$this->format] . '; charset=UTF-8',
'content' => implode('', $this->xml),
'disposition' => isset($_GET['download']) ? 'attachment' : 'inline',
];
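
With the non-UTF-8 branch gone, the feed filename is always reduced to Unicode letters, marks, numbers, and hyphens. A sketch of what that pattern does, wrapped in a hypothetical function feed_filename_part() that leaves out the tag stripping and entity decoding done in emit():

```php
<?php
// Hypothetical wrapper for illustration; only the regex comes from the hunk.
function feed_filename_part(string $text): ?string
{
    // Every run of characters that are not letters, combining marks,
    // numbers, or hyphens collapses into a single underscore.
    return preg_replace('/[^\p{L}\p{M}\p{N}\-]+/u', '_', $text);
}

// Accented letters survive; punctuation and spaces become underscores.
$name = feed_filename_part("F\u{F3}rum: not\u{ED}cias & v\u{ED}deos 2024");
```

Because \p{L} covers letters in any script, non-Latin forum names keep their characters instead of being reduced to underscores.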
@@ -2765,7 +2765,7 @@ public static function build(string $format, array $data, array $metadata, strin
Utils::$context['feed'] = [];

// First, output the xml header.
Utils::$context['feed']['header'] = '<?xml version="1.0" encoding="' . Utils::$context['character_set'] . '"?' . '>' . ($doctype !== '' ? "\n" . trim($doctype) : '');
Utils::$context['feed']['header'] = '<' . '?xml version="1.0" encoding="UTF-8"?' . '>' . ($doctype !== '' ? "\n" . trim($doctype) : '');

// Are we outputting an rss feed or one with more information?
if ($format == 'rss' || $format == 'rss2') {
2 changes: 1 addition & 1 deletion Sources/Actions/Login2.php
@@ -519,7 +519,7 @@ protected function checkPasswordFallbacks(): bool
$other_passwords[] = sha1(strtolower(User::$profiles[User::$my_id]['member_name']) . Utils::htmlspecialcharsDecode($_POST['passwrd']));

// Perhaps we converted to UTF-8 and have a valid password being hashed differently.
if (Utils::$context['character_set'] == 'UTF-8' && !empty(Config::$modSettings['previousCharacterSet']) && Config::$modSettings['previousCharacterSet'] != 'utf8') {
if (!empty(Config::$modSettings['previousCharacterSet']) && Config::$modSettings['previousCharacterSet'] != 'utf8') {
// Try iconv first, for no particular reason.
if (function_exists('iconv')) {
$other_passwords['iconv'] = sha1(strtolower(iconv('UTF-8', Config::$modSettings['previousCharacterSet'], User::$profiles[User::$my_id]['member_name'])) . Utils::htmlspecialcharsDecode(iconv('UTF-8', Config::$modSettings['previousCharacterSet'], $_POST['passwrd'])));
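
The fallback in this hunk exists because a password hashed before the forum was converted to UTF-8 was hashed over bytes in the old character set. A minimal sketch of the idea, assuming a hypothetical helper legacy_sha1() and using mb_convert_encoding() in place of the iconv() call shown in the hunk (the Utils::htmlspecialcharsDecode() step is left out):

```php
<?php
// Hypothetical helper, not part of the PR.
function legacy_sha1(string $member_name, string $password, string $previous_charset): string
{
    // Re-encode the UTF-8 input into the charset the forum used originally,
    // so the hash is computed over the same bytes as before the conversion.
    $name = mb_convert_encoding($member_name, $previous_charset, 'UTF-8');
    $pass = mb_convert_encoding($password, $previous_charset, 'UTF-8');

    return sha1(strtolower($name) . $pass);
}

// "José" / "päss" as they would have been hashed on an ISO-8859-1 forum.
$hash = legacy_sha1("Jos\u{E9}", "p\u{E4}ss", 'ISO-8859-1');
```

The resulting hash differs from one computed directly over the UTF-8 bytes, which is why the fallback is only attempted when previousCharacterSet is set.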
2 changes: 1 addition & 1 deletion Sources/Actions/Memberlist.php
@@ -293,7 +293,7 @@ public function all(): void
}

if (!is_numeric($_REQUEST['start'])) {
if (preg_match('~^[^\'\\\\/]~' . (Utils::$context['utf8'] ? 'u' : ''), Utils::strtolower($_REQUEST['start']), $match) === 0) {
if (preg_match('~^[^\'\\\\/]~u', Utils::strtolower($_REQUEST['start']), $match) === 0) {
ErrorHandler::fatal('Are you a wannabe hacker?', false);
}

12 changes: 2 additions & 10 deletions Sources/Actions/RequestMembers.php
@@ -60,9 +60,7 @@ public function execute(): void
{
User::$me->checkSession('get');

if (Utils::$context['utf8'] || function_exists('mb_convert_encoding')) {
header('content-type: text/plain; charset=UTF-8');
}
header('content-type: text/plain; charset=UTF-8');

$request = Db::$db->query(
'',
@@ -81,15 +79,9 @@ public function execute(): void
);

while ($row = Db::$db->fetch_assoc($request)) {
if (!Utils::$context['utf8']) {
if (($temp = @mb_convert_encoding($row['real_name'], 'UTF-8', Utils::$context['character_set'])) !== false) {
$row['real_name'] = $temp;
}
}

$row['real_name'] = strtr($row['real_name'], ['&amp;' => '&#038;', '&lt;' => '&#060;', '&gt;' => '&#062;', '&quot;' => '&#034;']);

$row['real_name'] = Utils::entityDecode($row['real_name'], true);
$row['real_name'] = Utils::entityDecode($row['real_name']);

echo $row['real_name'], "\n";
}
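
The loop above turns stored HTML entities back into real characters before sending plain text. Utils::entityDecode() is SMF's own method and is not reproduced here; as a rough stand-in, PHP's built-in html_entity_decode() shows the effect for a four-byte character stored as a numeric entity:

```php
<?php
// Stand-in for illustration only; SMF's Utils::entityDecode() may differ.
$stored = 'Smiley &#128512;';
$decoded = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// U+1F600 takes four bytes in UTF-8, so it needs utf8mb4 rather than
// MySQL's three-byte utf8mb3 to be stored without entity encoding.
$bytes = strlen("\u{1F600}");
```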
35 changes: 7 additions & 28 deletions Sources/Autolinker.php
@@ -154,13 +154,6 @@ class Autolinker
* Internal properties
*********************/

/**
* @var string
*
* The character encoding being used.
*/
protected string $encoding = 'UTF-8';

/**
* @var bool
*
@@ -258,20 +251,6 @@ public function __construct(bool $only_basic = false)
{
$this->only_basic = $only_basic;

if (!empty(Utils::$context['utf8'])) {
$this->encoding = 'UTF-8';
} else {
$this->encoding = !empty(Config::$modSettings['global_character_set']) ? Config::$modSettings['global_character_set'] : (!empty(Lang::$txt['lang_character_set']) ? Lang::$txt['lang_character_set'] : $this->encoding);

if (in_array($this->encoding, mb_encoding_aliases('UTF-8'))) {
$this->encoding = 'UTF-8';
}
}

if ($this->encoding !== 'UTF-8') {
self::$domain_label_chars = '0-9A-Za-z\-';
}

// In case a mod wants to control behaviour for a special URI scheme.
if (!self::$integrate_autolinker_schemes_done) {
IntegrationHook::call('integrate_autolinker_schemes', [&self::$schemes]);
@@ -409,14 +388,14 @@ public function detectUrls(string $string, bool $plaintext_only = false): array
'((?' . '>' . '[^\[]|\[/?(?!' . $no_autolink_regex . ')' . '|(?1))*)' .
// 4 = Closing BBC markup element.
'(\[/\2\])' .
'~i' . ($this->encoding === 'UTF-8' ? 'u' : ''),
'~iu',
fn($matches) => $matches[1] . str_repeat('x', strlen($matches[3])) . $matches[4],
$string,
);

// Overwrite all BBC markup elements.
$string = preg_replace_callback(
'~\[/?' . Parser::getBBCodeTagsRegex() . '[^\]]*\]~i' . ($this->encoding === 'UTF-8' ? 'u' : ''),
'~\[/?' . Parser::getBBCodeTagsRegex() . '[^\]]*\]~iu',
fn($matches) => str_repeat(' ', strlen($matches[0])),
$string,
);
@@ -430,21 +409,21 @@ public function detectUrls(string $string, bool $plaintext_only = false): array
'((?' . '>' . '[^<]|</?(?!a)' . '|(?1))*)' .
// 3 = Closing 'a' markup element.
'(</a>)' .
'~i' . ($this->encoding === 'UTF-8' ? 'u' : ''),
'~iu',
fn($matches) => $matches[1] . str_repeat('x', strlen($matches[2])) . $matches[3],
$string,
);

// Overwrite all HTML elements.
$string = preg_replace_callback(
'~</?(\w+)\b([^>]*)>~i' . ($this->encoding === 'UTF-8' ? 'u' : ''),
'~</?(\w+)\b([^>]*)>~iu',
fn($matches) => str_repeat(' ', strlen($matches[0])),
$string,
);
}

preg_match_all(
'~' . $this->url_regex . '~i' . ($this->encoding === 'UTF-8' ? 'u' : ''),
'~' . $this->url_regex . '~iu',
$string,
$matches,
PREG_OFFSET_CAPTURE,
@@ -482,7 +461,7 @@ public function detectEmails(string $string, bool $plaintext_only = false): arra
$this->setEmailRegex();

preg_match_all(
'~' . ($plaintext_only ? '(?:^|\s|<br>)\K' : '') . $this->email_regex . '~i' . ($this->encoding === 'UTF-8' ? 'u' : ''),
'~' . ($plaintext_only ? '(?:^|\s|<br>)\K' : '') . $this->email_regex . '~iu',
$string,
$matches,
PREG_OFFSET_CAPTURE,
@@ -834,7 +813,7 @@ protected function setTldRegex(): void
return;
}

if (!$this->only_basic && $this->encoding === 'UTF-8') {
if (!$this->only_basic) {
Url::setTldRegex();
$this->tld_regex = Config::$modSettings['tld_regex'];
} else {