
Indicate discrepancies with Unicode specifications for UTF-16/32 schemes #128571

Open
youkidearitai opened this issue Jan 7, 2025 · 5 comments
Labels
docs Documentation in the Doc dir topic-unicode

Comments


youkidearitai commented Jan 7, 2025

Bug report

Bug description:

b"ab".decode("UTF-16")

Per https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G28070, the UTF-16 encoding scheme does not leave the byte order open: when there is no BOM and no higher-level protocol, UTF-16 is big-endian.

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

However, CPython's actual behavior appears to depend on the CPU architecture.

I tested on x86_64 (WSL Ubuntu) and aarch64 (Raspberry Pi (Raspbian) and macOS).

The x86_64 result is U+6162 and the aarch64 result is U+6261.
I think the byte order should be big-endian for UTF-16.
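For reference, the two byte orders can be forced with the explicit-endian codecs, which behave the same on every platform (a minimal check using only the standard library):

```python
# b"ab" is the byte sequence 0x61 0x62.
data = b"ab"

# The explicit-endian codecs are deterministic regardless of platform:
print(hex(ord(data.decode("utf-16-be"))))  # one code unit 0x6162 (U+6162)
print(hex(ord(data.decode("utf-16-le"))))  # one code unit 0x6261 (U+6261)

# Plain "utf-16" without a BOM is where the reported
# platform dependence comes in.
print(hex(ord(data.decode("utf-16"))))
```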

CPython versions tested on:

3.10, 3.12

Operating systems tested on:

Linux, macOS

@youkidearitai youkidearitai added the type-bug An unexpected behavior, bug, or error label Jan 7, 2025
@picnixz picnixz added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jan 7, 2025
picnixz commented Jan 7, 2025

The codecs docs say:

These constants define various byte sequences, being Unicode byte order marks (BOMs) for several encodings. They are used in UTF-16 and UTF-32 data streams to indicate the byte order used, and in UTF-8 as a Unicode signature. BOM_UTF16 is either BOM_UTF16_BE or BOM_UTF16_LE depending on the platform’s native byte order, BOM is an alias for BOM_UTF16, BOM_LE for BOM_UTF16_LE and BOM_BE for BOM_UTF16_BE. The others represent the BOM in UTF-8 and UTF-32 encodings.

The important part is:

BOM_UTF16 is either BOM_UTF16_BE or BOM_UTF16_LE depending on the platform’s native byte order, BOM is an alias for BOM_UTF16, BOM_LE for BOM_UTF16_LE and BOM_BE for BOM_UTF16_BE.

AFAIU, the BOM is platform-dependent, and UTF-16 uses BOM_UTF16, so it is platform-dependent as well. Finally, this is backed by the following statement (4th paragraph of https://docs.python.org/3/library/codecs.html#encodings-and-unicode):

All of these encodings can only encode 256 of the 1114112 code points defined in Unicode. A simple and straightforward way that can store each Unicode code point, is to store each code point as four consecutive bytes. There are two possibilities: store the bytes in big endian or in little endian order. These two encodings are called UTF-32-BE and UTF-32-LE respectively. Their disadvantage is that if e.g. you use UTF-32-BE on a little endian machine you will always have to swap bytes on encoding and decoding. UTF-32 avoids this problem: bytes will always be in natural endianness
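The platform dependence of the BOM constants can be checked directly (a small sketch, standard library only):

```python
import codecs
import sys

# BOM_UTF16 resolves to the LE or BE mark according to the
# platform's native byte order, as the codecs docs describe.
native = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
print(codecs.BOM_UTF16 == native)       # True
print(codecs.BOM == codecs.BOM_UTF16)   # True: BOM is an alias for BOM_UTF16
```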

Thus, AFAIK, the behaviour is correct. But we could definitely improve the docs so that this information is not buried across multiple pages.

picnixz commented Jan 7, 2025

cc @serhiy-storchaka

@picnixz picnixz added the pending The issue will be closed if no feedback is provided label Jan 7, 2025
picnixz commented Jan 7, 2025

I also don't think we need to match the Unicode specs' UTF-16 encoding scheme exactly. Even if we wanted to, it would cause a lot of breaking changes, so I'm not sure we'll ever be able to change this. An alternative is to create a new encoding, say utf16-ces (for "UTF-16 canonical encoding scheme"), that exactly matches the specification.

serhiy-storchaka (Member) commented:

I'm afraid this is a case where the Unicode specification contradicts practice, and Python chose to follow practice. For example, on Linux, UTF-16 without a BOM uses little-endian byte order on a little-endian machine.

$ echo abc | iconv -t utf-16le | iconv -f utf-16
abc
$ echo abc | iconv -t utf-16be | iconv -f utf-16
愀戀挀਀

I think that Windows also uses little-endian, as it is natural on little-endian machines.

Changing UTF-16 now would be a major breaking change. But we should clarify more explicitly the difference with the Unicode specification in the documentation.
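For comparison with the iconv behaviour above: CPython's "utf-16" encoder always writes a BOM, so round-trips through the BOM-dependent codec are safe even though the no-BOM decode order is native (a quick check, standard library only):

```python
import codecs

encoded = "abc".encode("utf-16")
# The encoder prepends a BOM, so the byte order is self-describing
# and the round-trip works on any platform.
print(encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(encoded.decode("utf-16"))  # abc
```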

picnixz commented Jan 7, 2025

But we should clarify more explicitly the difference with the Unicode specification in the documentation.

I will categorize this issue as a doc issue instead.

@picnixz picnixz added docs Documentation in the Doc dir and removed type-bug An unexpected behavior, bug, or error interpreter-core (Objects, Python, Grammar, and Parser dirs) pending The issue will be closed if no feedback is provided labels Jan 7, 2025
@picnixz picnixz changed the title b"ab".decode("UTF-16") result is depends on CPU architecture endians Indicate discrepancies with Unicode specifications for BOM-dependent schemes Jan 7, 2025
@picnixz picnixz changed the title Indicate discrepancies with Unicode specifications for BOM-dependent schemes Indicate discrepancies with Unicode specifications for UTF-16/32 schemes Jan 7, 2025
Projects
Status: Todo
Development

No branches or pull requests

3 participants