Indicate discrepancies with Unicode specifications for UTF-16/32 schemes #128571
Comments
The codecs docs say:
The important part is:
AFAIU,
Thus, AFAIK, the behaviour is correct. But we could definitely improve the docs so that this information is not buried across multiple pages.
I also don't think we need to match |
I'm afraid this is a case where the Unicode specification contradicts practice, and Python chose to follow practice. For example, on Linux, UTF-16 without a BOM is read as little-endian on a little-endian machine:

$ echo abc | iconv -t utf-16le | iconv -f utf-16
abc
$ echo abc | iconv -t utf-16be | iconv -f utf-16
愀戀挀

I think that Windows also uses little-endian, as it is natural on little-endian machines. Changing UTF-16 now would be a major breaking change. But we should document the difference from the Unicode specification more explicitly.
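The same practice can be checked from Python. This is a minimal sketch, assuming CPython's generic `utf-16` codec, which falls back to the host byte order when the input has no BOM; the sample string and the `native` variable are illustrative and do not come from the thread.

```python
import codecs
import sys

data = "abc".encode("utf-16-le")  # little-endian bytes, no BOM

# Without a BOM, the generic "utf-16" codec uses the host byte order.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
assert data.decode("utf-16") == data.decode(native)

# With a BOM, the generic codec follows the BOM regardless of the host.
assert (codecs.BOM_UTF16_BE + "abc".encode("utf-16-be")).decode("utf-16") == "abc"
assert (codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")).decode("utf-16") == "abc"

# When encoding, the generic codec writes a native-order BOM first.
encoded = "abc".encode("utf-16")
assert encoded.startswith(codecs.BOM_UTF16)
```

Code that needs a fixed byte order on every machine can name `utf-16-le` or `utf-16-be` explicitly instead of relying on the fallback.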
I will categorize this issue as a doc issue instead.
Bug report
Bug description:
According to https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G28070, when nothing indicates the byte order of UTF-16 (there is no BOM and no higher-level protocol), UTF-16 is big-endian.
However, CPython's actual behavior appears to depend on the CPU architecture.
I tested on x86_64 (WSL Ubuntu) and on aarch64 (Raspberry Pi with Raspbian, and macOS).
The x86_64 result is 慢 (U+6162); the aarch64 result is 扡 (U+6261). I think the byte order of UTF-16 should be big-endian.
CPython versions tested on:
3.10, 3.12
Operating systems tested on:
Linux, macOS
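The two code points in the report correspond to the bytes 0x61 0x62 read in each byte order. The sketch below assumes the input was `b"ab"` (the original reproducer is not shown here) and uses the explicit `utf-16-be` and `utf-16-le` codecs, which do not depend on the host. `decode_utf16_spec_default` is a hypothetical helper, not a CPython API, illustrating the big-endian default described in the Unicode specification.

```python
import codecs

raw = b"ab"  # bytes 0x61 0x62; assumed input, not taken from the report

# Explicit codecs give the same answer on every machine.
big = raw.decode("utf-16-be")     # 0x61 is the high byte
little = raw.decode("utf-16-le")  # 0x62 is the high byte
print(f"U+{ord(big):04X}", f"U+{ord(little):04X}")  # U+6162 U+6261


def decode_utf16_spec_default(data: bytes) -> str:
    """Hypothetical helper: honor a BOM, otherwise assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")
```

For example, `decode_utf16_spec_default(b"ab")` returns 慢 (U+6162) on any architecture, because the fallback never consults the host byte order.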