Strings in Emacs Lisp are somewhat difficult to deal with, for the following reasons:
-
They can be either "unibyte" strings, which correspond to byte vectors in Scheme, or "multibyte" strings, which can handle Unicode. Whether a string is considered unibyte or multibyte depends on its contents; see Section 2.3.8.2, "Non-ASCII Characters in Strings" in the Emacs Lisp manual for details.
-
Whether a string is considered unibyte or multibyte depends not only on its contents, but also on the source it is read from.
-
A multibyte string can include characters outside of the Unicode codepoint range. This happens, for instance, when the string includes a hexadecimal or octal escape interpreted as a single byte, potentially violating the encoding rules of the multibyte source.
-
Emacs Lisp string syntax supports a multitude of escaping modes, some of which originate from representing keyboard event sequences in strings. Using these "keyboard-oriented" escapes inside strings is explicitly discouraged in the Emacs Lisp manual.
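The point about characters outside the Unicode range can be made concrete with a small sketch. It assumes the convention documented in the Emacs Lisp manual, where raw bytes 0x80 to 0xFF inside a multibyte string are represented by the character codes #x3fff80 to #x3fffff. The function name emacs_raw_byte_code is made up for this illustration and is not part of lexpr or Emacs.

```rust
/// Largest valid Unicode codepoint.
const UNICODE_MAX: u32 = 0x10FFFF;

/// Hypothetical helper: map a raw byte to the character code Emacs is
/// assumed to use for it inside a multibyte string. ASCII bytes map to
/// themselves; bytes 0x80..=0xFF map to 0x3FFF80..=0x3FFFFF.
fn emacs_raw_byte_code(byte: u8) -> u32 {
    if byte < 0x80 {
        u32::from(byte)
    } else {
        0x3FFF00 + u32::from(byte)
    }
}

fn main() {
    let code = emacs_raw_byte_code(0xFC);
    println!("{:#x}", code);

    // The resulting code lies beyond the Unicode range, so it cannot be
    // stored in a Rust `char`.
    assert!(code > UNICODE_MAX);
    assert!(char::from_u32(code).is_none());
}
```

Because Rust's char and String types only admit valid Unicode scalar values, such a "character" has no direct counterpart on the Rust side.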
The way lexpr deals with this complexity is the following:
-
The input source is always considered to be "multibyte" using the UTF-8 encoding; other encodings are not supported.
-
Mixing non-ASCII UTF-8 characters, whether written directly in the input or represented using escape sequences, with hexadecimal or octal escape sequences that result in a single byte outside of the ASCII range will result in a parse error. For instance, the following string cannot be parsed by lexpr: "\xFC\N{U+203D}". Emacs, however, would parse this as a string containing the "character" sequence #x3ffffc, #x203d. Note that the first "character" is not a valid Unicode codepoint.
-
Strings containing only ASCII characters and at least one single-byte hexadecimal or octal escape will be parsed as byte vectors instead of strings. This mirrors the Emacs Lisp rules for when a string will be considered to be "unibyte".
When producing S-expression text, byte vectors will always be represented as a sequence of octal-escaped bytes.
-
The escaping styles supported by lexpr are:
  - Hexadecimal (\xN...) and octal (\N...)
  - Unicode (\uNNNN, \U00NNNNNN)
  - Named Unicode (\N{U+X...}). Note that the syntax that refers to codepoints using their full name (e.g. \N{LATIN SMALL LETTER A WITH GRAVE}) is deliberately not supported.
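The classification rules listed above can be sketched in a few lines of Rust. This is an illustrative model only, not lexpr's actual implementation: the types Piece, Parsed and MixedEncoding and the function classify are hypothetical names, and the model assumes escape processing has already happened. It also does not distinguish how a character was written (literally or via a Unicode escape), and treating ASCII-range byte escapes next to non-ASCII characters as plain characters is an assumption of this sketch.

```rust
/// One element of a string literal after escape processing (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Piece {
    /// A character written literally or via a Unicode-style escape.
    Char(char),
    /// A single byte produced by a hexadecimal or octal escape.
    Byte(u8),
}

/// Outcome of classifying a literal (hypothetical).
#[derive(Debug, PartialEq)]
enum Parsed {
    Text(String),
    Bytes(Vec<u8>),
}

/// Error for literals mixing non-ASCII characters with non-ASCII bytes.
#[derive(Debug, PartialEq)]
struct MixedEncoding;

fn classify(pieces: &[Piece]) -> Result<Parsed, MixedEncoding> {
    let has_non_ascii_char = pieces
        .iter()
        .any(|p| matches!(p, Piece::Char(c) if !c.is_ascii()));
    let has_high_byte = pieces
        .iter()
        .any(|p| matches!(p, Piece::Byte(b) if *b >= 0x80));
    let has_byte = pieces.iter().any(|p| matches!(p, Piece::Byte(_)));

    if has_non_ascii_char && has_high_byte {
        // E.g. "\xFC\N{U+203D}": rejected instead of guessing an encoding.
        return Err(MixedEncoding);
    }
    if has_byte && !has_non_ascii_char {
        // Only ASCII characters plus at least one byte escape: byte vector.
        let bytes: Vec<u8> = pieces
            .iter()
            .map(|p| match *p {
                Piece::Char(c) => c as u8, // lossless: c is ASCII here
                Piece::Byte(b) => b,
            })
            .collect();
        return Ok(Parsed::Bytes(bytes));
    }
    // Everything else is text; any remaining bytes are in the ASCII range.
    let text: String = pieces
        .iter()
        .map(|p| match *p {
            Piece::Char(c) => c,
            Piece::Byte(b) => char::from(b),
        })
        .collect();
    Ok(Parsed::Text(text))
}

fn main() {
    // Models "\xFC\N{U+203D}".
    let mixed = [Piece::Byte(0xFC), Piece::Char('\u{203D}')];
    assert_eq!(classify(&mixed), Err(MixedEncoding));

    // Models "a\xFC".
    let unibyte = [Piece::Char('a'), Piece::Byte(0xFC)];
    assert_eq!(classify(&unibyte), Ok(Parsed::Bytes(vec![b'a', 0xFC])));

    // Models "\u00FC".
    let text = [Piece::Char('\u{FC}')];
    assert_eq!(classify(&text), Ok(Parsed::Text("\u{FC}".to_string())));
}
```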
It is expected that these restrictions will not be an impediment when using S-expressions as a data exchange format between Emacs Lisp and Rust programs. In short, S-expressions produced by Rust should always be parsable by Emacs, and the other direction should work as long as no strings with non-Unicode "characters" are involved.
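As an illustration of the Rust-to-Emacs direction, the following sketch shows one way a byte vector could be written as a sequence of octal-escaped bytes. The function bytes_to_elisp_literal is a hypothetical helper, not lexpr's API, and the choice of always emitting exactly three octal digits is an assumption made here so that a following digit can never be read as part of the escape.

```rust
/// Hypothetical helper: render a byte vector as an Emacs Lisp string
/// literal in which every byte is written as a three-digit octal escape.
fn bytes_to_elisp_literal(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 4 + 2);
    out.push('"');
    for byte in bytes {
        // Fixed width keeps the escape boundaries unambiguous.
        out.push_str(&format!("\\{:03o}", byte));
    }
    out.push('"');
    out
}

fn main() {
    let literal = bytes_to_elisp_literal(&[b'a', 0xFC]);
    println!("{}", literal);
    assert_eq!(literal, "\"\\141\\374\"");
}
```

A literal written this way contains only ASCII characters and octal escapes, which matches the conditions under which a string is treated as unibyte, as described above.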