Emacs Lisp Strings

Strings in Emacs Lisp are somewhat difficult to deal with, for the following reasons:

- They can be either "unibyte" strings, which correspond to byte vectors in Scheme, or "multibyte" strings, which can handle Unicode. Whether a string is considered unibyte or multibyte depends on its contents; see Section 2.3.8.2, "Non-ASCII Characters in Strings" in the Emacs Lisp manual for details.
- The unibyte or multibyte classification depends not only on a string's contents, but also on the source it is read from.
- A multibyte string can include characters outside of the Unicode codepoint range. This happens, for instance, when the string includes a hexadecimal or octal escape that is interpreted as a single byte, potentially violating the encoding rules of the multibyte source.
- Emacs Lisp string syntax supports a multitude of escaping modes, some of which originate from representing keyboard event sequences in strings. Using these "keyboard-oriented" escapes inside strings is explicitly discouraged in the Emacs Lisp manual.
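To make the point about characters outside the Unicode range concrete: Emacs represents a raw byte in the range #x80 to #xFF inside a multibyte string as the character code #x3FFF00 plus the byte value, which lies well above the Unicode maximum of #x10FFFF. The following Rust sketch illustrates that mapping. The function name `emacs_raw_byte_code` is made up for this example and is not part of lexpr.

```rust
/// Character code Emacs uses for a raw (non-ASCII) byte inside a
/// multibyte string. Hypothetical helper for illustration only.
///
/// Returns `None` for ASCII bytes, which are ordinary characters.
pub fn emacs_raw_byte_code(byte: u8) -> Option<u32> {
    if byte < 0x80 {
        None
    } else {
        // Raw bytes occupy the range #x3FFF80..=#x3FFFFF.
        Some(0x3F_FF00 + u32::from(byte))
    }
}
```

Since `char::from_u32` rejects any value above #x10FFFF, such a code has no representation as a Rust `char`, which is why these strings need special treatment.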

lexpr deals with this complexity as follows:

The input source is always considered to be "multibyte" using the UTF-8 encoding; other encodings are not supported.
A string that mixes non-ASCII characters (written directly in the input or via escape sequences) with hexadecimal or octal escape sequences that produce a single byte outside of the ASCII range results in a parse error. For instance, the following string cannot be parsed by lexpr:

"\xFC\N{U+203D}"

Emacs, however, would parse this as a string containing the "character" sequence #x3ffffc, #x203d. Note that the first "character" is not a valid Unicode codepoint.
Strings containing only ASCII characters and at least one single-byte hexadecimal or octal escape will be parsed as byte vectors instead of strings. This mirrors the Emacs Lisp rules for when a string will be considered to be "unibyte".
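The parsing rules above can be summarized as a small decision procedure. The following Rust sketch is an illustration only, not lexpr's actual implementation: the `Element`, `Parsed`, and `MixedContent` types and the `classify` function are hypothetical names, and the input is assumed to be already split into decoded characters and single-byte escapes. The rules above do not spell out what happens when non-ASCII characters are combined with byte escapes inside the ASCII range; the sketch assumes the result stays a string.

```rust
/// One decoded element of a string literal (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Element {
    /// A character written literally or via a Unicode escape.
    Char(char),
    /// A single byte produced by a hexadecimal or octal escape.
    Byte(u8),
}

/// Result of parsing a string literal under the rules above.
#[derive(Debug, PartialEq)]
pub enum Parsed {
    String(String),
    Bytes(Vec<u8>),
}

/// Error for mixing non-ASCII characters with non-ASCII byte escapes.
#[derive(Debug, PartialEq)]
pub struct MixedContent;

pub fn classify(elements: &[Element]) -> Result<Parsed, MixedContent> {
    let has_byte_escape = elements.iter().any(|e| matches!(e, Element::Byte(_)));
    let has_non_ascii_char = elements
        .iter()
        .any(|e| matches!(e, Element::Char(c) if !c.is_ascii()));
    let has_non_ascii_byte = elements
        .iter()
        .any(|e| matches!(e, Element::Byte(b) if !b.is_ascii()));

    if has_non_ascii_char && has_non_ascii_byte {
        // Example: "\xFC\N{U+203D}"
        return Err(MixedContent);
    }
    if !has_non_ascii_char && has_byte_escape {
        // Only ASCII characters plus at least one byte escape: byte vector.
        let bytes: Vec<u8> = elements
            .iter()
            .map(|e| match *e {
                Element::Char(c) => c as u8, // lossless, c is ASCII here
                Element::Byte(b) => b,
            })
            .collect();
        return Ok(Parsed::Bytes(bytes));
    }
    // Otherwise a regular string; any byte escapes left are ASCII (assumption).
    let text: String = elements
        .iter()
        .map(|e| match *e {
            Element::Char(c) => c,
            Element::Byte(b) => char::from(b),
        })
        .collect();
    Ok(Parsed::String(text))
}
```

With this model, the elements of `"\xFC\N{U+203D}"` yield `Err(MixedContent)`, while the elements of `"a\xFC"` yield a byte vector containing `0x61` and `0xFC`.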

When producing S-expression text, byte vectors will always be represented as a sequence of octal-escaped bytes.
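As a rough illustration of that output form, the Rust sketch below renders a byte slice as a quoted sequence of octal escapes. The function name `octal_escape` is hypothetical, and zero-padding every byte to three octal digits is an assumption of this example; the exact text lexpr emits may differ.

```rust
/// Render bytes as a double-quoted string in which every byte is
/// written as a three-digit octal escape (illustrative sketch only).
pub fn octal_escape(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 4 + 2);
    out.push('"');
    for &byte in bytes {
        // `{:03o}` formats the byte in octal, zero-padded to 3 digits.
        out.push_str(&format!("\\{:03o}", byte));
    }
    out.push('"');
    out
}
```

For the bytes `0xFC` and `0x41` this produces the text `"\374\101"`.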
The escaping styles supported by lexpr are:
- Hexadecimal (\xN...) and octal (\N...)
- Unicode (\uNNNN, \U00NNNNNN)
- Named Unicode (\N{U+X...}). Note that the syntax that refers to codepoints by their full name (e.g. \N{LATIN SMALL LETTER A WITH GRAVE}) is deliberately not supported.
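The Unicode-oriented escapes in the list above all reduce to a hexadecimal codepoint. The Rust sketch below shows one way to decode them; `decode_unicode_escape` is a hypothetical helper written for this example, not a lexpr function, and it expects the complete escape text including the leading backslash.

```rust
/// Decode `\uNNNN`, `\U00NNNNNN`, or `\N{U+X...}` into a character.
/// Returns `None` for anything else, including the full-name form
/// such as `\N{LATIN SMALL LETTER A WITH GRAVE}`.
pub fn decode_unicode_escape(escape: &str) -> Option<char> {
    let hex = if let Some(rest) = escape.strip_prefix("\\N{U+") {
        rest.strip_suffix('}')?
    } else if let Some(rest) = escape.strip_prefix("\\u") {
        if rest.len() != 4 {
            return None;
        }
        rest
    } else if let Some(rest) = escape.strip_prefix("\\U") {
        if rest.len() != 8 {
            return None;
        }
        rest
    } else {
        return None;
    };
    // Reject empty input and signs, which `from_str_radix` would accept.
    if hex.is_empty() || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let code = u32::from_str_radix(hex, 16).ok()?;
    // `char::from_u32` rejects surrogates and values above #x10FFFF.
    char::from_u32(code)
}
```

Going through `char::from_u32` means an escape that names a value outside the Unicode range is rejected rather than turned into a non-Unicode "character".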

It is expected that these restrictions will not be an impediment when using S-expressions as a data exchange format between Emacs Lisp and Rust programs. In short, S-expressions produced by Rust should always be parsable by Emacs, and the other direction should work as long as no strings with non-Unicode "characters" are involved.
