From 0ae01abf8cf6333bb1cc4d19781b9432c97cd135 Mon Sep 17 00:00:00 2001
From: "aswins2108@gmail.com"
Date: Sun, 5 Mar 2023 17:43:46 +0530
Subject: [PATCH 01/13] Added info on symbolic tokens in design docs

---
 docs/design/lexical_conventions/README.md | 7 +-
 .../lexical_conventions/symbolic_tokens.md | 250 ++++++++++++++++++
 2 files changed, 255 insertions(+), 2 deletions(-)
 create mode 100644 docs/design/lexical_conventions/symbolic_tokens.md

diff --git a/docs/design/lexical_conventions/README.md b/docs/design/lexical_conventions/README.md
index 1acef7e6bf46e..e7a60cb2eadca 100644
--- a/docs/design/lexical_conventions/README.md
+++ b/docs/design/lexical_conventions/README.md
@@ -10,7 +10,9 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 ## Table of contents
-- [Lexical elements](#lexical-elements)
+- [Lexical conventions](#lexical-conventions)
+ - [Table of contents](#table-of-contents)
+ - [Lexical elements](#lexical-elements)
@@ -27,7 +29,8 @@ A _lexical element_ is one of the following:
 - a literal:
 - a [numeric literal](numeric_literals.md)
 - a [string literal](string_literals.md)
-- TODO: operators, comments, ...
+- TODO: comments, ...
+- a [symbolic token](symbolic_tokens.md)
 The sequence of lexical elements is formed by repeatedly removing the longest
 initial sequence of characters that forms a valid lexical element.
diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
new file mode 100644
index 0000000000000..22e191a432943
--- /dev/null
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -0,0 +1,250 @@
+
+
+# Symbolic Tokens
+
+
+
+
+
+
+## Table of contents
+
+- Overview
+- Details
+- Alternatives considered
+- References
+
+
+
+## Overview
+
+Symbolic tokens are a set of tokens used to represent operators. Operators are
+one use of symbolic tokens, but they are also used in patterns (:), declarations
+(-> to indicate return type, to separate parameters), statements (;, =, and so
+on), and other places (, to separate function call arguments).
+
+Some languages have a fixed set of symbolic tokens, For example:
+[C++ operators](https://eel.is/c++draft/lex.operators) and
+[rust operators](https://doc.rust-lang.org/book/appendix-02-operators.html).
+While some others have extensible rules for defining operators, including the
+facility for a developer to define operators that aren't part of the
+baselanguage. For example:
+[Swift operator rules](https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html#ID418),
+[Haskell operator rules](https://www.haskell.org/onlinereport/haskell2010/haskellch2.html#dx7-18008).
+
+Carbon has a fixed set of tokens that represent operators, defined by the
+language specification. Developers cannot define new tokens to represent new
+operators.
+
+Symbolic tokens are lexed using a "max munch" rule: at each lexing step, the
+longest symbolic token defined by the language specification that appears
+starting at the current input position is lexed, if any.
+
+Not all uses of symbolic tokens within the Carbon grammar will be treated as
+operators. For example, `(` and `)` tokens serves to delimit various grammar
+productions, and we may not want to consider `.` to be an operator, because its
+right "operand" is not an expression.
+
+The presence or absence of whitespace around the symbolic token is used to
+determine its fixity, in the same way we expect a human reader to recognize
+them. For example, we want `a* - 4` to treat the `*` as a unary operator and the
+`-` as a binary operator, while `a * -4` treats `*` as a mathematical operation
+and `-` as the negative sign. Hence we can say that the whitespaces plays a
+really important role here, and we use some rules to avoid confusion:
+
+- There can be no whitespace between a unary operator and its operand.
+- The whitespace around a binary operator must be consistent: either there is
+ whitespace on both sides or on neither side.
+- If there is whitespace on neither side of a binary operator, the token
+ before the operator must be an identifier, a literal, or any kind of closing
+ bracket (for example, `)`, `]`, or `}`), and the token after the operator
+ must be an identifier, a literal, or any kind of opening bracket (for
+ example, `(`, `[`, or `{`).
+
+## Details
+
+Symbolic tokens are intended to be used for widely-recognized operators, such as
+the mathematical operators `+`, `*`, `<`, and so on. Those used as operators
+would generally be expected to also be meaningful for some user-defined types,
+and should be candidates for being made overloadable once we support operator
+overloading.
+
+### Symbolic token list
+
+The following is the initial list of symbolic tokens recognized in a Carbon
+source file:
+
+| Token | Explanation |
+| ----- | ---------------------------------------------------------------------------------------------------------- |
+| `*` | Indirection, multiplication, and forming pointers |
+| `&` | Address-of or Bitwise AND |
+| `=` | Assignment |
+| `->` | Return type and `p->x` equivalent to `(*p).x` (in C++) |
+| `=>` | Match syntax |
+| `[]` | Subscript |
+| `()` | Function call and function declaration |
+| `{}` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
+| `,` | Separate arguments in a function call, elements of a tuple, or parameters of a function declaration |
+| `.` | Member access |
+| `:` | Scope |
+
+This list is expected to grow over time as more symbolic tokens are required by
+language proposals.
+
+Note: The above list only covers up to
+[#601](https://github.com/carbon-language/carbon-lang/pull/601) and more have
+been added since that are not reflected here.
+
+### Whitespace
+
+to support the use of the same symbolic token as a prefix operator, an infix
+operator, and a postfix operator (in some cases) we want a rule that allows us
+to simply and unambiguously parse operators that might have all three fixities.
+
+For example, given the expression `a * - b`, there are two possible parses:
+
+- As `a * (- b)`, multiplying `a` by the negation of `b`.
+- As `(a *) - b`, subtracting `b` from the pointer type `a *`.
+
+The chosen rule to distinguish such cases is to consider the presence or absence
+of whitespace, as it strikes a good balance between simplicity and
+expressiveness for the programmer and simplicity and good support for error
+recovery in the implementation. Hence `a * -b` uses the first interpretation,
+`a* - b` uses the second interpretation, and other combinations (`a*-b`,
+`a *- b`, `a* -b`, `a * - b`, `a*- b`, `a *-b`) are rejected as errors.
+
+We require whitespace to be present or absent around the operator to indicate
+its fixity, as this is a cue that a human reader would use to understand the
+code: binary operators have whitespace on both sides, and unary operators lack
+whitespace between the operator and its operand.
+
+But in some cases omitting the whitespace around a binary operator aids
+readability, such as in expressions like `2*x*x + 3*x + 1`, hence we have an
+allowance in such cases. In this case the operator with whitespace on neither
+side, if the token immediately before the operator indicates it is the end of an
+operand, and the token immediately after the operator indicates it is the
+beginning of an operand, the operator is treated as binary.
+
+The defined set of tokens that constitutes the beginning or end of an operand
+are:
+
+- Identifiers, as in `x*x + y*y`.
+- Literals, as in `3*x + 4*y` or `"foo"+s`.
+- Brackets of any kind, facing away from the operator, as in `f()*(n + 3)` or
+ `args[3]*{.real=4, .imag=1}`.
+
+For error recovery purposes, this rule functions best if no expression context
+can be preceded by a token that looks like the end of an operand and no
+expression context can be followed by a token that looks like the start of an
+operand. One known exception to this is in function definitions:
+
+```
+fn F(p: Int *) -> Int * { return p; }
+```
+
+Both occurrences of `Int *` here are erroneous. The first is easy to detect and
+diagnose, but the second is more challenging, if `{...}` is a valid expression
+form. We expect to be able to easily distinguish between code blocks starting
+with `{` and expressions starting with `{` for all cases other than `{}`.
+However, the code block `{}` is not a reasonable body for a function with a
+return type, so we expect errors involving a combination of misplaced whitespace
+and `{}` to be rare, and we should be able to recover well from the remaining
+cases.
+
+From the perspective of token formation, the whitespace rule means that there
+are four _variants_ of each symbolic token:
+
+- A symbolic token with whitespace on both sides is a _binary_ variant of the
+ token.
+- A symbolic token with whitespace on neither side, where the preceding token
+ is an identifier, literal, or closing bracket, and the following token is an
+ identifier, literal, or `(`, is also a _binary_ variant of the token.
+- A symbolic token with whitespace on neither side that does not satisfy the
+ preceding rule is a _unary_ variant of the token.
+- A symbolic token with whitespace on the left side only is a _prefix_ variant
+ of the token.
+- A symbolic token with whitespace on the right side only is a _postfix_
+ variant of the token.
+
+When used in non-operator contexts, any variant of a symbolic token is
+acceptable. When used in operator contexts, only a binary variant of a token can
+be used as a binary operator, only a prefix or unary variant of a token can be
+used as a prefix operator, and only a postfix or unary variant of a token can be
+used as a postfix operator.
+
+## Alternatives considered
+
+- Lexing the longest sequence of symbolic characters rather than lexing only
+ the longest known operator.
+
+ Advantages - Adding new operators could be done without any change to the
+ lexing rules. - If unknown operators are rejected, adding new operators
+ would carry no risk of changing the meaning of existing valid code.
+
+ Disadvantages:
+
+ - Sequences of prefix or postfix operators would require parentheses or
+ whitespace. For example, `Int**` would lex as `Int` followed by a single
+ `**` token, and `**p` would lex as a single `**` token followed by `p`,
+ if there is no `**` operator. While we could define `**`, `***`, and so
+ on as operators, doing so would add complexity and inconsistency to the
+ language rules.
+
+- Supporting an extensible operator set, giving the developer the option to
+ add new operators. Advantages:
+
+ - This would increase expressivity, especially for embedded
+ domain-specific languages.
+
+ Disadvantages:
+
+ - This would harm readability, at least for those unfamiliar with the code
+ using the operators.
+ - This could harm our ability to evolve the language, by admitting the
+ possibility of a custom operator colliding with a newly-introduced
+ standard operator, although this risk could be reduced by providing a
+ separate lexical syntax for custom operators.
+ - We would need to either lex the longest sequence of symbolic characters
+ we can, which has the same disadvantage discussed for that approach
+ above, or use a more sophisticated rule to determine how to split
+ operators -- perhaps based on what operator overloads are in scope --
+ increasing complexity.
+
+- We could apply different whitespace restrictions or no whitespace
+ restrictions
+ ([#520](https://github.com/carbon-language/carbon-lang/issues/520)). We
+ could require whitespace around a binary operator followed by `[` or `{`. In
+ particular, for examples such as:
+
+ ```
+ fn F() -> Int*{ return Null; }
+ var n: Int = pointer_to_array^[i];
+ ```
+
+ This would allow us to form a unary operator instead of a binary operator,
+ which is likely to be more in line with the developer's expectations.
+
+ Advantages:
+
+ - Room to add a postfix `^` dereference operator, or similarly any other
+ postfix operator producing an array, without creating surprises for
+ pointers to arrays.
+ - Allows the whitespace before the `{` of a function body to be
+ consistently omitted if desired.
+
+ Disadvantages:
+
+ - The rule would be more complex, and would be asymmetric: we must allow
+ closing square brackets before unspaced binary operators to permit
+ things like `arr[i]*3`.
+ - Would interact badly with expression forms that begin with a `[` or `{`,
+ for example `Time.Now()+{.seconds = 3}` or `names+["Lrrr"]`.
+
+## References
+
+- Proposal
+ [#601: Symbolic tokens](https://github.com/carbon-language/carbon-lang/pull/601)

From 6b6aac37f7406943297aefa9a6273785ec39381e Mon Sep 17 00:00:00 2001
From: Aswin Shailajan <72661784+aswin2108@users.noreply.github.com>
Date: Sun, 5 Mar 2023 21:27:23 +0530
Subject: [PATCH 02/13] Fixed typos!
Co-authored-by: Avi Aaron <81820388+aviRon012@users.noreply.github.com>
---
 docs/design/lexical_conventions/symbolic_tokens.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index 22e191a432943..d766a4555afe1 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -23,9 +23,9 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 ## Overview
 Symbolic tokens are a set of tokens used to represent operators. Operators are
-one use of symbolic tokens, but they are also used in patterns (:), declarations
-(-> to indicate return type, to separate parameters), statements (;, =, and so
-on), and other places (, to separate function call arguments).
+one use of symbolic tokens, but they are also used in patterns `:`, declarations
+(`->` to indicate return type, `,` to separate parameters), statements (`;`, `=`, and so
+on), and other places (`,` to separate function call arguments).
 Some languages have a fixed set of symbolic tokens, For example:
 [C++ operators](https://eel.is/c++draft/lex.operators) and

From 73f8d5035c1501fc4853e38e5a1973c6b00ad21f Mon Sep 17 00:00:00 2001
From: Aswin Shailajan
Date: Mon, 13 Mar 2023 16:58:00 +0530
Subject: [PATCH 03/13] Fixed pre-commit issues

---
 docs/design/lexical_conventions/README.md | 4 +-
 .../lexical_conventions/symbolic_tokens.md | 83 ++-----------------
 2 files changed, 10 insertions(+), 77 deletions(-)

diff --git a/docs/design/lexical_conventions/README.md b/docs/design/lexical_conventions/README.md
index e7a60cb2eadca..b18acfd08f090 100644
--- a/docs/design/lexical_conventions/README.md
+++ b/docs/design/lexical_conventions/README.md
@@ -10,9 +10,7 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 ## Table of contents
-- [Lexical conventions](#lexical-conventions)
- - [Table of contents](#table-of-contents)
- - [Lexical elements](#lexical-elements)
+- [Lexical elements](#lexical-elements)
diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index d766a4555afe1..3916968569e19 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -1,5 +1,3 @@
-
-
 # Symbolic Tokens
-
 ## Table of contents
-- Overview
-- Details
-- Alternatives considered
-- References
+- [Overview](#overview)
+- [Details](#details)
+ - [Symbolic token list](#symbolic-token-list)
+ - [Whitespace](#whitespace)
+- [Alternatives considered](#alternatives-considered)
+- [References](#references)
@@ -24,8 +23,8 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 Symbolic tokens are a set of tokens used to represent operators. Operators are
 one use of symbolic tokens, but they are also used in patterns `:`, declarations
-(`->` to indicate return type, `,` to separate parameters), statements (`;`, `=`, and so
-on), and other places (`,` to separate function call arguments).
+(`->` to indicate return type, `,` to separate parameters), statements (`;`,
+`=`, and so on), and other places (`,` to separate function call arguments).
 Some languages have a fixed set of symbolic tokens, For example:
 [C++ operators](https://eel.is/c++draft/lex.operators) and
@@ -178,71 +177,7 @@ used as a postfix operator.
 ## Alternatives considered
-- Lexing the longest sequence of symbolic characters rather than lexing only
- the longest known operator.
-
- Advantages - Adding new operators could be done without any change to the
- lexing rules. - If unknown operators are rejected, adding new operators
- would carry no risk of changing the meaning of existing valid code.
-
- Disadvantages:
-
- - Sequences of prefix or postfix operators would require parentheses or
- whitespace. For example, `Int**` would lex as `Int` followed by a single
- `**` token, and `**p` would lex as a single `**` token followed by `p`,
- if there is no `**` operator. While we could define `**`, `***`, and so
- on as operators, doing so would add complexity and inconsistency to the
- language rules.
-
-- Supporting an extensible operator set, giving the developer the option to
- add new operators. Advantages:
-
- - This would increase expressivity, especially for embedded
- domain-specific languages.
-
- Disadvantages:
-
- - This would harm readability, at least for those unfamiliar with the code
- using the operators.
- - This could harm our ability to evolve the language, by admitting the
- possibility of a custom operator colliding with a newly-introduced
- standard operator, although this risk could be reduced by providing a
- separate lexical syntax for custom operators.
- - We would need to either lex the longest sequence of symbolic characters
- we can, which has the same disadvantage discussed for that approach
- above, or use a more sophisticated rule to determine how to split
- operators -- perhaps based on what operator overloads are in scope --
- increasing complexity.
-
-- We could apply different whitespace restrictions or no whitespace
- restrictions
- ([#520](https://github.com/carbon-language/carbon-lang/issues/520)). We
- could require whitespace around a binary operator followed by `[` or `{`. In
- particular, for examples such as:
-
- ```
- fn F() -> Int*{ return Null; }
- var n: Int = pointer_to_array^[i];
- ```
-
- This would allow us to form a unary operator instead of a binary operator,
- which is likely to be more in line with the developer's expectations.
-
- Advantages:
-
- - Room to add a postfix `^` dereference operator, or similarly any other
- postfix operator producing an array, without creating surprises for
- pointers to arrays.
- - Allows the whitespace before the `{` of a function body to be
- consistently omitted if desired.
-
- Disadvantages:
-
- - The rule would be more complex, and would be asymmetric: we must allow
- closing square brackets before unspaced binary operators to permit
- things like `arr[i]*3`.
- - Would interact badly with expression forms that begin with a `[` or `{`,
- for example `Time.Now()+{.seconds = 3}` or `names+["Lrrr"]`.
+- [Proposal: p0601](/proposals/p0601.md#alternatives-considered)
 ## References

From 8206dab5143b5554c705416aaf285de2d1bb9249 Mon Sep 17 00:00:00 2001
From: Aswin Shailajan
Date: Sat, 1 Apr 2023 16:23:16 +0530
Subject: [PATCH 04/13] Added reviewed changes and revamped whitespace section

---
 docs/design/lexical_conventions/README.md | 2 +
 .../lexical_conventions/symbolic_tokens.md | 157 ++++++------------
 2 files changed, 55 insertions(+), 104 deletions(-)

diff --git a/docs/design/lexical_conventions/README.md b/docs/design/lexical_conventions/README.md
index 78a5e775095d5..32d660b6c83c9 100644
--- a/docs/design/lexical_conventions/README.md
+++ b/docs/design/lexical_conventions/README.md
@@ -25,8 +25,10 @@ A _lexical element_ is one of the following:
 - a maximal sequence of [whitespace](whitespace.md) characters
 - a [word](words.md)
 - a literal:
+
 - a [numeric literal](numeric_literals.md)
 - a [string literal](string_literals.md)
+
 - a [comment](comments.md)
 - a [symbolic token](symbolic_tokens.md)
diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index 3916968569e19..a9f08fe91a65e 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -21,19 +21,13 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 ## Overview
-Symbolic tokens are a set of tokens used to represent operators. Operators are
-one use of symbolic tokens, but they are also used in patterns `:`, declarations
-(`->` to indicate return type, `,` to separate parameters), statements (`;`,
-`=`, and so on), and other places (`,` to separate function call arguments).
-
-Some languages have a fixed set of symbolic tokens, For example:
-[C++ operators](https://eel.is/c++draft/lex.operators) and
-[rust operators](https://doc.rust-lang.org/book/appendix-02-operators.html).
-While some others have extensible rules for defining operators, including the
-facility for a developer to define operators that aren't part of the
-baselanguage. For example:
-[Swift operator rules](https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html#ID418),
-[Haskell operator rules](https://www.haskell.org/onlinereport/haskell2010/haskellch2.html#dx7-18008).
+A _symbolic token_ is one of a fixed set of
+[tokens](https://en.wikipedia.org/wiki/Lexical_analysis#Token) that consist of
+characters that are not valid in identifiers, that is they are tokens consisting
+of symbols, not letters or numbers. Operators are one use of symbolic tokens,
+but they are also used in patterns `:`, declarations (`->` to indicate return
+type, `,` to separate parameters), statements (`;`, `=`, and so on), and other
+places (`,` to separate function call arguments).
 Carbon has a fixed set of tokens that represent operators, defined by the
 language specification. Developers cannot define new tokens to represent new
@@ -43,17 +37,8 @@ Symbolic tokens are lexed using a "max munch" rule: at each lexing step, the
 longest symbolic token defined by the language specification that appears
 starting at the current input position is lexed, if any.
-Not all uses of symbolic tokens within the Carbon grammar will be treated as
-operators. For example, `(` and `)` tokens serves to delimit various grammar
-productions, and we may not want to consider `.` to be an operator, because its
-right "operand" is not an expression.
-
-The presence or absence of whitespace around the symbolic token is used to
-determine its fixity, in the same way we expect a human reader to recognize
-them. For example, we want `a* - 4` to treat the `*` as a unary operator and the
-`-` as a binary operator, while `a * -4` treats `*` as a mathematical operation
-and `-` as the negative sign. Hence we can say that the whitespaces plays a
-really important role here, and we use some rules to avoid confusion:
+When a symbolic token is used as an operator, the surrounding whitespace must
+follow certain rules:
 - There can be no whitespace between a unary operator and its operand.
 - The whitespace around a binary operator must be consistent: either there is
@@ -82,14 +67,18 @@ source file:
 | `*` | Indirection, multiplication, and forming pointers |
 | `&` | Address-of or Bitwise AND |
 | `=` | Assignment |
-| `->` | Return type and `p->x` equivalent to `(*p).x` (in C++) |
+| `->` | Return type and indirect member access |
 | `=>` | Match syntax |
-| `[]` | Subscript |
-| `()` | Function call and function declaration |
-| `{}` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
-| `,` | Separate arguments in a function call, elements of a tuple, or parameters of a function declaration |
+| `[` | Subscript and used for deduced parameter lists |
+| `]` | Subscript and used for deduced parameter lists |
+| `(` | Separate tuple and struct elements |
+| `)` | Separate tuple and struct elements |
+| `{` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
+| `}` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
+| `,` | Separate tuple and struct elements |
 | `.` | Member access |
-| `:` | Scope |
+| `:` | Name bindings |
+| `;` | Name bindings |
 This list is expected to grow over time as more symbolic tokens are required by
 language proposals.
@@ -100,80 +89,40 @@ been added since that are not reflected here.
 ### Whitespace
-to support the use of the same symbolic token as a prefix operator, an infix
-operator, and a postfix operator (in some cases) we want a rule that allows us
-to simply and unambiguously parse operators that might have all three fixities.
-
-For example, given the expression `a * - b`, there are two possible parses:
-
-- As `a * (- b)`, multiplying `a` by the negation of `b`.
-- As `(a *) - b`, subtracting `b` from the pointer type `a *`.
-
-The chosen rule to distinguish such cases is to consider the presence or absence
-of whitespace, as it strikes a good balance between simplicity and
-expressiveness for the programmer and simplicity and good support for error
-recovery in the implementation. Hence `a * -b` uses the first interpretation,
-`a* - b` uses the second interpretation, and other combinations (`a*-b`,
-`a *- b`, `a* -b`, `a * - b`, `a*- b`, `a *-b`) are rejected as errors.
-
-We require whitespace to be present or absent around the operator to indicate
-its fixity, as this is a cue that a human reader would use to understand the
-code: binary operators have whitespace on both sides, and unary operators lack
-whitespace between the operator and its operand.
-
-But in some cases omitting the whitespace around a binary operator aids
-readability, such as in expressions like `2*x*x + 3*x + 1`, hence we have an
-allowance in such cases. In this case the operator with whitespace on neither
-side, if the token immediately before the operator indicates it is the end of an
-operand, and the token immediately after the operator indicates it is the
-beginning of an operand, the operator is treated as binary.
-
-The defined set of tokens that constitutes the beginning or end of an operand
-are:
-
-- Identifiers, as in `x*x + y*y`.
-- Literals, as in `3*x + 4*y` or `"foo"+s`.
-- Brackets of any kind, facing away from the operator, as in `f()*(n + 3)` or
- `args[3]*{.real=4, .imag=1}`.
-
-For error recovery purposes, this rule functions best if no expression context
-can be preceded by a token that looks like the end of an operand and no
-expression context can be followed by a token that looks like the start of an
-operand. One known exception to this is in function definitions:
-
-```
-fn F(p: Int *) -> Int * { return p; }
-```
-
-Both occurrences of `Int *` here are erroneous. The first is easy to detect and
-diagnose, but the second is more challenging, if `{...}` is a valid expression
-form. We expect to be able to easily distinguish between code blocks starting
-with `{` and expressions starting with `{` for all cases other than `{}`.
-However, the code block `{}` is not a reasonable body for a function with a
-return type, so we expect errors involving a combination of misplaced whitespace
-and `{}` to be rare, and we should be able to recover well from the remaining
-cases.
-
-From the perspective of token formation, the whitespace rule means that there
-are four _variants_ of each symbolic token:
-
-- A symbolic token with whitespace on both sides is a _binary_ variant of the
- token.
-- A symbolic token with whitespace on neither side, where the preceding token
- is an identifier, literal, or closing bracket, and the following token is an
- identifier, literal, or `(`, is also a _binary_ variant of the token.
-- A symbolic token with whitespace on neither side that does not satisfy the
- preceding rule is a _unary_ variant of the token.
-- A symbolic token with whitespace on the left side only is a _prefix_ variant
- of the token.
-- A symbolic token with whitespace on the right side only is a _postfix_
- variant of the token.
-
-When used in non-operator contexts, any variant of a symbolic token is
-acceptable. When used in operator contexts, only a binary variant of a token can
-be used as a binary operator, only a prefix or unary variant of a token can be
-used as a prefix operator, and only a postfix or unary variant of a token can be
-used as a postfix operator.
+Carbon's rule for whitespace around operators have been designed to allow the
+same symbolic token to be used as a prefix operator, infix operator, and postfix
+operator in some cases. To make parsing operators unambiguous, we require
+whitespace to be present or absent around the operator to indicate its fixity,
+with binary operators having whitespace on both sides, and unary operators
+lacking whitespace between the operator and its operand. However, there are some
+cases where omitting whitespace around a binary operator can aid readability,
+such as in expressions like `2*x*x + 3*x + 1`. In such cases, the operator with
+whitespace on neither side is treated as binary if the token immediately before
+the operator indicates the end of an operand and the token immediately after
+indicates the beginning of an operand.
+
+Identifiers, literals, and brackets of any kind, facing away from the operator,
+are defined as tokens that indicate the beginning or end of an operand. For
+error recovery purposes, no expression context can be preceded by a token that
+looks like the end of an operand, and no expression context can be followed by a
+token that looks like the start of an operand, except in function definitions
+where `{}` is the body of the function.
+
+From the perspective of token formation, there are four variants of each
+symbolic token: a binary variant with whitespace on both sides, a binary variant
+with whitespace on neither side, a unary variant with whitespace on neither
+side, and prefix and postfix variants with whitespace on the left and right
+sides, respectively. In non-operator contexts, any variant of a symbolic token
+is acceptable, but in operator contexts, only the appropriate variant can be
+used.
+
+The whitespace rule was designed to strike a balance between simplicity and
+expressiveness for the programmer, and simplicity and good support for error
+recovery in the implementation. The rule's allowance for omitting whitespace
+around binary operators aids readability, but it can cause errors if not used
+carefully, particularly in function definitions. Despite this, the rule provides
+the necessary cues for human readers to understand the code, while still
+allowing for unambiguous parsing of operators.
 ## Alternatives considered

From 64508f7a72d18902f44794ae80613a4feed0463c Mon Sep 17 00:00:00 2001
From: Aswin Shailajan
Date: Tue, 18 Apr 2023 19:10:58 +0530
Subject: [PATCH 05/13] Resolved some reviews

---
 .../lexical_conventions/symbolic_tokens.md | 76 +++++--------------
 1 file changed, 18 insertions(+), 58 deletions(-)

diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index a9f08fe91a65e..136df89279d96 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -13,7 +13,6 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 - [Overview](#overview)
 - [Details](#details)
 - [Symbolic token list](#symbolic-token-list)
- - [Whitespace](#whitespace)
 - [Alternatives considered](#alternatives-considered)
 - [References](#references)
@@ -23,15 +22,14 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 A _symbolic token_ is one of a fixed set of
 [tokens](https://en.wikipedia.org/wiki/Lexical_analysis#Token) that consist of
-characters that are not valid in identifiers, that is they are tokens consisting
-of symbols, not letters or numbers. Operators are one use of symbolic tokens,
-but they are also used in patterns `:`, declarations (`->` to indicate return
-type, `,` to separate parameters), statements (`;`, `=`, and so on), and other
-places (`,` to separate function call arguments).
+characters that are not valid in identifiers. That is, they are tokens
+consisting of symbols, not letters or numbers. Operators are one use of symbolic
+tokens, but they are also used in patterns `:`, declarations (`->` to indicate
+return type, `,` to separate parameters), statements (`;`, `=`, and so on), and
+other places (`,` to separate function call arguments).
-Carbon has a fixed set of tokens that represent operators, defined by the
-language specification. Developers cannot define new tokens to represent new
-operators.
+Carbon has a fixed set of symbolic tokens, defined by the language
+specification. Developers cannot define new symbolic tokens in their own code.
 Symbolic tokens are lexed using a "max munch" rule: at each lexing step, the
 longest symbolic token defined by the language specification that appears
@@ -49,6 +47,9 @@ follow certain rules:
 must be an identifier, a literal, or any kind of opening bracket (for
 example, `(`, `[`, or `{`).
+These rules enable us to use a token like `*` as a prefix, infix, and postfix
+operator, without creating ambiguity.
+
 ## Details
 Symbolic tokens are intended to be used for widely-recognized operators, such as
@@ -69,60 +70,19 @@ source file:
 | `=` | Assignment |
 | `->` | Return type and indirect member access |
 | `=>` | Match syntax |
-| `[` | Subscript and used for deduced parameter lists |
-| `]` | Subscript and used for deduced parameter lists |
-| `(` | Separate tuple and struct elements |
-| `)` | Separate tuple and struct elements |
+| `[` | Subscript and deduced parameter lists |
+| `]` | Subscript and deduced parameter lists |
+| `(` | Function call, function declaration and tuple literals |
+| `)` | Function call, function declaration and tuple literals |
 | `{` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
 | `}` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
 | `,` | Separate tuple and struct elements |
 | `.` | Member access |
 | `:` | Name bindings |
-| `;` | Name bindings |
-
-This list is expected to grow over time as more symbolic tokens are required by
-language proposals.
-
-Note: The above list only covers up to
-[#601](https://github.com/carbon-language/carbon-lang/pull/601) and more have
-been added since that are not reflected here.
-
-### Whitespace
-
-Carbon's rule for whitespace around operators have been designed to allow the
-same symbolic token to be used as a prefix operator, infix operator, and postfix
-operator in some cases. To make parsing operators unambiguous, we require
-whitespace to be present or absent around the operator to indicate its fixity,
-with binary operators having whitespace on both sides, and unary operators
-lacking whitespace between the operator and its operand. However, there are some
-cases where omitting whitespace around a binary operator can aid readability,
-such as in expressions like `2*x*x + 3*x + 1`. In such cases, the operator with
-whitespace on neither side is treated as binary if the token immediately before
-the operator indicates the end of an operand and the token immediately after
-indicates the beginning of an operand.
-
-Identifiers, literals, and brackets of any kind, facing away from the operator,
-are defined as tokens that indicate the beginning or end of an operand. For
-error recovery purposes, no expression context can be preceded by a token that
-looks like the end of an operand, and no expression context can be followed by a
-token that looks like the start of an operand, except in function definitions
-where `{}` is the body of the function.
-
-From the perspective of token formation, there are four variants of each
-symbolic token: a binary variant with whitespace on both sides, a binary variant
-with whitespace on neither side, a unary variant with whitespace on neither
-side, and prefix and postfix variants with whitespace on the left and right
-sides, respectively. In non-operator contexts, any variant of a symbolic token
-is acceptable, but in operator contexts, only the appropriate variant can be
-used.
- -Note: The above list only covers up to -[#601](https://github.com/carbon-language/carbon-lang/pull/601) and more have -been added since that are not reflected here. - -### Whitespace - -Carbon's rule for whitespace around operators have been designed to allow the -same symbolic token to be used as a prefix operator, infix operator, and postfix -operator in some cases. To make parsing operators unambiguous, we require -whitespace to be present or absent around the operator to indicate its fixity, -with binary operators having whitespace on both sides, and unary operators -lacking whitespace between the operator and its operand. However, there are some -cases where omitting whitespace around a binary operator can aid readability, -such as in expressions like `2*x*x + 3*x + 1`. In such cases, the operator with -whitespace on neither side is treated as binary if the token immediately before -the operator indicates the end of an operand and the token immediately after -indicates the beginning of an operand. - -Identifiers, literals, and brackets of any kind, facing away from the operator, -are defined as tokens that indicate the beginning or end of an operand. For -error recovery purposes, no expression context can be preceded by a token that -looks like the end of an operand, and no expression context can be followed by a -token that looks like the start of an operand, except in function definitions -where `{}` is the body of the function. - -From the perspective of token formation, there are four variants of each -symbolic token: a binary variant with whitespace on both sides, a binary variant -with whitespace on neither side, a unary variant with whitespace on neither -side, and prefix and postfix variants with whitespace on the left and right -sides, respectively. In non-operator contexts, any variant of a symbolic token -is acceptable, but in operator contexts, only the appropriate variant can be -used. 
- -The whitespace rule was designed to strike a balance between simplicity and -expressiveness for the programmer, and simplicity and good support for error -recovery in the implementation. The rule's allowance for omitting whitespace -around binary operators aids readability, but it can cause errors if not used -carefully, particularly in function definitions. Despite this, the rule provides -the necessary cues for human readers to understand the code, while still -allowing for unambiguous parsing of operators. +| `;` | Statement separator | + +TODO: Arithmetic operators, Bitwise operators, Comparison operators & :! +[#2657](https://github.com/carbon-language/carbon-lang/pull/2657/files#r1137826711) ## Alternatives considered From f726bd1b6ea8fa1427a1a7b2438e9be6bf3dd87c Mon Sep 17 00:00:00 2001 From: Aswin Shailajan Date: Sat, 29 Apr 2023 13:41:10 +0530 Subject: [PATCH 06/13] Added missing tokens to the table --- .../lexical_conventions/symbolic_tokens.md | 49 ++++++++++++------- 1 file changed, 32 insertions(+), 17 deletions(-) diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md index 136df89279d96..dfcb107552902 100644 --- a/docs/design/lexical_conventions/symbolic_tokens.md +++ b/docs/design/lexical_conventions/symbolic_tokens.md @@ -63,23 +63,38 @@ overloading. 
 The following is the initial list of symbolic tokens recognized in a Carbon
 source file:

-| Token | Explanation |
-| ----- | ---------------------------------------------------------------------------------------------------------- |
-| `*` | Indirection, multiplication, and forming pointers |
-| `&` | Address-of or Bitwise AND |
-| `=` | Assignment |
-| `->` | Return type and indirect member access |
-| `=>` | Match syntax |
-| `[` | Subscript and deduced parameter lists |
-| `]` | Subscript and deduced parameter lists |
-| `(` | Function call, function declaration and tuple literals |
-| `)` | Function call, function declaration and tuple literals |
-| `{` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
-| `}` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
-| `,` | Separate tuple and struct elements |
-| `.` | Member access |
-| `:` | Name bindings |
-| `;` | Statement separator |
+| Symbolic Tokens and Explanation |
+| -------------------------------------------------------------------------------------------------------------- | ------------ |
+| `+` Addition |
+| `-` Subtraction and negation |
+| `*` Indirection, multiplication, and forming pointers |
+| `/` Division |
+| `%` Modulus |
+| `=` Assignment |
+| `^` Complementing and Bitwise XOR |
+| `&` Address-of and Bitwise AND |
+| ` | ` Bitwise OR |
+| `<<` Arithmetic and Logical Left-shift |
+| `>>` Arithmetic and Logical Right-shift |
+| `==` Equality or equal to |
+| `!=` Inequality or not equal to |
+| `>` Greater than |
+| `>=` Greater than or equal to |
+| `<` Less than |
+| `<=` Less than or equal to |
+| `->` Return type and indirect member access |
+| `=>` Match syntax |
+| `[` Subscript and deduced parameter lists |
+| `]` Subscript and deduced parameter lists |
+| `(` Function call, function declaration and tuple literals |
+| `)` Function call, function declaration and tuple literals |
+| `{` Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
+| `}` Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
+| `,` Separate tuple and struct elements |
+| `.` Member access |
+| `:` Name bindings |
+| `;` Statement separator |
+| `:!` Type-checking |

 TODO: Arithmetic operators, Bitwise operators, Comparison operators & :!
 [#2657](https://github.com/carbon-language/carbon-lang/pull/2657/files#r1137826711)

From e0c7c70d8d7927dda39510a36c2e4fdf49458036 Mon Sep 17 00:00:00 2001
From: Aswin Shailajan
Date: Fri, 12 May 2023 06:23:23 +0530
Subject: [PATCH 07/13] Fixed the table

---
 docs/design/lexical_conventions/symbolic_tokens.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index dfcb107552902..93b5db8d48dcb 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -64,16 +64,16 @@ The following is the initial list of symbolic tokens recognized in a Carbon
 source file:

 | Symbolic Tokens and Explanation |
-| -------------------------------------------------------------------------------------------------------------- | ------------ |
+| -------------------------------------------------------------------------------------------------------------- |
 | `+` Addition |
 | `-` Subtraction and negation |
-| `*` Indirection, multiplication, and forming pointers |
+| `*` Indirection, multiplication, and forming pointer types |
 | `/` Division |
 | `%` Modulus |
 | `=` Assignment |
 | `^` Complementing and Bitwise XOR |
 | `&` Address-of and Bitwise AND |
-| ` | ` Bitwise OR |
+| `\|` Bitwise OR |
 | `<<` Arithmetic and Logical Left-shift |
 | `>>` Arithmetic and Logical Right-shift |
 | `==` Equality or equal to |
@@ -93,8 +93,8 @@ source file:
 | `,` Separate tuple and struct elements |
 | `.` Member access |
 | `:` Name bindings |
+| `:!` Generic binding |
 | `;` Statement separator |
-| `:!` Type-checking |

 TODO: Arithmetic operators, Bitwise operators, Comparison operators & :!
 [#2657](https://github.com/carbon-language/carbon-lang/pull/2657/files#r1137826711)

From 9273b02659693f843971f7f8a1b2f22d563273e8 Mon Sep 17 00:00:00 2001
From: Aswin Shailajan
Date: Sat, 13 May 2023 08:34:07 +0530
Subject: [PATCH 08/13] Added missing seperators

---
 .../lexical_conventions/symbolic_tokens.md | 67 +++++++++----------
 1 file changed, 32 insertions(+), 35 deletions(-)

diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index 93b5db8d48dcb..b33824a51ccd7 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -63,41 +63,38 @@ overloading.
 The following is the initial list of symbolic tokens recognized in a Carbon
 source file:

-| Symbolic Tokens and Explanation |
-| -------------------------------------------------------------------------------------------------------------- |
-| `+` Addition |
-| `-` Subtraction and negation |
-| `*` Indirection, multiplication, and forming pointer types |
-| `/` Division |
-| `%` Modulus |
-| `=` Assignment |
-| `^` Complementing and Bitwise XOR |
-| `&` Address-of and Bitwise AND |
-| `\|` Bitwise OR |
-| `<<` Arithmetic and Logical Left-shift |
-| `>>` Arithmetic and Logical Right-shift |
-| `==` Equality or equal to |
-| `!=` Inequality or not equal to |
-| `>` Greater than |
-| `>=` Greater than or equal to |
-| `<` Less than |
-| `<=` Less than or equal to |
-| `->` Return type and indirect member access |
-| `=>` Match syntax |
-| `[` Subscript and deduced parameter lists |
-| `]` Subscript and deduced parameter lists |
-| `(` Function call, function declaration and tuple literals |
-| `)` Function call, function declaration and tuple literals |
-| `{` Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
-| `}` Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
-| `,` Separate tuple and struct elements |
-| `.` Member access |
-| `:` Name bindings |
-| `:!` Generic binding |
-| `;` Statement separator |
-
-TODO: Arithmetic operators, Bitwise operators, Comparison operators & :!
-[#2657](https://github.com/carbon-language/carbon-lang/pull/2657/files#r1137826711)
+| Symbolic Tokens | Explanation |
+| --------------- | ---------------------------------------------------------------------------------------------------------- |
+| `+` | Addition |
+| `-` | Subtraction and negation |
+| `*` | Indirection, multiplication, and forming pointer types |
+| `/` | Division |
+| `%` | Modulus |
+| `=` | Assignment |
+| `^` | Complementing and Bitwise XOR |
+| `&` | Address-of and Bitwise AND |
+| `\|` | Bitwise OR |
+| `<<` | Arithmetic and Logical Left-shift |
+| `>>` | Arithmetic and Logical Right-shift |
+| `==` | Equality or equal to |
+| `!=` | Inequality or not equal to |
+| `>` | Greater than |
+| `>=` | Greater than or equal to |
+| `<` | Less than |
+| `<=` | Less than or equal to |
+| `->` | Return type and indirect member access |
+| `=>` | Match syntax |
+| `[` | Subscript and deduced parameter lists |
+| `]` | Subscript and deduced parameter lists |
+| `(` | Function call, function declaration and tuple literals |
+| `)` | Function call, function declaration and tuple literals |
+| `{` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
+| `}` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
+| `,` | Separate tuple and struct elements |
+| `.` | Member access |
+| `:` | Name bindings |
+| `:!` | Generic binding |
+| `;` | Statement separator |

 ## Alternatives considered

From 5eba8b94ae2307712285aefe8ace9e3712a702e4 Mon Sep 17 00:00:00 2001
From: Aswin Shailajan
Date: Wed, 24 May 2023 13:47:51 +0530
Subject: [PATCH 09/13] Single table row lists both delimiters

---
 docs/design/lexical_conventions/symbolic_tokens.md | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index b33824a51ccd7..ef56b0acfcf64 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -84,12 +84,9 @@ source file:
 | `<=` | Less than or equal to |
 | `->` | Return type and indirect member access |
 | `=>` | Match syntax |
-| `[` | Subscript and deduced parameter lists |
-| `]` | Subscript and deduced parameter lists |
-| `(` | Function call, function declaration and tuple literals |
-| `)` | Function call, function declaration and tuple literals |
-| `{` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
-| `}` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
+| `[` and `]` | Subscript and deduced parameter lists |
+| `(` and `)` | Function call, function declaration and tuple literals |
+| `{` and `}` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
 | `,` | Separate tuple and struct elements |
 | `.` | Member access |
 | `:` | Name bindings |

From 4d05316997c1b2d9c270dec3d2467327b85f8843 Mon Sep 17 00:00:00 2001
From: Aswin Shailajan
Date: Thu, 1 Jun 2023 06:28:21 +0530
Subject: [PATCH 10/13] Added TODO message

---
 docs/design/lexical_conventions/symbolic_tokens.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index ef56b0acfcf64..b66490c778f07 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -93,6 +93,10 @@ source file:
 | `:!` | Generic binding |
 | `;` | Statement separator |

+TODO: The assignment operators in
+[#2511](https://github.com/carbon-language/carbon-lang/pull/2511) are still to
+be added.
+
 ## Alternatives considered

 - [Proposal: p0601](/proposals/p0601.md#alternatives-considered)

From bb3f2a9f84ddfca2c843fa7e736bb22dcc293813 Mon Sep 17 00:00:00 2001
From: Aswin Shailajan
Date: Thu, 1 Jun 2023 08:37:26 +0530
Subject: [PATCH 11/13] Fixed punctuation mistakes

---
 .../lexical_conventions/symbolic_tokens.md | 58 +++++++++----------
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index b66490c778f07..c358ece81a1a5 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -63,35 +63,35 @@ overloading.
 The following is the initial list of symbolic tokens recognized in a Carbon
 source file:

-| Symbolic Tokens | Explanation |
-| --------------- | ---------------------------------------------------------------------------------------------------------- |
-| `+` | Addition |
-| `-` | Subtraction and negation |
-| `*` | Indirection, multiplication, and forming pointer types |
-| `/` | Division |
-| `%` | Modulus |
-| `=` | Assignment |
-| `^` | Complementing and Bitwise XOR |
-| `&` | Address-of and Bitwise AND |
-| `\|` | Bitwise OR |
-| `<<` | Arithmetic and Logical Left-shift |
-| `>>` | Arithmetic and Logical Right-shift |
-| `==` | Equality or equal to |
-| `!=` | Inequality or not equal to |
-| `>` | Greater than |
-| `>=` | Greater than or equal to |
-| `<` | Less than |
-| `<=` | Less than or equal to |
-| `->` | Return type and indirect member access |
-| `=>` | Match syntax |
-| `[` and `]` | Subscript and deduced parameter lists |
-| `(` and `)` | Function call, function declaration and tuple literals |
-| `{` and `}` | Struct literals, blocks of control flow statements and the bodies of definitions (classes, functions, etc) |
-| `,` | Separate tuple and struct elements |
-| `.` | Member access |
-| `:` | Name bindings |
-| `:!` | Generic binding |
-| `;` | Statement separator |
+| Symbolic Tokens | Explanation |
+| --------------- | ------------------------------------------------------------------------------------------------------------ |
+| `+` | Addition |
+| `-` | Subtraction and negation |
+| `*` | Indirection, multiplication, and forming pointer types |
+| `/` | Division |
+| `%` | Modulus |
+| `=` | Assignment |
+| `^` | Complementing and Bitwise XOR |
+| `&` | Address-of and Bitwise AND |
+| `\|` | Bitwise OR |
+| `<<` | Arithmetic and Logical Left-shift |
+| `>>` | Arithmetic and Logical Right-shift |
+| `==` | Equality or equal to |
+| `!=` | Inequality or not equal to |
+| `>` | Greater than |
+| `>=` | Greater than or equal to |
+| `<` | Less than |
+| `<=` | Less than or equal to |
+| `->` | Return type and indirect member access |
+| `=>` | Match syntax |
+| `[` and `]` | Subscript and deduced parameter lists |
+| `(` and `)` | Function call, function declaration, and tuple literals |
+| `{` and `}` | Struct literals, blocks of control flow statements, and the bodies of definitions (classes, functions, etc.) |
+| `,` | Separate tuple and struct elements |
+| `.` | Member access |
+| `:` | Name bindings |
+| `:!` | Generic binding |
+| `;` | Statement separator |

 TODO: The assignment operators in
 [#2511](https://github.com/carbon-language/carbon-lang/pull/2511) are still to

From f56561d312da0a911d14600fa7c2da3e4702499e Mon Sep 17 00:00:00 2001
From: Aswin Shailajan
Date: Thu, 1 Jun 2023 09:11:47 +0530
Subject: [PATCH 12/13] Edited the details section

---
 docs/design/lexical_conventions/symbolic_tokens.md | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index c358ece81a1a5..9acd306bf29c7 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -52,11 +52,8 @@ operator, without creating ambiguity.

 ## Details

-Symbolic tokens are intended to be used for widely-recognized operators, such as
-the mathematical operators `+`, `*`, `<`, and so on. Those used as operators
-would generally be expected to also be meaningful for some user-defined types,
-and should be candidates for being made overloadable once we support operator
-overloading.
+Symbolic tokens are used for widely-recognized operators, these should be
+meaningful and should be overloadable.

 ### Symbolic token list

From 2068acd6a5925053216f5e1d80379696db0c02b0 Mon Sep 17 00:00:00 2001
From: Aswin Shailajan
Date: Fri, 2 Jun 2023 06:39:34 +0530
Subject: [PATCH 13/13] Removed unwanted lines

---
 docs/design/lexical_conventions/symbolic_tokens.md | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/docs/design/lexical_conventions/symbolic_tokens.md b/docs/design/lexical_conventions/symbolic_tokens.md
index 9acd306bf29c7..6986eabc9747c 100644
--- a/docs/design/lexical_conventions/symbolic_tokens.md
+++ b/docs/design/lexical_conventions/symbolic_tokens.md
@@ -52,9 +52,6 @@ operator, without creating ambiguity.

 ## Details

-Symbolic tokens are used for widely-recognized operators, these should be
-meaningful and should be overloadable.
-
 ### Symbolic token list

 The following is the initial list of symbolic tokens recognized in a Carbon