Improper parsing of symbols #29

Open
2 of 6 tasks
Hirevo opened this issue Jan 25, 2023 · 3 comments
Assignees
Labels
C-bug Category: Bugs M-lexer Module: Lexer M-parser Module: Parser P-medium Priority: Medium

Comments

@Hirevo
Owner

Hirevo commented Jan 25, 2023

Both interpreters in som-rs currently parse symbols in a somewhat non-standard way compared to other SOMs.

This issue tracks the cases where som-rs behaves differently from other SOMs, so that they can all be fixed.

Here are the problematic cases that I am currently aware of:

  • Spaces between # and identifier (ex: # foo, accepted by most SOMs, rejected by som-rs)
  • Spaces between # and operator (ex: # +, accepted by most SOMs, rejected by som-rs)
  • Spaces between # and string literal (ex: # 'foo', accepted by most SOMs, rejected by som-rs)
  • Non-leading successive colons in selector (ex: #foo::, rejected by most SOMs, accepted by som-rs)
  • Leading digits after colons (ex: #foo:2:, rejected by most SOMs, accepted by som-rs)

Array literals are somewhat related to this issue: they suffer from a similar problem because their syntax also uses #:

  • Spaces between # and ( (ex: # (1 2 3), accepted by most SOMs, rejected by som-rs)

Most of these issues arise because the lexer currently tokenizes the whole symbol at once (as: Token::Symbol(String)) instead of simply emitting its fragments (something like: [Token::Pound, Token::Selector(String)]).
Delegating the construction of the symbol to the parser would likely be the way forward to address these problems.
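
To illustrate the idea, here is a minimal, self-contained sketch of a lexer that emits fragments and a parser that assembles the symbol. It is not the som-rs code: the token variants (`Identifier`, `Keyword`, `Operator`, `Str`, `Integer`, `Colon`), the `OPERATOR_CHARS` set, and the functions `lex`, `parse_symbol`, and `read_symbol` are all hypothetical names chosen for this example, and it splits the `Token::Selector` suggested above into finer-grained variants. Array literals are out of scope here, and `parse_symbol` assumes the token slice holds exactly one symbol literal.

```rust
// Hypothetical token set for this sketch; not the actual som-rs `Token` type.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Pound,              // `#`
    Identifier(String), // `foo`
    Keyword(String),    // `foo:` (identifier immediately followed by a colon)
    Operator(String),   // `+`, `>=`, ...
    Str(String),        // `'foo'` (contents without the quotes)
    Integer(String),    // `2`
    Colon,              // a colon that is not part of a keyword
}

// Assumed set of operator characters, for illustration only.
const OPERATOR_CHARS: &str = "~&|*/\\+=><,@%-";

// Splits the input into fragments. Whitespace is discarded, so `# foo`
// and `#foo` produce the same token sequence.
fn lex(input: &str) -> Result<Vec<Token>, String> {
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c == '#' {
            tokens.push(Token::Pound);
            i += 1;
        } else if c == ':' {
            tokens.push(Token::Colon);
            i += 1;
        } else if c == '\'' {
            let start = i + 1;
            let mut end = start;
            while end < chars.len() && chars[end] != '\'' {
                end += 1;
            }
            if end == chars.len() {
                return Err("unterminated string literal".to_string());
            }
            tokens.push(Token::Str(chars[start..end].iter().collect()));
            i = end + 1;
        } else if c.is_ascii_digit() {
            let start = i;
            while i < chars.len() && chars[i].is_ascii_digit() {
                i += 1;
            }
            tokens.push(Token::Integer(chars[start..i].iter().collect()));
        } else if c.is_alphabetic() || c == '_' {
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let mut name: String = chars[start..i].iter().collect();
            if i < chars.len() && chars[i] == ':' {
                name.push(':');
                i += 1;
                tokens.push(Token::Keyword(name));
            } else {
                tokens.push(Token::Identifier(name));
            }
        } else if OPERATOR_CHARS.contains(c) {
            let start = i;
            while i < chars.len() && OPERATOR_CHARS.contains(chars[i]) {
                i += 1;
            }
            tokens.push(Token::Operator(chars[start..i].iter().collect()));
        } else {
            return Err(format!("unexpected character {:?}", c));
        }
    }
    Ok(tokens)
}

// Builds the symbol in the parser: `#` followed by an identifier, an operator,
// a string, or one or more keywords. Any leftover token is an error.
fn parse_symbol(tokens: &[Token]) -> Result<String, String> {
    let (first, rest) = match tokens {
        [Token::Pound, first, rest @ ..] => (first, rest),
        _ => return Err("expected `#` followed by a symbol body".to_string()),
    };
    let (symbol, rest) = match first {
        Token::Identifier(s) | Token::Operator(s) | Token::Str(s) => (s.clone(), rest),
        Token::Keyword(k) => {
            let mut symbol = k.clone();
            let mut rest = rest;
            while let [Token::Keyword(k), tail @ ..] = rest {
                symbol.push_str(k);
                rest = tail;
            }
            (symbol, rest)
        }
        other => return Err(format!("unexpected token after `#`: {:?}", other)),
    };
    if rest.is_empty() {
        Ok(symbol)
    } else {
        Err(format!("unexpected trailing token: {:?}", rest[0]))
    }
}

// Convenience wrapper: lex, then parse a single symbol literal.
fn read_symbol(source: &str) -> Result<String, String> {
    lex(source).and_then(|tokens| parse_symbol(&tokens))
}
```

With this split, `read_symbol("# foo")`, `read_symbol("# +")`, and `read_symbol("# 'foo'")` all succeed because the whitespace never reaches the parser, while `read_symbol("#foo::")` and `read_symbol("#foo:2:")` fail: the second colon and the digit become separate `Colon` and `Integer` tokens that cannot continue a keyword selector.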

@Hirevo Hirevo added C-bug Category: Bugs M-lexer Module: Lexer M-parser Module: Parser P-medium Priority: Medium labels Jan 25, 2023
@Hirevo Hirevo self-assigned this Jan 25, 2023
@smarr
Contributor

smarr commented Jan 26, 2023

Hmmm. Interesting. I am not sure how I feel about these things.

I think we need more tests :)
Especially the situation around spaces is a little odd and an artifact of having a separate lexer in most SOM implementations. The lexer simply discards the space.
But the Smalltalk grammar (ANSI Smalltalk) doesn't explicitly mention spaces; instead, it says that a quoted string is to be immediately preceded by a pound sign.

Squeak has the same behavior as SOM and allows spaces, but it really looks odd to me. Pharo seems to have fixed it by disallowing spaces between # and the rest of the symbol.

# foo just doesn't look right. The # could be misread as an operator here, for instance in something like 54 # bar, which should be a parse error, because 54 #bar is not a valid expression.

@smarr
Contributor

smarr commented Jan 26, 2023

Hmm. I think the biggest problem with this at the moment is that we don't have a cross-SOM mechanism to test for parser errors.

@Hirevo
Owner Author

Hirevo commented Sep 18, 2024

Sorry about the notifications. I was looking at this issue on my phone and forgot to lock the screen before putting it back in my pocket, so some unintentional inputs got registered and inadvertently posted some comments.
