Lookahead could be useful #29
Comments
Alternatively, if we implement #9 we could simply bind the integer part to a variable and use it in the semantic action. Something like:

```
int:($dec_digit ($dec_digit | '_')* $int_suffix?) ($whitespace | $) => ... use variable `int` ...,
```
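A minimal sketch of how the semantic action might use the binding, assuming the hypothetical `name:(regex)` syntax from #9 and a hypothetical `Token::Int` variant:

```
int:($dec_digit ($dec_digit | '_')* $int_suffix?) ($whitespace | $) => |lexer| {
    // `int` would hold only the bound sub-match, so the trailing whitespace
    // is consumed but does not end up in the token.
    lexer.return_(Token::Int(int))
},
```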
---

I think the "followed by" syntax should actually be called "lookahead", and it shouldn't consume the matched part of the input (i.e. it should backtrack after the match). As an example, rustc's lexer does not include the newline at the end of a single-line comment. Doing this is currently a bit verbose in lexgen:

```
rule SinglelineComment {
    _ # '\n' => |lexer| {
        if lexer.peek() == Some('\n') {
            let comment = lexer.match_();
            lexer.switch_and_return(LexerRule::Init, Token::Comment(comment))
        } else {
            lexer.continue_()
        }
    },

    $ => |lexer| {
        let comment = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Comment(comment))
    },
}
```

With the lookahead syntax suggested above, it would look like:

```
rule SinglelineComment {
    (_ # '\n')* > ('\n' | $) => |lexer| {
        let comment = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Comment(comment))
    }
}
```

After this simplification a separate rule set is no longer necessary, and we can inline the rule at its use site:

```
rule Init {
    ...

    "//" (_ # '\n')* > ('\n' | $) => |lexer| {
        let comment = lexer.match_();
        lexer.return_(Token::Comment(comment))
    },
}
```
---

Another use of lookahead would be when lexing Rust floats. In Rust, a decimal literal followed by `.` is a float literal only when the `.` is not immediately followed by another `.`, a `_`, or an identifier-start character; this is why `1..2` lexes as an integer, a range operator, and another integer rather than starting with the float `1.`. The "not immediately followed by" part could be implemented using the lookahead operator:

```
rule Init {
    ...

    $dec_literal '.' > (_ # ('.' | '_' | $$XID_Start) | $) => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::Float(match_)))
    },
}
```
---

This feature doesn't seem to be "zero cost", in the sense that if we implement it, everyone will pay the price at runtime, even if they don't use the feature. Here's the problem. Suppose I'm matching these two rules:

```
("a" > "aa") "aab" => ...,
"aaac" => ...,
```

The first rule matches "aaab", the second rule matches "aaac". Now let's imagine how an NFA for these rules would work. After scanning the prefix "aaa", for the first rule we need to backtrack (the lookahead part must not be consumed), but for the second rule we need to continue. So the state machine needs to maintain separate iterators for these rules: for the first rule the iterator will be reverted, for the second rule it will be incremented. This is already too complicated, and I don't think we can avoid "forking" the iterator to implement this in the generated code.

But there's more. Because the rule that comes first has priority, in this lexer we want to run the semantic action of the first rule. But if we keep running these "forked" state machines concurrently, one character at a time, the second rule will reach its decision first: it will see the fourth character and accept or reject the input while the state machine for the first rule is still on the second "a" (because we reverted its iterator). So when the user calls `next`, we may run the second rule's semantic action even though the first rule should win.

It's complicated to implement, and it will be costly for everyone, even users who don't use this feature.
---

There could be other implementations that do not fit into the NFA/DFA model. For example, in the example above, when we see the first "a", we could run a new state machine to match the lookahead part, and then move to the states for the rest of both rules, or just for the second rule, depending on the result of that state machine. Another idea could be implementing a VM with a "split" or "fork" instruction. In both of these cases the worst-case time will be bad, though: when we have multiple lookaheads, we will run each of them separately, visiting the rest of the input again and again, once for each of them.
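To illustrate the second idea, here is a toy backtracking VM with a "split" instruction, a sketch in the spirit of Pike-style regex VMs rather than lexgen's design; the recursion at `Split` is exactly where the repeated scanning comes from:

```rust
// Toy regex VM. `Split` forks execution; in the worst case both branches
// re-scan the same suffix of the input, which is the cost discussed above.
enum Inst {
    Char(char),          // consume one specific character
    Split(usize, usize), // try both target program counters
    Match,               // accept
}

fn run(prog: &[Inst], input: &[char], pc: usize, sp: usize) -> bool {
    match prog[pc] {
        Inst::Char(c) => sp < input.len() && input[sp] == c && run(prog, input, pc + 1, sp + 1),
        Inst::Split(a, b) => run(prog, input, a, sp) || run(prog, input, b, sp),
        Inst::Match => true,
    }
}

fn main() {
    // Program for the regex "a(a|b)".
    let prog = [
        Inst::Char('a'),
        Inst::Split(2, 4),
        Inst::Char('a'),
        Inst::Match,
        Inst::Char('b'),
        Inst::Match,
    ];
    let input: Vec<char> = "ab".chars().collect();
    assert!(run(&prog, &input, 0, 0));
}
```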
---

alex provides this feature, with the name "right context". See https://www.haskell.org/alex/doc/html/alex-files.html, section 3.2.2.1.
---

As far as I understand, "right context" is a simpler version of lookahead. Right contexts are not part of the regex syntax, so they cannot be nested within regexes. They're more like guards in pattern matching: just as each alternative in pattern matching can have one guard, each rule can have one right context. When the rule matches, the right context is matched before the rule is considered successful; if the right context doesn't match, then the rule is not successful. I think the implementation is straightforward: compile all right contexts to separate NFAs/DFAs, and add an optional right context (a reference to that state machine's initial state) to accepting states. When we're in an accepting state, before updating the current match, we run the right context's state machine on the rest of the input and only accept if it matches.
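A sketch of that accepting-state check, with hypothetical names (`Accepting`, `simulate`) standing in for whatever the implementation actually uses:

```rust
// Hypothetical accepting-state shape; the real types differ.
struct Accepting {
    right_context: Option<usize>, // start state of the right context machine, if any
}

// `simulate(start_state, rest_of_input)` stands in for running the compiled
// right context NFA/DFA on the unconsumed input.
fn can_accept(
    acc: &Accepting,
    rest_of_input: &str,
    simulate: impl Fn(usize, &str) -> bool,
) -> bool {
    match acc.right_context {
        None => true, // no right context: accept as before
        Some(start) => simulate(start, rest_of_input),
    }
}

fn main() {
    let acc = Accepting { right_context: Some(7) };
    // Example: a right context that requires the next character to be '\n'.
    let nl = |_start: usize, rest: &str| rest.starts_with('\n');
    assert!(can_accept(&acc, "\nmore input", nl));
    assert!(!can_accept(&acc, "x", nl));
}
```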
---

I started implementing right contexts in #37. NFA simulation is straightforward, but DFA simulation requires some changes in DFA accepting states. Suppose I have these rules: […]

In the NFA we will have 27 epsilon transitions, one to each of these rules. The first 26 rules will have a right context; the last rule won't. When the input is […]. In the DFA, without right contexts, in an accepting state we simply update […].

One design question here is what to do when we have something like […] and the input is […]. I feel like (2) will be more intuitive.
---

Suppose I'm writing a Rust lexer. Naively, lexing integer literals could be implemented like this:
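(The original snippet is not preserved in this copy of the issue; based on the regex quoted in the first comment, the naive rule presumably looked something like this, with a hypothetical `Token::Int` variant:)

```
$dec_digit ($dec_digit | '_')* $int_suffix? => |lexer| {
    // Return everything matched so far as an integer token.
    let match_ = lexer.match_();
    lexer.return_(Token::Int(match_))
},
```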
The problem with these rules is that they generate multiple tokens for invalid literals like `0invalidSuffix` or `123AFB43`. Currently the way to fix this is by introducing new rules for each of these literals:
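(The original rules are also missing here; the following is a rough sketch of the approach described in the next paragraph, with hypothetical rule names, a hypothetical `token_for` helper, and only the decimal case shown:)

```
rule Init {
    ...

    // On the first character of a decimal literal, switch to a rule set that
    // consumes everything up to whitespace or end of input.
    $dec_digit => |lexer| lexer.switch(LexerRule::DecInt),
}

rule DecInt {
    // Consume any non-whitespace character; decide at the boundary whether
    // the accumulated match is a valid literal or an error.
    (_ # $whitespace)* ($whitespace | $) => |lexer| {
        let match_ = lexer.match_();
        // Validate `match_` here: return Token::Int for a valid literal, or
        // an error token otherwise. `token_for` is a hypothetical helper.
        lexer.return_(token_for(match_))
    },
}
```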
What these rules do is: when we see a character that starts an integer literal, we consume every character until the end of the input or until we see whitespace. Seeing an invalid character for the literal then becomes an error.
If we had a regex for "followed by", we could use the rules in the first version, with a "followed by whitespace or end-of-input" part, where `>` is the "followed by" syntax. The difference from concatenation is that the "followed by" part should not be included in the match.
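A plausible shape for such a rule, adapting the naive sketch above (again a sketch with a hypothetical `Token::Int`, not the author's original snippet):

```
$dec_digit ($dec_digit | '_')* $int_suffix? > ($whitespace | $) => |lexer| {
    // The lookahead ensures the literal is properly delimited, but the
    // whitespace or end-of-input is not part of the returned match.
    let match_ = lexer.match_();
    lexer.return_(Token::Int(match_))
},
```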