Lookahead could be useful #29

Closed
osa1 opened this issue Oct 26, 2021 · 8 comments · Fixed by #37
Labels: design, feature (New feature or request)

Comments


osa1 commented Oct 26, 2021

Suppose I'm writing a Rust lexer. Naively, lexing integer literals could be implemented like this:

$dec_digit ($dec_digit | '_')* $int_suffix? => ...,

"0b" ($bin_digit | '_')* $bin_digit ($bin_digit | '_')* $int_suffix? => ...,

"0o" ($oct_digit | '_')* $oct_digit ($oct_digit | '_')* $int_suffix? => ...,

"0x" ($hex_digit | '_')* $hex_digit ($hex_digit | '_')* $int_suffix? => ...,

The problem with these rules is that they generate multiple tokens for invalid literals like 0invalidSuffix or 123AFB43: for example, 123AFB43 lexes as the decimal literal 123 followed by separate tokens for AFB43, instead of being rejected as a whole.

Currently the way to fix this is by introducing new rules for each of these literals:

rule Init {
    ...

    $dec_digit => |lexer| lexer.switch(LexerRule::DecInt),

    "0b" => |lexer| lexer.switch(LexerRule::BinInt),

    "0o" => |lexer| lexer.switch(LexerRule::OctInt),

    "0x" => |lexer| lexer.switch(LexerRule::HexInt),
}

rule DecInt {
    // Consume the rest of the literal without returning a token
    ($dec_digit | '_')* $int_suffix?,

    $ => ... return token ...,

    $whitespace => ... return token ...,
}

rule BinInt { ... }

rule OctInt { ... }

rule HexInt { ... }

With these rules, when we see the start of an integer literal we consume every character until the end of the input or a whitespace character; any character that is invalid for the literal then becomes an error.

If we had a regex for "followed by", we could use the rules in the first version, with a "followed by whitespace or end-of-input" part:

$dec_digit ($dec_digit | '_')* $int_suffix? > ($whitespace | $) => ...,

where > is the "followed by" syntax.

The difference from concatenation is that the "followed by" part is not included in the match.

osa1 added the feature and design labels on Oct 26, 2021

osa1 commented Oct 26, 2021

Alternatively, if we implement #9 we could simply bind the integer part to a variable and use it in the semantic action. Something like:

int:($dec_digit ($dec_digit | '_')* $int_suffix?) ($whitespace | $) => ... use variable `int` ...,


osa1 commented Nov 24, 2021

I think the "followed by" syntax should actually be called "lookahead", and it shouldn't consume the part of the input that the lookahead matches (i.e. the iterator should backtrack after the match).

As an example, rustc's lexer does not include the terminating newline in a single-line comment token. Doing the same is currently a bit verbose in lexgen:

rule SinglelineComment {
    _ # '\n' => |lexer| {
        if lexer.peek() == Some('\n') {
            let comment = lexer.match_();
            lexer.switch_and_return(LexerRule::Init, Token::Comment(comment))
        } else {
            lexer.continue_()
        }
    },

    $ => |lexer| {
        let comment = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Comment(comment))
    },
}

With the lookahead syntax suggested above, it would look like:

rule SinglelineComment {
    (_ # '\n')* > ('\n' | $) => |lexer| {
        let comment = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Comment(comment))
    }
}

After this simplification a separate rule set is no longer necessary; we can inline the rule at its use site:

rule Init {
    ...

    "//" (_ # '\n')* > ('\n' | $) => |lexer| {
        let comment = lexer.match_();
        lexer.return_(Token::Comment(comment))
    },
}


osa1 commented Nov 24, 2021

Another use of lookahead would be when lexing Rust floats.

In Rust, 1. is a float, but in 1.. the 1 is an integer followed by .. (the range operator, as in std::ops::Range, std::ops::RangeFrom, etc.). The reference describes this syntax as:

DEC_LITERAL . (not immediately followed by ., _, or an identifier)

The "not immediately followed by" part could be implemented using the lookahead operator:

rule Init {
    ...
    $dec_literal '.' > (_ # ('.' | '_' | $$XID_Start) | $) => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::Float(match_)))
    },
}


osa1 commented Nov 25, 2021

This feature doesn't seem to be "zero cost": if we implement it, everyone will pay a runtime price, even users who don't use the feature.

Here's the problem. Suppose I'm matching these two rules:

("a" > "aa") "aab" => ...,
"aaac" => ...,

The first rule matches "aaab", the second rule matches "aaac".

Now let's imagine how an NFA for these rules would work. After scanning the prefix "aaa", the first rule needs to backtrack, but the second rule needs to continue. So the state machine has to maintain a separate iterator for each rule: the first rule's iterator is reverted, while the second rule's keeps advancing.

This is already quite complicated, and I don't think we can avoid "forking" the iterator in the generated code. But there's more: because the rule that comes first has priority, in this lexer we want to run the semantic action for the first rule. Yet if we keep running these forked state machines concurrently, one character at a time, the second rule will reach a decision first: it will see the "b" and accept the input while the state machine for the first rule is still looking at the second "a" (because we reverted its iterator).

So when the user calls next(), we will have a set of iterators. We find the one that is behind all the others and run it until it catches up with the rest; only when all iterators are at the same location in the input do we make progress in all state machines at once. We also need to assign a priority to the states (e.g. number the rules, so that a rule that comes later in the code gets a larger number and thus lower priority), so that if multiple machines accept the input at the same time we can pick the one with the highest priority, i.e. the one that comes first in the lexer definition.
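To make that scheduling concrete, here's a rough sketch of the driver loop it implies. Everything below (Fork, advance, is_accepting, round) is hypothetical pseudocode for the idea, not lexgen's generated code:

struct Fork {
    state: usize,  // current state of this rule's machine
    pos: usize,    // this fork's own position in the input
    priority: u32, // rule index: smaller = earlier in the lexer definition
}

// Step one fork. A fork that just finished a lookahead may move `pos`
// backwards (reverting its iterator); other forks move it forwards.
fn advance(fork: &mut Fork, input: &[char]) {
    let _ = (fork, input); // ... transition on input[fork.pos] ...
}

fn is_accepting(state: usize) -> bool {
    let _ = state; // ... check the accepting set ...
    false
}

// One scheduling round: only the forks that are furthest behind get to step,
// so accepting states are discovered in input order. Among forks that accept
// at the same time, the smallest priority number (earliest rule) wins.
fn round(forks: &mut [Fork], input: &[char]) -> Option<u32> {
    let min_pos = forks.iter().map(|f| f.pos).min()?;
    for fork in forks.iter_mut().filter(|f| f.pos == min_pos) {
        advance(fork, input);
    }
    forks
        .iter()
        .filter(|f| is_accepting(f.state))
        .min_by_key(|f| f.priority)
        .map(|f| f.priority) // index of the rule whose action should run
}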

It's complicated to implement, and it will be costly to everyone, even to users who don't use this feature.

osa1 changed the title from 'Regex for "followed by" could be useful' to 'Lookahead could be useful' on Nov 25, 2021

osa1 commented Nov 25, 2021

There could be other implementations that do not fit into the NFA/DFA model.

For example, in the example above, when we see the first "a" we could immediately run a separate state machine for the lookahead part. We then move to the states for the rest of both rules, or only the second rule's, depending on the result of that machine.

Another idea could be implementing a VM with a "split" or "fork" instruction.
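For illustration, a Pike-style VM with such a fork instruction might use an instruction set like the following. This is a hypothetical sketch; lexgen has no VM backend:

#[allow(dead_code)]
enum Inst {
    Char(char),          // match one character and advance the input
    Split(usize, usize), // fork: continue execution at both addresses
    Save,                // remember the current input position
    Restore,             // rewind the input to the saved position (ends a lookahead)
    Accept(usize),       // match succeeded; run the semantic action with this index
}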

In both of these cases the worst-case time will be bad, though: with multiple lookaheads we run each of them separately, visiting the rest of the input again and again, once per lookahead.


osa1 commented Dec 2, 2021

alex provides this feature under the name "right context"; see https://www.haskell.org/alex/doc/html/alex-files.html, section 3.2.2.1.


osa1 commented Dec 9, 2021

As far as I understand, a "right context" is a simpler version of lookahead. Right contexts are not part of the regex syntax, so they cannot be nested within regexes; they're more like guards in pattern matching. Just as each alternative in a pattern match can have one guard, each rule can have one right context. When a rule matches, its right context is matched before the rule is considered successful; if the right context doesn't match, neither does the rule.

I think the implementation is straightforward. Compile all right contexts to separate NFAs/DFAs. Add an optional right context (a reference to that machine's initial state) to accepting states. When we're in an accepting state, before updating last_match, clone the char iterator and run the machine for the right context. Update last_match only when the right context matches.
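A minimal sketch of that accepting-state check, assuming a hypothetical run_right_ctx helper that simulates the right context's machine; this illustrates the idea, it is not lexgen's actual code:

struct Accepting {
    action: usize,            // semantic action index
    right_ctx: Option<usize>, // initial state of the right context's machine, if any
}

// Called when the simulation reaches an accepting state at position `pos`.
fn try_accept<I>(acc: &Accepting, chars: &I, pos: usize, last_match: &mut Option<(usize, usize)>)
where
    I: Iterator<Item = char> + Clone,
{
    let ctx_matches = match acc.right_ctx {
        None => true,
        // Clone the iterator: matching the right context must not consume input.
        Some(init) => run_right_ctx(init, chars.clone()),
    };
    if ctx_matches {
        // Only now does this rule become the current match.
        *last_match = Some((pos, acc.action));
    }
}

fn run_right_ctx<I: Iterator<Item = char>>(init: usize, chars: I) -> bool {
    let _ = (init, chars); // ... simulate the right context's DFA on the rest of the input ...
    true
}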


osa1 commented Dec 10, 2021

I started implementing right contexts in #37. NFA simulation is straightforward, but DFA simulation requires some changes in DFA accepting states. Suppose I have these rules:

'a' > 'a' = 1,
'a' > 'b' = 2,
'a' > 'c' = 3,
...
'a' = 27,

(> is the right context syntax)

In the NFA we will have 27 epsilon transitions, one to each of these rules. The first 26 rules have a right context; the last one doesn't.

When the input is "a", we try each of the 26 right contexts (all of which fail), and finally the last rule, which has no right context and therefore matches.

In the DFA, without right contexts, an accepting state simply updates last_match and that's it. But with right contexts, when multiple right contexts follow the same prefix, we need to try each of them until one matches. This means generalizing DFA accepting states a little, to store a list of (right context, semantic action index) pairs instead of a single semantic action index.
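In other words, the accepting-state payload changes shape roughly as follows (hypothetical names, not the actual lexgen types):

type RightCtxIdx = usize;

// Before: an accepting state carried a single semantic action index.
struct OldAccepting {
    action: usize,
}

// After: an ordered list of candidates, tried in order; the first right
// context that matches decides the action. `None` means "no right context"
// and always succeeds, like rule 27 ('a' = 27) above.
struct NewAccepting {
    candidates: Vec<(Option<RightCtxIdx>, usize)>,
}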


One design question here is what to do when we have something like

'a' = 1,
'a' > 'a' = 2,

and the input is "aa". We have two options:

  1. Both rules yield the same match (the first "a"; the right context is not part of the match), and the first rule has priority, so the semantic actions run over the input are [1, 1].
  2. The second rule scans a longer substring (its match plus right context covers "aa"), so per the "longest match" rule the actions run are [2, 1].

I feel like (2) will be more intuitive.

osa1 closed this as completed in #37 on Jan 31, 2022