I wanted to write a lexer which matches one of my defined tokens, and otherwise matches a catchall token, while still having a relatively clean codebase. I have currently solved this by doing:

```rust
fn lex_delimiter<'s>(input: &mut &'s str) -> PResult<Token<'s>> {
    let checkpoint = input.checkpoint();
    // Consume characters until the next non-delimiter token (or eof) matches;
    // `full_match` is everything consumed, including that next token's text.
    let (next_token, full_match) =
        repeat_till(1.., any, alt((lex_non_delimiter.map(|t| t.into()), eof)))
            .map(|((), next)| next)
            .with_recognized()
            .parse_next(&mut *input)?;
    // Strip the next token's text off the end to get just the delimiter text,
    let text = &full_match[0..(full_match.len() - next_token.len())];
    // then rewind and re-advance so `input` points at that next token again.
    input.reset(&checkpoint);
    *input = &input[text.len()..];
    Ok(Token::Delimiter(Delimiter { text }))
}
```

The obvious problem is that I'm throwing away the 'till' part of the `repeat_till`, so the next token ends up being parsed twice.

Other potentially relevant parts of the (very simple) code:

```rust
fn lex_non_delimiter<'s>(input: &mut &'s str) -> PResult<Token<'s>> {
    alt((lex_word, lex_tag, lex_num)).parse_next(input)
}

fn lex_text<'s>(input: &mut &'s str) -> PResult<Vec<Token<'s>>> {
    trace("test", repeat(1.., alt((lex_non_delimiter, lex_delimiter)))).parse_next(input)
}
```

(I'll probably move away from using the delimiter token as my catchall, but even then I'll still need an unknown token or something like that, because I must be able to successfully parse any input.)
Replies: 1 comment 2 replies
Looks like there are two points of complication:

- You `recognize`d the `terminator` but don't want to include it in the recognized text
- You want parsing to resume at the `terminator`, rather than after it

You could `peek` your `terminator`. It will match but not advance `input`. This means you will `recognize` only the delimiter parse and allow parsing to pick back up at the `terminator`.
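As a rough, untested sketch of that idea (assuming the question's `Token`, `Delimiter`, and `lex_non_delimiter`, and the same winnow version as the question; `recognize` may be spelled `take` in newer releases):

```rust
use winnow::combinator::{alt, eof, peek, repeat_till};
use winnow::token::any;
use winnow::{PResult, Parser};

fn lex_delimiter<'s>(input: &mut &'s str) -> PResult<Token<'s>> {
    repeat_till(
        1..,
        any,
        // The terminator is only peeked: it must match for the repetition to
        // stop, but it is not consumed, so it is not part of the recognized
        // slice and the caller will parse it again normally.
        peek(alt((lex_non_delimiter.void(), eof.void()))),
    )
    // Pin the repeat accumulator to `()`; only the recognized slice matters here.
    .map(|((), ())| ())
    .recognize()
    .map(|text| Token::Delimiter(Delimiter { text }))
    .parse_next(input)
}
```

With the terminator no longer consumed, the checkpoint/reset dance and the manual slicing of `full_match` both go away.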
This will clean up the code but I'm unsure if it will affect performance.

Depending on where you land between performance and clean code, `repeat_till` (especially a per-`char` `parse`) and all of those `alt`s will kill your performance (along with parsing `&str` instead of `&[u8]`). You could use `dispatch!` as a first pass between all of your token types.
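For illustration only, a `dispatch!` first pass over the leading character might look something like this; the match arms are invented here, since the real character classes depend on what `lex_word`, `lex_tag`, and `lex_num` accept:

```rust
use winnow::combinator::{dispatch, peek};
use winnow::token::any;
use winnow::{PResult, Parser};

fn lex_token<'s>(input: &mut &'s str) -> PResult<Token<'s>> {
    // Peek at the first character and jump straight to the right sub-lexer
    // instead of trying each alternative in turn.
    dispatch! {peek(any);
        '0'..='9' => lex_num,   // assumption: numbers start with a digit
        '#' => lex_tag,         // assumption: tags start with '#'
        'a'..='z' => lex_word,  // assumption: words start with a lowercase letter
        _ => lex_delimiter,     // everything else falls through to the catchall
    }
    .parse_next(input)
}
```

`lex_text` could then become `repeat(1.., lex_token)`, replacing the nested `alt`s with a single match on the leading character.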