Marking a stream as complete post-parsing #510

beneyal · 2024-04-20T19:09:22Z


Hi again!

I'm facing a problem I'm not sure how to solve with winnow. In libraries like Haskell's attoparsec and Scala's atto, I can take an existing parsing result, and if it is partial, I can feed it more data, thus getting a new result (success, partial, or failure), ad infinitum.

Here's an example of the language I'm parsing:

#1 = Scan Table [ stadium ] Output [ Stadium_ID , Capacity , Name ] ;
#2 = Scan Table [ concert ] Predicate [ Year >= 2014 ] Output [ Stadium_ID , Year ] ;
#3 = Aggregate [ #2 ] GroupBy [ Stadium_ID ] Output [ Stadium_ID , countstar AS Count_Star ] ;
#4 = Join [ #1 , #3 ] Predicate [ #3.Stadium_ID = #1.Stadium_ID ] Output [ #1.Name , #3.Count_Star , #1.Capacity ] ;
#5 = TopSort [ #4 ] Rows [ 1 ] OrderBy [ Count_Star DESC ] Output [ Capacity , Count_Star , Name ]

I split the segments into lines for readability, but they come as one long string with " ; " being a separator.

The input comes from a pre-trained language model, token-by-token, a token being any string.

As the input doesn't have a clear end (I could always add more segments), my parsing result is always Incomplete.

If I mark the stream with .complete(), then I get the parsing result as expected, so I know the parsers are working correctly.

My problem is that I don't know how to say in winnow: "Assume the input you have parsed so far is complete. Is the result a success or a failure?"

I tried using .complete_err(), but since my parser always returns Incomplete, I always get back an error, even when the input parsed so far makes a valid result.

Is there any way to achieve what I'm trying to do?

I'm not sure I managed to explain the problem well enough, so feel free to ask clarifying questions 😃

Thanks! 🙏

P.S.
It took a bit of trial and error and some lifetime annotations, but I must say that working with winnow is pretty fun. And the debug feature flag is a life-saver!

epage · 2024-04-22T15:16:13Z

Maintainer

I would recommend processing one data frame at a time, rather than trying to process all the frames together, and incrementally re-parsing only what's needed as more data is added.

This makes it so Incomplete only means "this frame is incomplete" rather than dealing with "the parse result is incomplete" and trying to differentiate an incomplete data frame from that.

You could go a step further. The parsers behave slightly differently when operating on potentially partial data, and this can lead to surprising results. You have an unambiguous end-of-frame token, ;, so you could do take_until(" ; ").and_then(frame), so that end-of-frame detection is the only part that deals with partial parsing and the frame itself is parsed from complete input. If possible, I'd change that to take_until(';'), which will be faster because the search looks for a more distinctive token (; rather than a space). If you run cargo add winnow -F simd, this search will also be accelerated.
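As a rough illustration of the framing idea, here is a minimal sketch that uses only the Rust standard library, not winnow's API. The names Frame, parse_frame, and FrameReader are hypothetical, and parse_frame is only a stand-in for the real winnow frame parser run on complete input. It assumes ; never appears inside a frame. Tokens are buffered, every ;-terminated frame is parsed as complete input, and finish() answers the "assume what I have so far is complete" question by parsing the unterminated tail as the final frame.

```rust
/// One parsed segment such as `#1 = Scan Table [ stadium ] ...`.
/// Hypothetical type for illustration; a real parser would build a proper AST.
#[derive(Debug, PartialEq)]
struct Frame {
    id: u32,
    body: String,
}

/// Stand-in for the real frame parser. It only ever sees complete input,
/// so it never has to report "incomplete".
fn parse_frame(text: &str) -> Result<Frame, String> {
    let text = text.trim();
    let rest = text
        .strip_prefix('#')
        .ok_or_else(|| format!("expected '#' in {text:?}"))?;
    let (id, body) = rest
        .split_once('=')
        .ok_or_else(|| format!("expected '=' in {text:?}"))?;
    let id = id
        .trim()
        .parse::<u32>()
        .map_err(|e| format!("bad id in {text:?}: {e}"))?;
    let body = body.trim();
    if body.is_empty() {
        return Err(format!("empty body in {text:?}"));
    }
    Ok(Frame { id, body: body.to_string() })
}

/// Buffers streamed tokens and splits them into `;`-terminated frames.
#[derive(Default)]
struct FrameReader {
    buf: String,
}

impl FrameReader {
    /// Append one token and return the result for every frame it completed.
    fn push(&mut self, token: &str) -> Vec<Result<Frame, String>> {
        self.buf.push_str(token);
        let mut out = Vec::new();
        // Only the search for `;` deals with partial data.
        while let Some(end) = self.buf.find(';') {
            let frame: String = self.buf.drain(..=end).collect();
            out.push(parse_frame(frame.trim_end_matches(';')));
        }
        out
    }

    /// Treat the input so far as complete: parse the unterminated tail,
    /// if any, as the final frame.
    fn finish(self) -> Option<Result<Frame, String>> {
        let tail = self.buf.trim();
        if tail.is_empty() {
            None
        } else {
            Some(parse_frame(tail))
        }
    }
}
```

With this shape, a token that ends mid-frame simply yields no results from push, and a success or failure is only ever reported for a whole frame. Calling finish() at any point gives the verdict for the trailing, separator-less segment.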
