Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement packrat parsing #75

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

poteat
Copy link

@poteat poteat commented Aug 10, 2021

Packrat Parsing Implementation

As discussed in #74, this pull request implements packrat parsing using a internally specified set of memoization functions in ./src/cache.ts.

1. Checklist

  • This PR only introduces changes to documentation.
    • Please include a summary of changes and an explanation
  • This PR adds new functionality
    • Is there a related issue?
      • Please note the related issue below, and how this PR relates to the issue if appropriate
    • Does the code style reasonably match the existing code?
    • Are the changes tested (using the existing format, as far as is possible?)
    • Are the changes documented in the readme with a suitable example?
    • Is the table of contents updated?
  • This PR introduces some other kind of change
    • Please explain the change below

2. Discussion

Packrat parsing [1] is a technique for enabling linear-time backtracking in top-down parsers. A common issue with recursive descent parsing is exponential time blowup, whereby the time it takes to parse a string scales exponentially with respect to input length.

Our implementation here using a memoization wrapper around the incremental state update function for each parser. The full, global cache state is cleared on every run.

[1] Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking (PDF). Bryan Ford, 2002, MIT

2.1. Race Condition Risk

If two instances of run are running in parallel, this could lead to a poisoned cache, and incorrect parse results. Since the parser does not operate on the basis of Promises, it is unclear to me whether this is an important risk.

2.2. Incremental Parsing

This implementation resets the global cache on every run and fork call. This is necessary to avoid errors in parsing - because the cache is parameterized only on state, and not target, the cache can become poisoned unless it is refreshed on every individual run.

On our fork, I think we will attempt to parameterize the state update cache with target, so that the cache may persist across multiple runs, thereby implementing efficient incremental parsing. A configurable LRU cache will be utilized to prevent arbitrary memory usage.

I have avoided implementing this more advanced form in this pull request, as it is a lot more experimental, opinionated, and risky. As yet, I am not convinced it's even possible to do easily - if target always corresponds to do full input, then parameterizing the cache on target won't implement incremental parsing.

3. Suggested release notes

Versioning is left up to maintainer discretion, but for convenience here is my suggested changelog.

## 4.1.0

- Implement Packrat Parsing to improve cyclomatic performance
  - All incremental state update functions are cached
  - The entire cache is cleared before `run` or `fork` begins parsing

4. Questions to resolve:

  • Do we want tests for cache.ts? I did not want to unnecessarily bloat the pull request, but if we want them I can write some.
  • Do we want to wait for incremental linear parsing after all?
  • Should changing the cyclomatic complexity of a function count as a breaking change?
  • Is any additional documentation necessary about this internal design?

@francisrstokes
Copy link
Owner

hey @poteat - thanks for putting this together so quickly! I'm on a little family retreat for a few days, so I'll take a proper look when I get back next week 👍

@poteat
Copy link
Author

poteat commented May 27, 2022

@francisrstokes I'd love to pick this back up at some point!

@rishavs
Copy link

rishavs commented Jul 15, 2022

@francisrstokes and @poteat , this pull has been pending for a while. Can this be merged with the base?
With the kind of performance improvements as mentioned in the parent issue, I think this will be a great win for all arcsecond users.

@francisrstokes
Copy link
Owner

My apologies for not getting back to this PR in a more timely manner (to put it lightly!). I would also love to see the performance benefits provided by this approach make their way into the repo. I do have concerns about the cache though - exactly what you have mentioned @poteat - that it essentially disallows concurrent usage of the library.

To solve this, we need a mechanism where we can have multiple, contextual caches, or at a minimum, the ability to opt-in and out of caching.

Instead of caching at construction time, we might want to have this process take place at parser.run time using a user provided cache object. What do you both think?

@rishavs
Copy link

rishavs commented Jul 15, 2022

I fear I do not understand the idea well enough to have an opinion on a user provided cache object.
About the caching opt-in, I think just a simple caching on/off flag should be a great first step. Based on further user feedback you can then decide the best way to go forward. For now, any way to stop combinatorial explosions would be welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants