Skip to content

Latest commit

 

History

History
240 lines (209 loc) · 11.5 KB

README.md

File metadata and controls

240 lines (209 loc) · 11.5 KB

Antelope

Antelope is a parser generator that can generate parsers for any language*. In the sense of actually creating a parser, it works kind of like [Bison][bison] - you give it an input file, say, language.ace, and it generates a parser for it, in, say, language.rb. Only, instead of Bison's support only for C, C++, and Java, Antelope is meant to generate parsers for multiple languages. Antelope is also written in Ruby for understandability.

Enough about that, though, let's get into Antelope.

Installation

Since you'll only typically use Antelope from the command line, I suggest you install it like so:

$ gem install antelope

If, however, you plan on using it in an application, or need it as a part of a library, you can add gem "antelope" to your Gemfile.

Usage

Antelope is fairly simple to use; you define an .ace file, compile it, and use the proper API for the generated language.

How Antelope Works

Before getting into Ace files, however, you have to understand how Antelope works. Antelope generates a LALR(1) parser, like Bison; for your benefit, however, there are some terms here to understand:

  • LL: A type of parser. If followed by parenthesis, the number (or letter) in the parenthesis denotes the number of tokens of lookahead; i.e., LL(0) has 0 tokens of lookahead. Standing for L eft to Right, L eftmost Derivation, these tend to be handwritten as Recursive Decent parsers. LL parsers can be represented normally by a set of productions. LL parsers are Top-Down parsers, meaning they start with only the Starting symbol and try to match the input. A look at an LL parser is given [here][ll-parser]; check out [this post][tumblr-ll-parser] for more information about LL parsers.
    Antelope does not generate LL parsers.

  • Production: In parsing, a production associates a nonterminal with a string of nonterminals and terminals. It is said that the string of nonterminals and terminals reduces to the nonterminal.
    Productions take the form of A -> y, with A being the left hand side (and the nonterminal), and y being the string.

  • Starting symbol: In parsing, it is the nonterminal that is used to represent any kind of valid input.

  • Symbol: A nonterminal or a terminal.

  • Nonterminal: In parsing, a nonterminal is an abstraction used to represent any number of strings of nonterminals and terminals.

  • String: In parsing, an ordered set of symbols.

  • Terminal: In parsing, it is a concrete value given by the lexer.

  • LR: A family of types of parsers. If followed by parenthesis, the number (or letter) in the parenthesis denotes the number of tokens of lookahead; i.e., LR(0) has 0 tokens of lookahead. Standing for L eft to Right, R ightmost Derivation, they tend to be more complicated than their LL brethren. LR parsers work by starting with the entire input and finding Handles from that input, eventually ending up at the Starting symbol. LR parsers typically do this by splitting the input into two parts: the stack, and the input. LR(0), LR(1), SLR(1), and LALR(1) are all examples of LR parsers.

  • Handle: In a LR parser, it is the Leftmost complete cluster of leaf nodes (in a representative AST). When a handle is found, a reduction is performed.

  • Stack: Initially empty, it can contain any symbol, and is primarily used to represent what the parser has seen. Finding handles will purely occur at the top of the stack.

  • Reduction/Reduce: In a LR parser, this is an action corresponding to replacing the right side of a production with its left side. This purely occurs at the top of the stack, and correlates to finding a Handle.

  • Input: Initially containing the full input, it can contain only terminals; it primarily contains what the parser has yet to see.

  • LR(0): In parsing, it is a type of LR parser that uses no lookahead.
    It essentially uses a deterministic finite automaton to find possible handles. It does no checking to make sure that the possible handles are legitimate.

  • SLR(1): A part of the LR family of parsers, it upgrades LR(0) by checking to make sure that the reduction that it will make (as a part of finding a handle) is valid in the context; basically, for every reduction that it can make, it defines a set of terminals that can FOLLOW the corresponding nonterminal.

  • FOLLOW(A) set: In parsing, it defines a set of terminals that can follow the nonterminal A anywhere in the grammar.

  • LALR(1): A part of the LR family of parsers, it upgrades SLR by using a more precise FOLLOW set, called LA.

  • LA set: LA(q, A -> y) = { t | S =>* aAtw and ay reaches q }

  • Panic mode: In parsing, this is the mode that a parser can go in for recovery, if it encounters a terminal that it was not expecting. In panic mode, the parser pops terminals off of the input until it reaches a valid synchronization token. In order to utilize panic mode, at least one production must have the special error terminal in it. If the parser encounters an error, it will attempt to find a production it can use to resynchronize; if it cannot resynchronize, it will error. It then attempts to resynchronize by continuously pop terminals off of the input and discarding them, attempting to find a synchronization token. A synchronization token is a token that follows an error terminal.

  • Shift/reduce conflict: This occurs when the parser is unable to decide if it should shift the next token over from the input to the stack, or to reduce the top token on the stack. If a shift/reduce conflict cannot be solved by changing the grammar, then precedence rules may be used (see examples/example.ace).

  • Reduce/reduce conflict: This occurs when the parser is unable to decide which production to reduce. This cannot be solved by precedence.

  • Precedence: In some grammars, the Antelope runs into Shift/reduce conflicts when attempting to construct a parser. To resolve these conflicts, Antelope provides precedence declarations. Precedence is separated into levels, which each have a type; levels can be left-associative, right-associative, or non-associative. The higher the level, the higher the precedence. Think of the Order of Operations here; the operations multiply and divide are left associative, and on a higher level than add and subtract, which are still left-associative:

      MULTIPLY, DIVIDE (left-associative)
      ADD, SUBTRACT (left-associative)
    

    Exponentiation, however, is right-associative, and is higher than MULTIPLY or DIVIDE; basically, 2**2**2 would be parsed as 2**(2**2), instead of the left-associative (2**2)**2. For an example of a grammar that uses precedence, see examples/example.ace.

Defining the Ace file

The Ace file format is very similar to Bison's y files; this was intentional, to make transitions between the two easy. The Ace file should be formatted like so:

<directives>
%%
<rules>
%%
<code>

Both %% (internally called content boundaries) are required; the minimum file that is technically accepted by Antelope is therefore two content boundaries separated by a newline.

In the <directives> section, there can be any number and combinations of code blocks and directives. Code blocks are blocks of code delimited by %{ and %}, with the ending delimiter on its own line. These are copied into the output of the file directly. Directives tell Antelope information about the grammar.
An example directive would be the token or terminal directive; this lets Antelope know that a terminal by the given name exists.
Directives take the form %<name> [<value>]*, with <name> being the directive name, and <value> being a string delimited by braces, angle brackets, quotes, or nothing at all. An example of a directive would be %token ADD "+". The available directives are determined by the code generators available to Antelope at the time that the Ace file is being compiled. Some directives, however, are always available:

  • require (1 argument): This makes Antelope check its version against the first argument to this. If the versions do not match, Antelope will raise an error and fail to parse the file. It is recommended to at least require the minor version of Antelope (i.e. %require "~> 0.1").
  • token, terminal (1-2 arguments): Defines a terminal. The first argument defines its name; the second argument defines its value.
    Its value isn't used anywhere but the .output file, to make it easier to read.
  • left, right, nonassoc (1+ arguments): Defines a precedence level, and sets the type of the level based on the directive name used.
  • type: The code generator to use. Currently, the possible values for this can be null, ruby, and output.
  • define (1+ arguments): Sets a key to a value. This would do the exact same thing that using the key as a directive would do, i.e. %define something "value" does the same thing as %something "value". (note: This is not entirely true. If the key were one of the above, it would most likely raise an error, complaining that there is no directive named that.)
  • panic-mode (0-1 arguments): Enables/disables panic mode being put in the output code. Not included by default, but should be.

In the <rules> section, there can be any number of rules (which are definitions for productions). Rules have this syntax:

<head>: <body> ["|" <body>]* [";"]

With <head> being the nonterminal that the production(s) reduce to, and <body> being one or more symbols followed by an optional block that is executed when is a reduction is made using that production. A semicolon terminating the rule is optional. Rules are what make up the grammar. error, nothing, and ε are all special symbols; the first one defines the special error terminal (used for panic mode, ignored otherwise), whereas the second two are used to literally mean nothing (i.e., the rule reduces to nothing). It is not always a good idea to use the nothing symbol, since most rules can be written without it.

In the <code> section, custom code used to wrap the generated parser can be placed. In order to embed the generated parser, you must place %{write} where you want the generated parser.

Compiling the Ace file

Compiling the Ace file is somewhat straightforward; antelope compile /path/to/file.ace will cover most use cases. If you want to override the type in the Ace file, you can use the --type= command option. If it is giving an error, and you're not sure what's causing it, you can use the --verbose command option to see a backtrace.

By default, Antelope always includes the Output generator as a part of the output. This means that an .output file will always be generated along with any other files. The .output file contains information about the parser, like the productions that were used, precedence levels, states, and lookahead sets.

Language API

todo.

Contributing

  1. Fork it (https://github.com/medcat/antelope/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

* Only if there's a generator for it. [bison]: http://www.gnu.org/software/bison/ [ll-parser]: http://i.imgur.com/XhJKrDW.png [tumblr-ll-parser]: http://redjazz96.tumblr.com/post/88336053195/what-antelope-does-and-what-i-hope-it-will-do-part