Parsec 4 is an update to He Tao's Parsec 3.3. The changes fall into two groups, minor updates and expansions, with one behavior change. I have also provided a sketch of how to use Parsec 4 for anyone just getting started with parsing.

NOTE: This section will only make sense to users of Parsec 3.3.
The `string` parser in Parsec 3.3 tried to match the first n characters of the text shred to the parser's datum. This leads to some unexpected behavior when there is a partial match of n characters, where n is less than the length of either the text shred or the target. For example, suppose your `string` parser is looking for `bayesian` and the input text is `bayside`. In Parsec 3.3, the return value indicates failure, but the index is advanced to the position after the `y`. Is this the desired behavior? Maybe, but most programmers will guess incorrectly that a failure does not consume the `bay` from the input.
There are now two `string` parsers, `parsec3_string` and `parsec4_string`. The first one follows Parsec 3.3 behavior; the second one does not consume input in the case of a partial match. Either can be used directly by its name. The `string` parser can be associated with either one, and by default it is associated with `parsec4_string`. Users can select the Parsec 3.3 version by setting the environment variable `PARSEC3_STRING=1`.
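To make the difference concrete, here is a minimal sketch. It assumes Parsec 4 keeps Parsec 3.3's convention that a parser object is callable as `parser(text, index)` and returns a `Value`; the exact field values are illustrative.

```python
from parsec4 import parsec3_string, parsec4_string

# Both parsers fail to match 'bayesian' against 'bayside',
# but they leave the index in different places.
v3 = parsec3_string('bayesian')('bayside', 0)
v4 = parsec4_string('bayesian')('bayside', 0)

print(v3.status, v3.index)   # False 3 -- the partial match 'bay' was consumed
print(v4.status, v4.index)   # False 0 -- no input was consumed
```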
I changed the name of the `any()` parser to `any_char()` to avoid conflicts with the Python built-in of the same name. When Parsec 3.3 was imported with `from parsec import *`, the declaration/definition of `parsec.any` would hide the `any` builtin. In Parsec 4, the parser has been renamed `any_char`.
- Added explanatory comments to help programmers who are less familiar with the concepts of parsing.
- Provided more natural English grammar for He Tao's comments. His comments are often included in situ, with additional comments by me. New comments are marked with a preceding and following group, `###`, and the original docstrings use triple single quotes rather than triple double quotes.
- Revised for modern Python; no longer compatible with Python 2. This version requires Python 3.8. Where practical, f-strings are used for formatting, and the walrus operator (`:=`) makes an appearance here and there.
- Inserted type hints.
- Added a `__bool__` function to the `Value` class.
- Changed some string searches to exploit constants in the `string` module rather than `str` functions that might be affected by locale.
- An explicit `Value.__str__` is now provided so that `Value` objects are more aesthetically pleasing when printed.
NOTE: There are a number of vestigial functions for which I have found no use. These functions have been left in situ so that existing code that works with Parsec 3.3 will also work with Parsec 4.
At the University of Richmond, it is common to use parsec4 for user input processing. I have added:

- Symbolic names for characters commonly involved in parsing. They are named with standard symbols: `TAB`, `NL`, `CR`, etc.
- Many custom parsers are likely to include parsers for common programming elements (dates, IP addresses, timestamps). These are now included. Note the pattern that the core `regex` parsers are in uppercase, and the corresponding `lexeme` parser is in lowercase.
```python
WHITESPACE = regex(r'\s*', re.MULTILINE)
lexeme = lambda p: p << WHITESPACE

DIGIT_STR = regex(r'(0|[1-9][\d]*)')
digit_str = lexeme(DIGIT_STR)

HEX_STR = regex(r'[0-9a-fA-F]+')
hex_str = lexeme(HEX_STR)

IEEE754 = regex(r'-?(0|[1-9][\d]*)([.][\d]+)?([eE][+-]?[\d]+)?')
ieee754 = lexeme(IEEE754)

IPv4_ADDR = regex(r'(?:(?:25[0-5]|2[0-4][\d]|[01]?[\d][\d]?)\.){3}(?:25[0-5]|2[0-4][\d]|[01]?[\d][\d]?)')
ipv4_addr = lexeme(IPv4_ADDR)

PYINT = regex(r'[-+]?[\d]+')
pyint = lexeme(PYINT)

TIME = regex(r'(?:[01]\d|2[0123]):(?:[012345]\d):(?:[012345]\d)')
time = lexeme(TIME)

TIMESTAMP = regex(r'[\d]{1,4}/[\d]{1,2}/[\d]{1,2} [\d]{1,2}:[\d]{1,2}:[\d]{1,2}')
timestamp = lexeme(TIMESTAMP)

US_PHONE = regex(r'[2-9][\d]{2}[ -]?[\d]{3}[ -]?[\d]{4}')
us_phone = lexeme(US_PHONE)
```
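A short usage sketch of these parsers follows; it assumes the `regex` parsers return the matched string, as they do in Parsec 3.3.

```python
from parsec4 import ipv4_addr, us_phone, ieee754

print(ipv4_addr.parse('192.168.1.10'))   # '192.168.1.10'
print(us_phone.parse('804-555-1212'))    # '804-555-1212'
print(ieee754.parse('6.022e23'))         # '6.022e23'
```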
There is a pre-existing `ParseError` that is reserved for use by Parsec, and it is raised when Parsec cannot continue. I have included two `Exception` classes that are identical except for name: `EndOfGenerator` and `EndOfParse`. Each is derived from `StopIteration`. Each returns a `Value` object, and you can write a `try` block to accept either one, both, or if you don't care, then use:

```python
try:
    something()
except StopIteration as e:
    ...
```
The meaning of the exception is up to you.
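For example, here is a sketch of distinguishing the two. It assumes, per the usual `StopIteration` behavior, that the payload is available as the exception's `value` attribute; `run_parser()` is a hypothetical function that raises one of them.

```python
from parsec4 import EndOfGenerator, EndOfParse

try:
    run_parser()        # hypothetical; raises one of the two exceptions
except EndOfGenerator as e:
    result = e.value    # the Value carried by the exception
except EndOfParse as e:
    result = e.value
```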
There are two completely new functions.

- `ascii_letter` has been added to restrict the definition of a letter to `[a-zA-Z]`. Parsec 3.3 has a `letter` parser that succeeds if the character is anything the Unicode standard recognizes as a letter.
- `parser_from_strings` is a factory method to create a sequence of parsers, one for each word in a whitespace delimited string or a sequence of strings. This is a fairly common need in the uses I have had for Parsec. For example, `p = parser_from_strings("hello world")` produces:

```python
p = lexeme(string("hello")) ^ lexeme(string("world"))
```

Note that the result is a parser, `p`, that has been created by joining two parsers with the try-choice operator, as the usage sketch below shows.
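A usage sketch, assuming the expansion shown above:

```python
p = parser_from_strings("hello world")
print(p.parse('hello'))   # 'hello'
print(p.parse('world'))   # 'world'
```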
Parsec is best used as a kit for constructing parsers of your own for your purposes. It is a bit like the toys known as LEGOs: they are the basic building blocks with a finite number of shapes, and from them you build toy houses. The toy house is analogous to the top-level parser, and the LEGO blocks are the monads.
Each parser has a `.parse()` function, and since new parsers are constructed by combining the monadic parsers, your new parsers also have a `.parse()` method. Let's say your top-level parser is named `p`; then to parse some text there will be a line somewhere in your program that looks like this:

```python
results = p.parse(text)
```

`p` should return the objects retrieved from parsing `text`. The simplest parser of all might be one that reads text one word at a time, where word has the conventional meaning in English.
The following code will import everything from the `parsec4` module. The usual advisories regarding importing the contents of a module rather than the module itself apply.

```python
import parsec4
from parsec4 import *
```
A parser that retrieves a whitespace delimited word could be written like this:

```python
p = parsec4.regex(r'[a-zA-Z]+')
```

Note that `p` is not the word that is retrieved. Instead, `p` is a parser that retrieves a word based on the regular expression supplied to the function `parsec4.regex`. Every parser has a `parse()` function, and calling `p.parse()` with a piece of whitespace delimited text returns the first word of the text.

```python
>>> p.parse('Your name')
'Your'
```

Of course, `p` is not enough to parse an entire language; it is a bare minimum parser that merely gets the first word from a whitespace string.
Parser terminology is unfamiliar to many, and even when some of the terms are known, it is not clear precisely what they mean. Below is a short list of definitions that are less rigorous than what you might find in a computer science text, but sufficiently exact to be useful.
The order of these definitions is intended to reflect the order that the terms make sense, and all the examples that support the definitions come from the Python language.
`text` --- The input for the parser represented as a sequence of bytes. Whether the text is read from a file or typed in by a user is unimportant.

`index` --- The index is nothing more than the current position of the "next" byte in the text. In Parsec, the initial index is usually zero (0). The bytes that are already parsed are inaccessible.

`shred` --- A non-empty sequence within the text, possibly all that remains, that has not yet been parsed.

`lexeme` --- The smallest collection of adjacent bytes that has a meaning to the parser. It could be just one byte, like `=`, or it could be `a_very_long_variable_name`.

`monad` --- A parser for a lexeme. Each lexeme has a parser that produces it by a combination of [1] operations that read the input and [2] operations that produce the lexeme. The two actions are said to be bound within the monad.

`token` --- Tokens are the defining concept in parsers; quite literally, a parser transforms a stream of text into a stream of tokens. A token is a collection of one or more lexemes that are adjacent in the text, combined with a unit of meaning in the language being parsed. The type of meaning defines a class of tokens, such as operator, identifier, etc. A token may have different meanings in different languages; the `<` represents "less than" in Python, but is the opening of a tag in HTML. Tokens may also influence each other's meanings:

- `<` --- a comparison operator token meaning "less than."
- `=` --- an assignment operator token.
- `<=` --- a single token meaning "less than or equal to."
- `< =` --- a syntax error, because an embedded space is not allowed.

`expression` --- A group of adjacent tokens that can be combined and evaluated. A trivial but useful example is `2 + 3`.

`sequence point` --- A token that forces the preceding expression to be evaluated. Examples are line breaks in Python, semicolons in C and C++, and parentheses in almost all languages.

`statement` --- An expression combined with a terminating sequence point.
Parsec is not itself a parser for any language; it is a parser construction kit based on the idea of monads. Each monad, whether it is one provided within Parsec or something you write or create by combining Parsec's parts, should do these two things:

- Accept arguments that are a `str` that the monad examines, and an `int` that represents the offset into the `str` where your parse should begin. The offset defaults to zero, which makes it very convenient to write statements like `p.parse("hello world")` instead of `p.parse("hello world", 0)`.
- Return a `Value` object (defined in `parsec.py`) that is a named tuple: `(status:bool, index:int, found:object, expected:object)`.

Both Parsec 3 and Parsec 4 provide convenience factories for the concepts of success and failure of the parser. Success is `Value(True, index, found, None)` and failure is `Value(False, index, None, expected)`.
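A minimal sketch of the convention, assuming the named tuple layout above (the real class in parsec4.py adds functionality such as `__bool__` and `__str__`):

```python
from collections import namedtuple

Value = namedtuple('Value', 'status index found expected')

def success(index: int, found: object) -> Value:
    # The parser found something; report it and the new index.
    return Value(True, index, found, None)

def failure(index: int, expected: object) -> Value:
    # The parser failed; report what it was looking for.
    return Value(False, index, None, expected)
```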
Other than `found` and `expected`, what do these terms mean, and what values can you expect?

- If the parsing operation is a success, the parser returns some partial string from the `str` that meets the criteria of the parser, and the index will be greater than or equal to where it was before the parsing operation.
- If the parsing operation fails, it returns what it was looking for, and the index is (again) greater than or equal to the value when the parser began.

In other words, the parser never "rewinds" the input, but it does not always advance the index. If the parser does not advance the index, there could be a few reasons:

- It is performing look-ahead; it is seeing if something is present, but not operating on it.
- There is nothing left (i.e., you have reached the end of the input).
- It failed, and left the index unchanged. This is called "not consuming any input."
Let's revisit the `p` parser, which recognizes only what most of us call words.

```python
>>> p.parse('Your name')
'Your'
```

In this example, `Your name` is the text. It is nine characters, and the index corresponds to the positions used by Python's slicing operator. Conceptually, `p` applies the regex and extracts "Your". The index is now set to 4, and the remaining string is `" name"` (note the leading space). Assuming that `p` is a component of some other parser, it will send `text[4:]` to the next parser to invoke according to its rules.
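You can see the leftover text directly with `parse_partial()`, which (as in Parsec 3.3) returns the parsed result along with the unconsumed remainder:

```python
result, rest = p.parse_partial('Your name')
print(result)   # 'Your'
print(rest)     # ' name'
```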
Parsec contains a class named ... `Parser`. This class has all the methods in it that support parsing. It also contains a method named `bind` that turns a functional operation into a parser by wrapping the function in `Parser` functionality. You do not need to call `bind` or any of the other functions in `Parser` if you use the `@Parser` class decorator.

Parsec also contains a function named `generate` that is used as a decorator in front of a function that does the parsing. The `@generate` decoration creates a Russian Doll structure of wrappings around the functions.
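As a minimal sketch of the `@generate` style (it uses the `identifier` parser defined later in this document; the grammar is invented for illustration):

```python
@generate
def pair():
    # Parse 'name = value' and return the two pieces as a tuple.
    k = yield identifier
    yield lexeme(string('='))
    v = yield identifier
    raise EndOfGenerator((k, v))

# pair.parse('name = value') -> ('name', 'value')
```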
Here is an alphabetized list of the most useful built-ins, with a description of what constitutes success, what it returns, and under what circumstances the parser advances the index ("consumes input").

Parser | Description | Success | Returns | Index |
---|---|---|---|---|
`eof` | tests for end of a string | index at or beyond the end of the string | boolean | always unchanged |
`regex` | regular expression match of a string | matches the regex | first matched string | moved ahead by length of the matched string; unchanged on failure |
As a translation of the Parsec library in Haskell, and getting a breath of new-old life from operator overloading in C++, the combinator operations (combining simple parsers to do more complex things) are done with overloaded operators. There are constraints:

- Python's order of operations is based on the symbols used, not the meanings of the symbols.
- Binary operations remain binary. Unary operations remain unary.
- The order of operations can be controlled with parentheses, just like in Python.

The unary operations are not of much use when "combining" elements, so it is no surprise that the binary operators are the ones used. Let's use letters familiar to those of us with a little math background, and we will call one parser `f` and the other `g`.
Operation | Meaning |
---|---|
`f >> g` or `f > g` | If `f` fails, return its failure. Otherwise, ignore the successful result, and execute `g`. |
`f << g` | If `f` fails, return its failure. Otherwise, execute `g`. If `g` succeeds, return the success value of `f` and discard the success of `g`. Index is always the index of `g`. |
`f < g` | If `f` fails, return its failure. Otherwise, execute `g`. If `g` succeeds, return the success value of `f` and discard the success of `g`. Use the index of `f` on success of both `f` and `g`, but the index of `g` on failure of `g`. |
`f \| g` | If `f` succeeds, return its success. If `f` failed and the index has not moved, execute `g`. If `f` failed and the index has moved, return the moved index position. |
`f ^ g` | If `f` succeeds, return its success. If `f` failed, execute `g` with the same index tried with `f`. |
`f + g` | If `f` succeeds, aggregate with the result of `g` as a tuple. If `f` fails, return its failure. |
`f / g` | If `f` succeeds and `g` fails, return the result of `f`. If `f` fails or `g` succeeds, then the index is unchanged. The idea is that "`f` is not followed by `g`." |
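A short sketch of a few of these combinators in action; the grammar here is invented for illustration, and it assumes `regex` and `string` from parsec4 are in scope:

```python
key   = regex(r'[a-z]+')
equal = string('=')
val   = regex(r'[0-9]+')

# '>>' discards the '=' sign; '+' pairs the two retained results.
assignment = key + (equal >> val)
print(assignment.parse('x=42'))   # ('x', '42')
```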
Let's start with the easiest of these to work with: exit. Just "exit" all by itself, or in the language of Parsec, ...

```python
exit = string('exit') < eof()
```

Before we go forward, we must have a clear understanding of the expression above. Since the types of the expressions on each side of the assignment operator (`=`) must be the same, the left-hand-side (LHS) is a parser, not something we have parsed. In fact, `exit` is a `Parsec.Parser` that we built by combining two other `Parsec.Parser` objects: [1] a parser whose successful result is finding the string "exit" and [2] a parser that expects to find the end of the string.

The most common question I have been asked is "How does `exit` parse anything??" The answer is that it inherits the `.parse()` method because `string` and `eof` are constructed by the `@generate` decorator that wraps the operations in the `Parser` class. The `<` operator takes a parser on both the LHS and RHS and returns a parser that also has a `.parse()` method.
Now that we have a better understanding of the revolutionary syntax, let's see if the above is really what we want. It will fail if the user adds a space and types "exit " followed by the return key:

```python
>>> exit = string('exit') < eof()
>>> exit.parse('exit')
'exit'
>>> exit.parse('exit ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/milesdavis/parsec4/parsec4.py", line 290, in parse
    return self.parse_partial(text)[0]
  File "/home/milesdavis/parsec4/parsec4.py", line 305, in parse_partial
    raise ParseError(result.expected, text, result.index)
parsec4.ParseError: expected: ends with EOF at <bound method ParseError.loc of ParseError('ends with EOF', 'exit ', 4)>
```
We can use the predefined `lexeme` parser, which vacuums up trailing whitespace, to forgive this extra keystroke:

```python
# from parsec4.py
lexeme = lambda p: p << WHITESPACE
```

```python
>>> exit = lexeme(string('exit')) < eof()
>>> exit.parse('exit ')
'exit'
```
Sadly, this will still fail if the user types " exit". In this case, another builtin comes to our rescue, and it is almost common-sense readable:

```python
exit = WHITESPACE >> lexeme(string('exit')) < eof()
```
What else could go wrong? Many users might try leaving the program with quit, so perhaps we should go with:

```python
exit = WHITESPACE >> lexeme(string('exit')) ^ lexeme(string('quit')) < eof()
```
And there is always someone with the caps-lock engaged, knowingly or unknowingly, and that user will scream in all caps, EXIT. One choice in Parsec 4 is to use the parser factory method `parser_from_strings`, and write:

```python
exit = WHITESPACE >> parser_from_strings('exit EXIT quit QUIT') < eof()
```
Assuming the user is typing in the commands, or that they are being read from a file, perhaps via I/O redirection, it is fair to ask this question: "Why not take the user's input string, `s`, and place a line of Python in the function that reads the input that says the following?"

```python
s = s.strip().lower()
```
This approach is best, and parsers of all types for all languages benefit from well behaved data. It is also a lot simpler than running the text through a number of parsers. And with a nod to the efficiency police, it is no doubt more efficient to use Python's built-ins than it is to use Parsec's functions.
So let's go with the idea that the input has been stripped of trailing whitespace and pounded down to lower case. What else could Parsec offer us? Let's consider our desire to treat exit and quit as the same action. Written as it is above, the Parsec code that says `lexeme(string('exit')) ^ lexeme(string('quit'))` has two possible results (i.e., the text literal "exit" or the text literal "quit") when what we want is exactly one result that represents the request to exit.
The reason to have only one result may not be obvious, and I will explain it this way: We want our parser to recognize that the user is requesting to exit the program. The rest of our code should not care how the user provided this information to our program. It would be clumsy to execute the parsing code and then check again to see that the user typed either exit or quit when either is satisfactory. If we create a single parser for the exiting-event, we can avoid writing the following code --- code that essentially duplicates our parsing effort.
```python
if user_command in ('quit', 'exit'):
    sys.exit(os.EX_OK)
```
If we are satisfied with "exit" as the representation, then we can write our parsing expression in Parsec's language as

```python
(lexeme(string('exit')) ^ lexeme(string('quit'))).result('exit')
```

Note the parentheses around the try-choice expression; without them, `.result('exit')` would apply only to the `quit` parser.
`.result()` is one of two functions that are useful for transforming the output of the parsing operation while it is happening. The `.result()` function provides a new value for a parser that has succeeded. Whatever the argument to `.result()` is, that becomes the result of the parsing operation.
In fact, the value supplied for `.result()` can be anything you think is useful! If you were using a function lookup table, you might want to use something like the following, supplying the Python built-in function that exits any program:

```python
(lexeme(string('exit')) ^ lexeme(string('quit'))).result(sys.exit)
```
The other widely used transformation is `.parsecmap()`. Whereas `.result()` supplies a replacement value, the argument to `.parsecmap()` is a function that is applied to the successful result of the parsing; that is, `.parsecmap()` uses the value of the parsing, and transforms it in some way.

The use cases for `.parsecmap()` are different, and they fall into two groups:

- the argument is another function that is a part of the current parser, or ...
- the argument is a native function of Python such as `math.sqrt`.
Let's say you wanted to parse an expression like "sqrt 3". You can go about it this way (note the `.parsecmap(float)` step: the `regex`-based `ieee754` parser returns the matched text, which must become a number before `math.sqrt` can be applied):

```python
square_roots = lexeme(string('sqrt')) >> lexeme(ieee754).parsecmap(float).parsecmap(math.sqrt)
```
The first part recognizes the word sqrt. We only advance in the `square_roots` parser if we find the text "sqrt". `lexeme` vacuums up any trailing whitespace, and then `ieee754` looks for a number that meets the IEEE754 definition of the representation of a floating point value. NOTE: the `ieee754` parser is a new addition to Parsec 4.

If `ieee754` finds a number, the matched text is converted to a `float`, and `math.sqrt` is invoked on the number. It will clearly fail if the IEEE number is negative, a problem that can be dealt with in a number of ways --- perhaps using `cmath.sqrt` as the function.
A key point is that we can discard the "sqrt" text because we only care about the fact that it was found. It is of no use in deciding what to do next.
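A usage sketch, assuming the float conversion shown above:

```python
>>> square_roots.parse('sqrt 3')
1.7320508075688772
```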
Suppose you were writing a program to convert `csh` to `bash`. You will find that shell and environment variables are introduced with the following syntax in `csh`:

```csh
setenv GAUSS_MDEF "$2"
set refextz = "${refext}.gz"
set colon=":"
```

The equivalent statements in `bash` are:

```bash
export GAUSS_MDEF="$2"
refextz="${refext}.gz"
colon=":"
```
`setenv` is easier, so let's start there. We need a general parser for identifiers --- tokens made of printable characters that will represent the names of objects in the program. Depending on the specifics of the language, this simple parser will do the job:

```python
identifier = lexeme(regex(r'[^\s]+'))
```

That's not very elegant, so perhaps something like this (depending on specifics of the language) instead:

```python
identifier = lexeme(regex(r'[$_a-zA-Z][a-zA-Z0-9_]*'))
```
A function to handle the `setenv` statement in `csh` will look like the code below, presented with line numbers to aid the discussion that follows:

```python
1. @generate
2. def setenv():
3.     yield lexeme(string('setenv'))
4.     k = yield identifier
5.     v = yield quoted ^ identifier
6.     raise EndOfGenerator(f'export {k}="{v}"')
```
The decorator in line 1 transforms the ordinary Python function into a generator/parser. The operations inside `setenv()` are bound to a newly created parser of the same name.

Line 3 recognizes the string literal "setenv" and removes the trailing space. We don't need to retain anything from the operation because `bash` does not use the "setenv" text. The parser will terminate if the string is not found, so execution continues only when we are in a setenv statement.

Line 4 retrieves the identifier --- in this case, the name of the environment variable.

Line 5 is more complex: it first looks to see if the value is quoted, and if it is not, it falls back to treating the value as a bare identifier containing no whitespace characters. This is a good example of why the more complex representation must be searched for first.

Line 6 communicates the outcome of our parsing operations back to the calling parser with the `EndOfGenerator` exception. It contains text in `bash` syntax that represents the same information.
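A usage sketch, assuming a `quoted` parser exists (as the example implies) and returns the contents of a double-quoted string without the quotes:

```python
>>> setenv.parse('setenv GAUSS_MDEF "$2"')
'export GAUSS_MDEF="$2"'
```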
Simple `set` statements like the ones pictured could be handled similarly, but if the expression is something like the following, we are going to need another parser:

```csh
set path = (/usr/local/anaconda/anaconda3/bin $path)
```

The value operation in line 5 will be something like:

```python
v = quoted ^ parenthetical ^ identifier
```

Note that we still need to check first to see if it is quoted so that we are able to understand expressions like this one, and correctly recognize the value as a string rather than the result of a concatenation expression.

```csh
set path = "(/usr/local/anaconda/anaconda3/bin $path)"
```
How might we handle the parenthetical expression? We obviously need parsers for the delimiters:

```python
lparen = lexeme(string(LPAREN))
rparen = lexeme(string(RPAREN))
```

The operation inside the parens is string concatenation, and the space in this case is a delimiter that we want to keep around. Let's extract and reuse the regex from the identifier parser, and call it a blurb:

```python
blurb = regex(r'[$_a-zA-Z][a-zA-Z0-9_]*')
```

Now, we can put together a parser for parenthetical expressions:

```python
@generate
def parenthetical():
    yield lparen
    elements = yield sepBy(quoted ^ blurb, WHITESPACE)
    yield rparen
    raise EndOfGenerator(" ".join(elements))
```
Note the nice way that the result of "combinating" the `quoted` and `blurb` parsers allows us to create a new (and nameless) parser that is the argument to the `sepBy` parser.
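A usage sketch, again assuming the `quoted` parser and the definitions above (note that `blurb` as written matches identifier-like elements such as `$PATH`, not slash-separated paths):

```python
>>> parenthetical.parse('($PATH $HOME)')
'$PATH $HOME'
```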