Skip to content

Latest commit

 

History

History
516 lines (397 loc) · 17.9 KB

datalogParsing.md

File metadata and controls

516 lines (397 loc) · 17.9 KB
id narrative-schemas elm
litvis
../schemas/tutorial.yml
dependencies
elm/parser
latest

@import "../css/tutorial.less"

import Dict exposing (Dict)
import Parser as P exposing ((|.), (|=), Parser)
import Set exposing (Set)

This is one of a series of 'data' tutorials for use with litvis.

  1. Parsing structured text
  2. Parsing unstructured text
  3. Parsing CSV
  4. Parsing datalog
  5. Reporting helpful parsing errors, part 1
  6. Reporting helpful parsing errors, part 2

Using Elm Parser to import datalog

Thanks to Brian Hicks for inspiration for this tutorial, and in particular sharing bad-datalog.

{(infobox|}

New Elm parser functions introduced:

  • extracting words with conditions using variable
  • parsing a sequence of items with sequence
  • using previously parsed content with andThen
  • explicitly generating a parser problem with problem

{|infobox)}

datalog is a declarative logic programming language that can represent facts and rules about those facts. It provides a way of making deductive database queries cleanly and aligns well with a functional approach to programming (see this introduction for an explanation and examples).

Syntactically, it is a subset of the Prolog language, with a relatively straightforward grammar. Let's consider how we might build a parser to interpret datalog programs.

The datalog grammar

A datalog program consists of a set of statements that are either facts or rules.

Here are some example facts:

bird("parrot").
legs("cat",4).
move("bat","fly").
move("parrot","fly").

A fact is a named relation (bird or legs or move in these examples) followed by a list of zero or more constants ("parrot", "cat", 4, "bat" and "fly" in these examples). Semantically, we can choose how we interpret these relations although a fact always holds true. Sensible naming should make it obvious what that truth represents (a parrot is a bird; a cat has four legs; a bat can move by flying; a parrot can move by flying).

Here is a sample rule:

flyingBird(X) :- bird(X), move(X,"fly").

A rule allows us to infer new facts from some existing facts. The :- symbol indicates right-to-left logical implication and can be read as 'if'. Any unquoted term starting with an upper-case character in the rule is considered a variable. The statement above can be read as "X is a flying bird is true if it is known that X is a bird is true and X can move by flying is true". Or more concisely "X is a flying bird if it is a bird and it can move by flying". Note that the bird and move components of the rule are combined with a logical AND (conjunction).

If a datalog program were to comprise the four facts and one rule above, it would generate the additional inferred fact:

flyingBird("parrot").

A rule will always have a head (flyingBird(X) in this example) before the :- symbol and a body (bird(X), move(X,"fly"). in this example) following it. Values in brackets after a relation can be either variables (e.g. X) or constants (e.g. "fly") and are referred to as terms. A relation and its terms make up an atom.

We will also consider a widely used modification of the datalog grammar that allows atoms in the tail of a rule to be negated, that is, optionally to be preceded by the word not. For example,

flightlessBird(X) :- bird(X), not move(X,"fly").

To guide the building of a datalog parser it is useful to fully specify the grammar of any datalog program:

<program> ::= <fact> <program> | <rule> <program> | ɛ
<fact> ::=  <relation> "(" <constant-list> ")."
<rule> ::= <atom> ":-" <atom-list> "."
<atom> ::= <relation> "(" <term-list> ")"
<negatable-atom> ::= <atom> | "not" <atom>
<atom-list> ::= <negatable-atom> | <negatable-atom> "," <atom-list> <atom-list>
<term> ::= <constant> | <variable>
<term-list> ::= <term> | <term> "," <term-list>
<constant-list> ::= <constant> | <constant> "," <constant-list>

Note that because a rule's head is an atom and its tail may contain atoms, we can define rules recursively. And we can do so without the risk of infinite recursion (technically, datalog is a total functional language). For example the following fact and rule are valid:

brexit("UK").
meansBrexit(Country) :- meansBrexit(Country).

despite the meansBrexit rule adding no new inferred facts.

More usefully, recursive rules can model networks as graphs. When we have multiple rules with the same relation name, they are combined with logical 'OR' (disjunction). For example, here is a simple undirected graph:

connected("Windermere","Ambleside").
connected("Ambleside","Grasmere").
connected("Grasmere","Keswick").

connected(A,B) :- connected(B,A).
canTravel(A,B) :- connected(A,B).
canTravel(A,C) :- connected(A,B), canTravel(B,C).

which would generate the additional inferred facts:

canTravel("Windermere", "Windermere").
canTravel("Windermere", "Ambleside").
canTravel("Windermere", "Grasmere").
canTravel("Windermere", "Keswick").
canTravel("Ambleside", "Windermere").
canTravel("Ambleside", "Ambleside").
canTravel("Ambleside", "Grasmere").
canTravel("Ambleside", "Keswick").
canTravel("Grasmere", "Windermere").
canTravel("Grasmere", "Ambleside").
canTravel("Grasmere", "Grasmere").
canTravel("Grasmere", "Keswick").
canTravel("Keswick", "Windermere").
canTravel("Keswick", "Ambleside").
canTravel("Keswick", "Grasmere").
canTravel("Keswick", "Keswick").

Let's express the elements of the grammar as Elm types:

type Statement
    = Fact ( Relation, List Constant )
    | Rule ( Atom, List NegatableAtom )


type alias Relation =
    String


type Constant
    = Str String
    | Num Int


type alias Atom =
    ( Relation, List Term )


type NegatableAtom
    = Atom Atom
    | NotAtom Atom


type Term
    = Variable String
    | Constant Constant

This would allow us to express a datalog program with type safety:

exampleProg : List Statement
exampleProg =
    [ Fact ( "bird", [ Str "parrot" ] )
    , Fact ( "legs", [ Str "cat", Num 4 ] )
    , Fact ( "move", [ Str "bat", Str "fly" ] )
    , Fact ( "move", [ Str "parrot", Str "fly" ] )
    , Rule
        ( ( "flyingBird", [ Variable "X" ] )
        , [ Atom ( "bird", [ Variable "X" ] )
          , Atom ( "move", [ Variable "X", Constant (Str "fly") ] )
          ]
        )
    ]

Specifying datalog programs explicitly this way would be somewhat tedious, which is why we wish to create a parser to convert raw datalog text into its equivalent Program.

Building Parsers from Bottom-Up

For more complex grammars like this, it is often easier to build the parsers from 'bottom-up' – that is to consider the parsers that handle the lowest level input (e.g. numbers or strings) first and then the parsers that assemble these low level parsers afterwards.

Constants

Constants can be one of a numeric value, single word starting with a lowercase letter, or some quoted text (that may start with an uppercase latter and contain non-alphabetic characters).

A numeric constant, that may be positive or negative, can be found in a similar way to previous tutorials, in this case placing it in our Num custom type:

numConstant : Parser Constant
numConstant =
    P.oneOf
        [ P.int
        , P.succeed negate
            |. P.symbol "-"
            |= P.int
        ]
        |> P.map Num

For string constants we can use Elm parser's variable that allows us extract a word from input, but impose some conditions on what words are valid. In this case we can specify it must start with a lowercase letter, the rest of the word can be any alphanumeric character or an underscore, but it must not be the word not (which we will reserve for negating an atom) before wrapping it in a Str

strConstant : Parser Constant
strConstant =
    P.variable
        { start = Char.isLower
        , inner = \c -> Char.isAlphaNum c || c == '_'
        , reserved = Set.singleton "not"
        }
        |> P.map Str

A quoted constant can be any text as along as it starts and ends with double quotation marks. We remove the quotation marks themselves with String.slice before returning the constant value:

quotedConstant : Parser Constant
quotedConstant =
    P.succeed ()
        |. P.token "\""
        |. P.chompWhile ((/=) '"')
        |. P.token "\""
        |> P.getChompedString
        |> P.map (String.slice 1 -1 >> Str)

The three constant variants can be combined in a general constant parser.

constant : Parser Constant
constant =
    P.oneOf [ numConstant, strConstant, quotedConstant ]

Variables

Datalog variables can be similarly parsed with variable, this time specifying that they must start with an uppercase letter.

variable : Parser Term
variable =
    P.variable
        { start = Char.isUpper
        , inner = \c -> Char.isAlphaNum c || c == '_'
        , reserved = Set.empty
        }
        |> P.map Variable

Relations

Relation names are simply lower case words, so again we can use Elm parser's variable (not to be confused with the concept of the datalog variable).

relation : Parser Relation
relation =
    P.variable
        { start = Char.isLower
        , inner = \c -> Char.isAlphaNum c || c == '_'
        , reserved = Set.singleton "not"
        }

Assembling relations of constants and variables

Now that we can identify the lowest level of data with parsers, lets consider how we can group them together.

A relation should be followed by a series of terms (constants and variables) which together form a valid atom:

term : Parser Term
term =
    P.oneOf
        [ variable
        , constant |> P.map Constant
        ]

An atom is a named relation of terms where a list of terms is comma-separated and enclosed in parentheses. We can handle this compactly using the sequence parser designed for parsing lists of items.

atom : Parser Atom
atom =
    P.succeed Tuple.pair
        |. P.spaces
        |= relation
        |. P.spaces
        |= P.sequence
            { start = "("
            , item = term
            , separator = ","
            , end = ")"
            , spaces = P.spaces
            , trailing = P.Forbidden
            }

It is possible that in the body of rule, an atom may be negated. A negated version of an atom will be preceded by not:

negatedAtom : Parser NegatableAtom
negatedAtom =
    P.succeed NotAtom
        |. P.spaces
        |. P.symbol "not"
        |. P.spaces
        |= atom

And a negatable atom will be one of the two types of atom:

negatableAtom : Parser NegatableAtom
negatableAtom =
    P.oneOf
        [ negatedAtom
        , atom |> P.map Atom
        ]

We now have the parsers necessary to assemble into rule and fact parsers.

Fact and Rule Parsing

One of the parsing challenges we face when parsing input characters sequentially is that when we identify a relation and list of terms, we don't know whether it will form a fact or rule until we find either a . or a :- following it. And should it contain any variables, this would be fine if the head of a rule, but would be a problem as a definition of a fact – something we can only establish after we've parsed the atom and the following symbol.

Elm parser does have the ability to make a parser backtrackable which could provide a means to overcome this problem. But where possible it is good practice to avoid backtracking to make the parser as efficient as possible (we only want to parse each character once if we can). Avoiding backtracking also helps when reporting errors, which we will consider in the next chapter.

Let's therefore consider an approach that allows just a single pass through our input. Firstly, we parse the atom we expect at the start of any statement. And then we can decide on whether the atom forms part of a fact or a rule. If it is a fact we need to check that the atom contains only constants. To do this we use the Elm parser function andThen. This function takes some parsed content to generate a new parser allowing us to validate the content even after a succeeding parse.

statement : Parser Statement
statement =
    P.succeed identity
        |= atom
        |. P.spaces
        |> P.andThen (\atm -> P.oneOf [ fact atm, rule atm ])

Our fact parser takes a successfully parsed atom and then checks it is followed by a terminating . and that the list of terms in the atom are all constants. We can force the parser to fail if it finds anything other than constants in the atom by calling problem with a suitably explanatory error message.

fact : Atom -> Parser Statement
fact ( r, terms ) =
    let
        constants =
            List.filterMap
                (\t ->
                    case t of
                        Constant c ->
                            Just c

                        _ ->
                            Nothing
                )
                terms

        onlyConstants =
            if List.length constants == List.length terms then
                P.succeed (Fact ( r, constants ))

            else
                P.problem "A fact should only contain constants"
    in
    P.succeed identity
        |. P.symbol "."
        |. P.spaces
        |= onlyConstants

Parsing the body of a rule, which should be list of negatable atoms following the := symbol, is simplified by using the sequence parser. Unlike the fact parsing we will only perform syntactical checking of the rule as terms can be both constants and variables.

rule : Atom -> Parser Statement
rule head =
    P.succeed (\bdy -> Rule ( head, bdy ))
        |= P.sequence
            { start = ":-"
            , item = negatableAtom
            , separator = ","
            , end = ""
            , spaces = P.spaces
            , trailing = P.Forbidden
            }
        |. P.symbol "."
        |. P.spaces

Top-Level Parsers

Finally we can create a top-level program parser that simply loops through input parsing statements until the end of input is reached.

program : Parser (List Statement)
program =
    let
        programHelp statements =
            P.oneOf
                [ P.succeed (statements |> List.reverse |> P.Done)
                    |. P.end
                , P.succeed (\st -> st :: statements |> P.Loop)
                    |= statement
                ]
    in
    P.loop [] programHelp

And to run our parser, we will for the moment, convert the parsed result into a Maybe.

parse : Parser a -> String -> Maybe a
parse parser =
    P.run parser >> Result.toMaybe

Conclusions

This chapter has considered how we translate a well-specified grammar into a parser that transforms some input text into custom types that represent that grammar. It introduced an important addition to the approaches we can take to parsing, namely the use of andThen to validate some parsed content after, rather than during, it has been parsed. The chapter provides practice in applying the parser combinator approach and lays the ground for considering one of the main strengths of the Elm Parser, that of robust error handling to provide useful feedback on the parsing process.


Appendix: Testing

Simple Example

^^^elm r=(parse program testInput1)^^^

Example with multiple rules and negation

^^^elm r=(parse program testInput2)^^^

Network Example

^^^elm r=(parse program testInput3)^^^


Appendix: Test Programs

testInput1 : String
testInput1 =
    """
bird("parrot").
legs("cat",4).
move("bat","fly").
move("parrot","fly").

flyingBird(X) :- bird(X), move(X,"fly").
"""
testInput2 : String
testInput2 =
    """
bird("parrot").
bird("emu").
mammal("cat").
mammal("dolphin").
mammal("bat").
mammal("human").
legs("cat",4).
legs("parrot",2).
legs("bat",2).
legs("human",2).
legs("dolphin",0).
move("parrot","fly").
move("parrot","walk").
move("emu","walk").
move("cat","walk").
move("bat","fly").
move("dolphin","swim").
move("human","walk").

flyingAnimal(Animal) :-
  move(Animal,"fly").

bipedalMammal(Animal) :-
  mammal(Animal),
  legs(Animal,2).

flightlessBird(Animal) :-
  bird(Animal),
  not move(Animal,"fly").
"""
testInput3 : String
testInput3 =
    """
connected("Windermere","Ambleside").
connected("Ambleside","Grasmere").
connected("Grasmere","Keswick").

connected(A,B) :- connected(B,A).
canTravel(A,B) :- connected(A,B).
canTravel(A,C) :- connected(A,B), canTravel(B,C).
"""