tanloong/pytregex

Tregex is a Java program for identifying patterns in constituency trees. PyTregex provides similar functionality in Python.

Usage

Command-line

Install it with pip install and run it with python -m pytregex.

$ pip install pytregex

$ echo '(NP(DT The)(NN battery)(NN plant))' | python -m pytregex pattern 'NP < NN' -filter
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
# There were 2 matches in total.

$ echo '(NP(DT The)(NN battery)(NN plant))' > trees.txt
$ python -m pytregex pattern 'NP < NN' ./trees.txt
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
# There were 2 matches in total.

$ python -m pytregex pattern 'NP < NN' -C ./trees.txt
# 2

$ python -m pytregex pattern 'NP < NN=a' -h a ./trees.txt
# (NN battery)
# (NN plant)
# There were 2 matches in total.

$ python -m pytregex explain '<'
# 'A < B' means A immediately dominates B

$ python -m pytregex pprint '(NP(DT The)(NN battery)(NN plant))'
# NP
# ├── DT
# │   └── The
# ├── NN
# │   └── battery
# └── NN
#     └── plant
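The examples above report two matches for 'NP < NN' even though the tree has a single NP. This appears to be because each NN child that satisfies the relation counts as a separate match. The following is a toy sketch of that counting in plain Python; the helpers parse, subtrees, and immediately_dominates are hypothetical names written for this illustration and are not part of PyTregex.

```python
import re


def parse(s):
    """Parse a bracketed tree into (label, children) tuples; leaves have no children."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(node())
            pos += 1                      # consume ")"
            return (label, children)
        leaf = tokens[pos]                # a bare word such as "battery"
        pos += 1
        return (leaf, [])

    return node()


def subtrees(tree):
    """Yield the tree itself and every node below it, depth-first."""
    yield tree
    for child in tree[1]:
        yield from subtrees(child)


def immediately_dominates(tree, parent_label, child_label):
    """Return one (parent, child) pair per child satisfying 'parent < child'."""
    return [
        (parent, child)
        for parent in subtrees(tree)
        if parent[0] == parent_label
        for child in parent[1]
        if child[0] == child_label
    ]


tree = parse("(NP(DT The)(NN battery)(NN plant))")
pairs = immediately_dominates(tree, "NP", "NN")
print(len(pairs))  # 2: the same NP is paired once with each NN child
```

The sketch only handles plain label equality; real Tregex patterns also support regular expressions, negation, and many other relations.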

Inline

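from pytregex.tregex import TregexPattern

# "=a" names the NN node so that it can be retrieved after matching
tre = TregexPattern("NP < NN=a")
# one entry per match; the NP appears twice because it has two NN children
matches = tre.findall("(NP(DT The)(NN battery)(NN plant))")
# nodes bound to the name "a" during the matching above
handles = tre.get_nodes("a")
print("matches nodes:\n{}\n".format("\n".join(str(m) for m in matches)))
print("named nodes:\n{}".format("\n".join(str(h) for h in handles)))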

# Output:
# matches nodes:
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
#
# named nodes:
# (NN battery)
# (NN plant)

See tests for more examples.

Differences from Tregex

Tregex is whitespace-sensitive: it distinguishes between | and ␣|␣. PyTregex ignores whitespace and uses different symbols in place of ␣|␣.

                          Tregex        PyTregex
node disjunction          A|B           A|B, A␣|␣B
condition disjunction     A<B␣|␣<C      A<B␣||␣<C, A<B||<C
expression disjunction    A␣|␣B         N/A
expression separation     N/A           A;B, A␣;␣B

In the table above, expression disjunction and expression separation differ in whether "expressions stop evaluating as soon as the result is known." For example, with the Tregex pattern NP=a | NNP=b, if NP matches, b is not assigned even when the tree contains an NNP. With the PyTregex pattern NP=a ; NNP=b, b is assigned whenever an NNP is found, regardless of whether NP matches.
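The difference between the two behaviours can be sketched in plain Python. This is only an illustration of short-circuit versus independent evaluation over a flat list of node labels; find_label, disjunction, and separation are hypothetical helpers, not PyTregex or Tregex functions.

```python
def find_label(labels, wanted):
    """Return the first label equal to `wanted`, or None if it is absent."""
    return next((label for label in labels if label == wanted), None)


def disjunction(labels, named_patterns):
    """Short-circuit evaluation: stop at the first pattern that matches."""
    handles = {}
    for name, wanted in named_patterns:
        hit = find_label(labels, wanted)
        if hit is not None:
            handles[name] = hit
            break                 # later patterns are never evaluated
    return handles


def separation(labels, named_patterns):
    """Independent evaluation: every pattern is tried, whatever the others did."""
    handles = {}
    for name, wanted in named_patterns:
        hit = find_label(labels, wanted)
        if hit is not None:
            handles[name] = hit
    return handles


labels = ["NP", "NNP", "VP"]              # labels of the nodes in some tree
patterns = [("a", "NP"), ("b", "NNP")]    # stands in for NP=a and NNP=b

print(disjunction(labels, patterns))  # {'a': 'NP'}
print(separation(labels, patterns))   # {'a': 'NP', 'b': 'NNP'}
```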

Missing features

Backreferencing

$ tree='(NP NP , NP ,)'
$ pattern='(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)' 

$ echo "$tree" | tregex.sh "$pattern" -filter -s 2>/dev/null
# (NP NP , NP ,)

$ echo "$tree" | python -m pytregex pattern "$pattern" -filter
# (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
#                                              ˄
# Parsing error at token '='

Headfinders

PyTregex currently has only one HeadFinder, which is for English. If your patterns target trees in other languages and contain <#, >#, <<#, or >>#, they may not work as expected.

Variable groups

$ tree='(SBAR (WHNP-11 (WP who)) (S (NP-SBJ (-NONE- *T*-11)) (VP (VBD resigned))))' 
$ pattern='@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))' 

$ echo "$tree" | tregex.sh "$pattern" -filter 2>/dev/null
# (SBAR
#   (WHNP-11 (WP who))
#   (S
#     (NP-SBJ (-NONE- *T*-11))
#     (VP (VBD resigned))))

$ echo "$tree" | python -m pytregex pattern "$pattern" -filter
# Tokenization error: Illegal character "#"
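In the Tregex pattern above, #1%index binds capture group 1 of each regular expression to a variable named index, and the two bindings must be equal for the pattern to match. The check it performs on a pair of labels can be sketched with Python's re module; same_index is a hypothetical helper for illustration and is not a PyTregex feature or a replacement for variable groups.

```python
import re

# The same two regular expressions as in the Tregex pattern above
wh_pattern = re.compile(r"^WH.*-([0-9]+)$")
trace_pattern = re.compile(r"^\*T\*-([0-9]+)$")


def same_index(wh_label, trace_label):
    """True if both labels match and their first capture groups are equal."""
    wh = wh_pattern.match(wh_label)
    trace = trace_pattern.match(trace_label)
    return bool(wh and trace and wh.group(1) == trace.group(1))


print(same_index("WHNP-11", "*T*-11"))  # True: both capture "11"
print(same_index("WHNP-11", "*T*-12"))  # False: "11" != "12"
```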

Acknowledgments

Thanks to Galen Andrew, Roger Levy, Anna Rafferty, and John Bauer for their work on Tregex. One-third of PyTregex's code is translated directly from Tregex.

This program uses David Beazley's PLY (Python Lex-Yacc) for pattern tokenization and parsing.
