GitHub - tanloong/pytregex: Tregex-like constituency tree matcher

Tregex is a Java program for identifying patterns in constituency trees. PyTregex provides similar functionality in Python.

Usage

Command-line

Install it with pip and run it with python -m pytregex.

$ pip install pytregex

$ echo '(NP(DT The)(NN battery)(NN plant))' | python -m pytregex pattern 'NP < NN' -filter
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
# There were 2 matches in total.

$ echo '(NP(DT The)(NN battery)(NN plant))' > trees.txt
$ python -m pytregex pattern 'NP < NN' ./trees.txt
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
# There were 2 matches in total.

$ python -m pytregex pattern 'NP < NN' -C ./trees.txt
# 2

$ python -m pytregex pattern 'NP < NN=a' -h a ./trees.txt
# (NN battery)
# (NN plant)
# There were 2 matches in total.

$ python -m pytregex explain '<'
# 'A < B' means A immediately dominates B

$ python -m pytregex pprint '(NP(DT The)(NN battery)(NN plant))'
# NP
# ├── DT
# │   └── The
# ├── NN
# │   └── battery
# └── NN
#     └── plant
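
The 'NP < NN' examples above print the same NP twice because the NP immediately dominates two NN children, and each parent/child pairing counts as one match. The following is a minimal, self-contained sketch of that idea. It is a toy model, not PyTregex's implementation: Node, parse, walk, and immediately_dominates are hypothetical names introduced here for illustration.

```python
# Toy illustration only: this is not how PyTregex is implemented.
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def parse(text):
    """Parse one bracketed tree such as '(NP(DT The)(NN battery))'."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1  # skip "("
            node = Node(tokens[pos])
            pos += 1
            while tokens[pos] != ")":
                node.children.append(read())
            pos += 1  # skip ")"
            return node
        leaf = Node(tokens[pos])
        pos += 1
        return leaf

    return read()


def walk(node):
    """Yield every node in the tree, parents before children."""
    yield node
    for child in node.children:
        yield from walk(child)


def immediately_dominates(tree, parent_label, child_label):
    """Return one (parent, child) pair per way 'parent < child' is satisfied."""
    return [
        (node, child)
        for node in walk(tree)
        if node.label == parent_label
        for child in node.children
        if child.label == child_label
    ]


tree = parse("(NP(DT The)(NN battery)(NN plant))")
pairs = immediately_dominates(tree, "NP", "NN")
print(len(pairs))  # 2
print([child.children[0].label for _, child in pairs])  # ['battery', 'plant']
```

The two pairs share the same parent node, which corresponds to the NP being reported twice in the command-line output.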

Inline

from pytregex.tregex import TregexPattern

tre = TregexPattern("NP < NN=a")
matches = tre.findall("(NP(DT The)(NN battery)(NN plant))")
handles = tre.get_nodes("a")
print("matches nodes:\n{}\n".format("\n".join(str(m) for m in matches)))
print("named nodes:\n{}".format("\n".join(str(h) for h in handles)))

# Output:
# matches nodes:
# (NP (DT The) (NN battery) (NN plant))
# (NP (DT The) (NN battery) (NN plant))
#
# named nodes:
# (NN battery)
# (NN plant)

See tests for more examples.

Differences from Tregex

Tregex is whitespace-sensitive: it distinguishes between | and ␣|␣. PyTregex ignores whitespace and uses different symbols in place of ␣|␣.

| Operation | Tregex | PyTregex |
|---|---|---|
| node disjunction | `A\|B` | `A\|B` |
| node disjunction | `A\|B` | `A␣\|␣B` |
| condition disjunction | `A<B␣\|␣<C` | `A<B␣\|\|␣<C` |
| condition disjunction | `A<B␣\|␣<C` | `A<B\|\|<C` |
| expression disjunction | `A␣\|␣B` | N/A |
| expression separation | N/A | `A;B` |
| expression separation | N/A | `A␣;␣B` |

In the table above, expression disjunction and expression separation differ in whether "expressions stop evaluating as soon as the result is known." For example, in Tregex, NP=a | NNP=b does not assign b when NP matches, even if the tree contains an NNP. In PyTregex, NP=a ; NNP=b assigns b whenever an NNP is found, regardless of whether NP matches.
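
The difference between the two evaluation strategies can be sketched in plain Python. This is a toy model under simplifying assumptions (a tree is reduced to a set of labels, and an expression to a label/handle pair). The function names are hypothetical and do not come from PyTregex or Tregex.

```python
# Toy model of the two evaluation strategies; not PyTregex's or Tregex's code.

def evaluate_disjunction(labels, alternatives):
    """Short-circuit: stop at the first alternative that matches (Tregex ' | ')."""
    handles = {}
    for label, handle in alternatives:
        if label in labels:
            handles[handle] = label
            break  # result is known, so later alternatives are never evaluated
    return handles


def evaluate_separation(labels, expressions):
    """No short-circuit: evaluate every expression independently (PyTregex ';')."""
    handles = {}
    for label, handle in expressions:
        if label in labels:
            handles[handle] = label
    return handles


tree_labels = {"NP", "NNP"}  # labels present in a hypothetical tree
exprs = [("NP", "a"), ("NNP", "b")]

print(evaluate_disjunction(tree_labels, exprs))  # {'a': 'NP'}
print(evaluate_separation(tree_labels, exprs))   # {'a': 'NP', 'b': 'NNP'}
```

With short-circuiting, handle b is never assigned once NP has matched. With separation, both handles are assigned.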

Missing features

Backreferencing

$ tree='(NP NP , NP ,)'
$ pattern='(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)' 

$ echo "$tree" | tregex.sh "$pattern" -filter -s 2>/dev/null
# (NP NP , NP ,)

$ echo "$tree" | python -m pytregex pattern "$pattern" -filter
# (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
#                                              ˄
# Parsing error at token '='

Headfinders

PyTregex currently has only one HeadFinder, which is for English. If your patterns target trees in other languages and contain <#, >#, <<#, or >>#, they may not work as expected.

Variable groups

$ tree='(SBAR (WHNP-11 (WP who)) (S (NP-SBJ (-NONE- *T*-11)) (VP (VBD resigned))))' 
$ pattern='@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))' 

$ echo "$tree" | tregex.sh "$pattern" -filter 2>/dev/null
# (SBAR
#   (WHNP-11 (WP who))
#   (S
#     (NP-SBJ (-NONE- *T*-11))
#     (VP (VBD resigned))))

$ echo "$tree" | python -m pytregex pattern "$pattern" -filter
# Tokenization error: Illegal character "#"
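
In Tregex, the #1%index suffix binds capture group 1 of the regex to a variable named index and requires every node using that variable to capture the same string. Until PyTregex supports this, one possible workaround is to run the comparison yourself on node labels after matching. The sketch below uses only the standard re module on plain label strings. The helper name shared_index is hypothetical, and how to obtain label strings from PyTregex match objects is not shown because this README does not document it.

```python
import re

# Same regexes as in the Tregex pattern above, minus the '#1%index' suffix.
WH_LABEL = re.compile(r"^WH.*-([0-9]+)$")
TRACE_LABEL = re.compile(r"^\*T\*-([0-9]+)$")


def shared_index(wh_label, trace_label):
    """Return the index if both labels carry the same one, else None."""
    wh = WH_LABEL.match(wh_label)
    trace = TRACE_LABEL.match(trace_label)
    if wh and trace and wh.group(1) == trace.group(1):
        return wh.group(1)
    return None


print(shared_index("WHNP-11", "*T*-11"))  # 11
print(shared_index("WHNP-11", "*T*-12"))  # None
```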

Acknowledgments

Thanks to Galen Andrew, Roger Levy, Anna Rafferty, and John Bauer for their work on Tregex. One-third of PyTregex's code is translated directly from Tregex.

This program uses David Beazley's PLY (Python Lex-Yacc) for pattern tokenization and parsing.
