Commit
Showing 1 changed file with 162 additions and 115 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
# DWDSmor – German morphology | ||
# DWDSmor – German Morphology | ||
|
||
 | ||
 | ||
|
@@ -20,8 +20,34 @@ The automata are compiled and traversed via | |
library and toolbox for finite-state transducers (FSTs). Their | ||
coverage of the German language depends on | ||
|
||
1. the DWDSmor grammar, defining the rules by which word formation happens, and | ||
1. a lexicon, assigning inflection classes to lexical words. | ||
1. the DWDSmor grammar, defining the rules by which word formation | ||
happens, and | ||
1. a lexicon, declaring inflection classes and other morphological | ||
properties for covered lexical words. | ||
|
||
The grammar, derived from | ||
[SMORLemma](https://github.com/rsennrich/SMORLemma) and providing the | ||
morphology for building automata from lexica, is common to all DWDSmor | ||
installations and published as open source. In contrast, we provide | ||
**multiple lexica** resulting in different editions of DWDSmor: | ||
|
||
1. the **Open Edition** is based on a subset of the [DWDS | ||
dictionary](https://www.dwds.de/), covering the most common word | ||
forms and released freely with the grammar for general use and | ||
experiments; | ||
1. the **DWDS Edition**, which is derived from the complete lexical | ||
dataset of the DWDS and available upon request for research | ||
purposes. | ||
|
||
Depending on the edition and word class, coverage ranges from 70% to | ||
100%, with the notable exceptions of foreign-language words and named | ||
entities: both classes are generally not part of the underlying DWDS | ||
dictionary and are thus barely covered. Current overall coverage measured | ||
against the [German Universal Dependencies | ||
treebank](https://universaldependencies.org/treebanks/de_hdt/index.html) | ||
is documented on the respective [Hugging Face Hub | ||
page](https://huggingface.co/zentrum-lexikographie) of each edition. | ||
|
||
|
||
## Usage | ||
|
||
|
@@ -31,7 +57,7 @@ DWDSmor as a Python library is available via the package index PyPI: | |
pip install dwdsmor | ||
``` | ||
|
||
For lemmatisation: | ||
The library can be used for lemmatisation: | ||
|
||
``` python-console | ||
>>> import dwdsmor | ||
|
@@ -40,152 +66,174 @@ For lemmatisation: | |
>>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet" | ||
``` | ||
|
||
… | ||
In addition to the Python API, the package provides a simple command line | ||
interface named `dwdsmor`. To analyze a word form, pass it as an | ||
argument: | ||
|
||
```plaintext | ||
$ dwdsmor getestet | ||
| Wordform | Lemma | Analysis | POS | Degree | Function | Nonfinite | Tense | Auxiliary | | ||
|------------|----------|-------------------------------------|-------|----------|------------|-------------|---------|-------------| | ||
| getestet | getestet | ge<~>test<~>et<+ADJ><Pos><Pred/Adv> | +ADJ | Pos | Pred/Adv | | | | | ||
| getestet | testen | test<~>en<+V><Part><Perf><haben> | +V | | | Part | Perf | haben | | ||
``` | ||
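The `Analysis` column packs the segmented lemma and its tags into a single string. As a minimal sketch, independent of the `dwdsmor` package, such a string could be taken apart with a regular expression. The helper name `split_analysis` is hypothetical, and the code assumes that `<~>` marks a segment boundary inside the lemma while every other `<...>` group is a tag:

```python
import re

# Assumption: "<~>" marks a segment boundary inside the lemma, while every
# other "<...>" group is a tag such as "<+V>" or "<Perf>".
BOUNDARY = "<~>"
TAG = re.compile(r"<([^<>]+)>")


def split_analysis(analysis: str) -> tuple[list[str], list[str]]:
    """Split an analysis string into lemma segments and tags."""
    # Replace boundaries with a placeholder so the tag pattern skips them.
    protected = analysis.replace(BOUNDARY, "\x00")
    tags = TAG.findall(protected)
    # Whatever remains after removing the tags is the segmented lemma.
    lemma_part = TAG.sub("", protected)
    segments = lemma_part.split("\x00")
    return segments, tags


segments, tags = split_analysis("test<~>en<+V><Part><Perf><haben>")
print("".join(segments), tags)
```

Joining the segments yields the plain lemma, and the tag list can be compared against the other columns of the table.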
|
||
To generate all word forms for a lexical word, pass it (or a form | ||
which can be analyzed as the lexical word) as an argument together | ||
with the option `-g`: | ||
|
||
``` plaintext | ||
$ dwdsmor -g getestet | ||
[…] | ||
| Wordform | Lemma | Analysis | POS | Subcategory | Degree | Function | Person | Gender | Case | Number | Nonfinite | Tense | Mood | Auxiliary | Inflection | | ||
|------------|----------|-------------------------------------------------------------|-------|---------------|----------|------------|----------|----------|--------|----------|-------------|---------|--------|-------------|--------------| | ||
| getestete | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Acc | Sg | | | | | St | | ||
| getestete | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Acc | Sg | | | | | Wk | | ||
| getesteter | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Dat | Sg | | | | | St | | ||
| getesteten | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Dat | Sg | | | | | Wk | | ||
| getesteter | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Gen | Sg | | | | | St | | ||
| getesteten | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Gen | Sg | | | | | Wk | | ||
[…] | ||
| testeten | testen | test<~>en<+V><1><Pl><Past><Ind> | +V | | | | 1 | | | Pl | | Past | Ind | | | | ||
| testeten | testen | test<~>en<+V><1><Pl><Past><Subj> | +V | | | | 1 | | | Pl | | Past | Subj | | | | ||
| testen | testen | test<~>en<+V><1><Pl><Pres><Ind> | +V | | | | 1 | | | Pl | | Pres | Ind | | | | ||
| testen | testen | test<~>en<+V><1><Pl><Pres><Subj> | +V | | | | 1 | | | Pl | | Pres | Subj | | | | ||
| testete | testen | test<~>en<+V><1><Sg><Past><Ind> | +V | | | | 1 | | | Sg | | Past | Ind | | | | ||
| testete | testen | test<~>en<+V><1><Sg><Past><Subj> | +V | | | | 1 | | | Sg | | Past | Subj | | | | ||
| teste | testen | test<~>en<+V><1><Sg><Pres><Ind> | +V | | | | 1 | | | Sg | | Pres | Ind | | | | ||
| teste | testen | test<~>en<+V><1><Sg><Pres><Subj> | +V | | | | 1 | | | Sg | | Pres | Subj | | | | ||
| testetet | testen | test<~>en<+V><2><Pl><Past><Ind> | +V | | | | 2 | | | Pl | | Past | Ind | | | | ||
[…] | ||
``` | ||
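The tabular output lends itself to post-processing. The following sketch reads such a table into one dictionary per row, for example to collect all word forms of a lemma. It is plain Python and not part of DWDSmor: `parse_table` is a hypothetical helper, and the code assumes the pipe-separated layout shown above, skipping any line that does not start with `|` (such as the prompt or `[…]`).

```python
def parse_table(text: str) -> list[dict[str, str]]:
    """Parse pipe-separated table output into one dict per data row."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # prompt, ellipsis, or blank line
        # Drop exactly one outer pipe on each side, then split into cells.
        cells = [c.strip() for c in line[1:-1].split("|")]
        if all(c and set(c) <= {"-"} for c in cells):
            continue  # separator row made of dashes
        if header is None:
            header = cells  # the first table row names the columns
            continue
        rows.append(dict(zip(header, cells)))
    return rows


SAMPLE = """
| Wordform | Lemma  | POS | Person | Number | Tense | Mood |
|----------|--------|-----|--------|--------|-------|------|
| testeten | testen | +V  | 1      | Pl     | Past  | Ind  |
| teste    | testen | +V  | 1      | Sg     | Pres  | Ind  |
"""

# Collect the distinct word forms found in the table.
forms = sorted({row["Wordform"] for row in parse_table(SAMPLE)})
print(forms)
```

Empty cells come back as empty strings, so categories that do not apply to a row can be filtered out with a simple truthiness check.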
|
||
## Development | ||
|
||
This repository provides source code for building DWDSmor lexica and transducers | ||
as well as for using DWDSmor transducers for morphological analysis and paradigm | ||
generation: | ||
|
||
* `dwdsmor/` contains Python packages for using DWDSmor, including | ||
scripts for morphological analysis and for paradigm generation by | ||
means of DWDSmor transducers. | ||
* `share/` contains XSLT stylesheets for extracting lexical entries in SMORLemma | ||
format from XML sources of DWDS articles. | ||
* `lexicon/dwds/` contains scripts for building DWDSmor lexica by means of the | ||
XSLT stylesheets in `share/` and DWDS sources in `lexicon/dwds/wb/`, which are | ||
not part of this repository. | ||
* `lexicon/sample/` contains scripts for building sample DWDSmor lexica by means | ||
of the XSLT stylesheets in `share/` and the sample lexicon in | ||
`lexicon/sample/wb/`. | ||
* `grammar/` contains an FST grammar derived from SMORLemma, providing the | ||
morphology for building DWDSmor automata from DWDSmor lexica. | ||
* `test/` implements a test suite for the DWDSmor transducers. | ||
|
||
DWDSmor is in active development. In its current stage, DWDSmor supports most | ||
inflection classes and some productive word-formation patterns of written | ||
German. Note that the sample lexicon in `lexicon/sample/wb/` only covers a | ||
sketchy subset of the German vocabulary, and so do the DWDSmor automata compiled | ||
from it. | ||
|
||
|
||
## Prerequisites | ||
|
||
[GNU/Linux](https://www.debian.org/) | ||
: Development, builds and tests of DWDSmor are performed | ||
on [Debian GNU/Linux](https://debian.org/). While other UNIX-like operating | ||
systems such as MacOS should work, too, they are not actively supported. | ||
|
||
[Python >= v3.9](https://www.python.org/) | ||
: DWDSmor targets Python as its primary runtime environment. The DWDSmor | ||
transducers can be used via SFST's commandline tools, queried in Python | ||
applications via language-specific | ||
[bindings](https://github.com/gremid/sfst-transduce), or used by the Python | ||
scripts `dwdsmor.py` and `paradigm.py` for morphological analysis and for | ||
paradigm generation. | ||
|
||
[Saxon-HE](https://www.saxonica.com/) | ||
: The extraction of lexical entries from XML sources of DWDS articles is | ||
implemented in XSLT 2, for which Saxon-HE is used as the runtime environment. | ||
|
||
[Java (JDK) >= v8](https://openjdk.java.net/) | ||
: Saxon requires a Java runtime. | ||
|
||
[SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/) | ||
: a C++ library and toolbox for finite-state transducers (FSTs); please take a | ||
look at its homepage for installation and usage instructions. | ||
|
||
On a Debian-based distribution, install the following packages: | ||
|
||
```sh | ||
apt install python3 default-jdk libsaxonhe-java sfst | ||
DWDSmor is in active development. In its current stage, it supports | ||
most inflection classes and some productive word-formation patterns of | ||
written German. | ||
|
||
|
||
### Prerequisites | ||
|
||
* [GNU/Linux](https://www.debian.org/): Development, builds and tests | ||
of DWDSmor are performed on [Debian | ||
GNU/Linux](https://debian.org/). While other UNIX-like operating | ||
systems such as macOS should work, too, they are not actively | ||
supported. | ||
* [SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/): a C++ | ||
library and toolbox for finite-state transducers (FSTs); please take | ||
a look at its homepage for installation and usage instructions. | ||
* [Python >= v3.9](https://www.python.org/): DWDSmor targets Python as | ||
its primary runtime environment. The DWDSmor transducers can be used | ||
via SFST's commandline tools, queried in Python applications via | ||
language-specific | ||
[bindings](https://github.com/gremid/sfst-transduce), or used by the | ||
Python scripts `dwdsmor.py` and `paradigm.py` for morphological | ||
analysis and for paradigm generation. | ||
* [Saxon-HE](https://www.saxonica.com/): The extraction of lexical | ||
entries from XML sources of DWDS articles is implemented in XSLT 2, | ||
for which Saxon-HE is used as the runtime environment. Saxon | ||
requires [Java](https://openjdk.java.net/) as a runtime | ||
environment. | ||
|
||
On a Debian-based distribution, the following command installs the | ||
required software: | ||
|
||
```plaintext | ||
apt-get install python3 default-jdk libsaxonhe-java sfst | ||
``` | ||
|
||
Set up a virtual environment for project builds, for example via Python's `venv`: | ||
### Project setup | ||
|
||
```sh | ||
Optionally, set up a Python virtual environment for project builds, | ||
e.g. via Python's `venv`: | ||
|
||
```plaintext | ||
python3 -m venv .venv | ||
source .venv/bin/activate | ||
``` | ||
|
||
Then run the DWDSmor setup routine in order to install Python dependencies: | ||
Then install DWDSmor, including development dependencies: | ||
|
||
```sh | ||
pip install -e .[dev] | ||
```plaintext | ||
pip install -U pip setuptools && pip install -e '.[dev]' | ||
``` | ||
|
||
|
||
## Building DWDSmor lexica and transducers | ||
### Building lexica and automata | ||
|
||
For building DWDSmor lexica and transducers, run: | ||
Different editions can be built with the script `build-dwdsmor`: | ||
|
||
```sh | ||
make all | ||
``` | ||
|
||
Alternatively, you can run: | ||
```plaintext | ||
$ ./build-dwdsmor --help | ||
usage: cli.py [-h] [--automaton AUTOMATON] [--force] [--with-metrics] [--release] [--tag] | ||
[editions ...] | ||
```sh | ||
make dwds && make dwds-install && make dwdsmor | ||
``` | ||
|
||
Note that these commands require DWDS sources in `lexicon/dwds/wb/`, which are | ||
not part of this repository. | ||
Build DWDSmor. | ||
Alternatively, you can build sample DWDSmor lexica and transducers from the | ||
sample lexicon in `lexicon/sample/wb/` by running: | ||
positional arguments: | ||
editions Editions to build (all by default) | ||
```sh | ||
make sample && make sample-install && make dwdsmor | ||
options: | ||
-h, --help show this help message and exit | ||
--automaton AUTOMATON | ||
Automaton type to build (all by default) | ||
--force Force building (also current targets) | ||
--with-metrics Measure UD/de-hdt coverage | ||
--release Push automata to HF hub | ||
--tag Tag HF hub release with current version | ||
``` | ||
|
||
After building DWDSmor transducers, install them into `lib/`, where the | ||
Python scripts `dwdsmor` and `dwdsmor-paradigm` expect them by default: | ||
To build all editions available in the current git checkout, run: | ||
|
||
```sh | ||
make install | ||
```plaintext | ||
./build-dwdsmor | ||
``` | ||
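When builds are driven from another Python script, the options listed in the help output can be assembled programmatically. The sketch below only constructs the argument list and does not run anything. `build_command` is a hypothetical helper, and the edition name `open` and the automaton value `lemma` are assumptions for illustration, since the accepted values are not listed in the help text. The `--release` and `--tag` options are deliberately left out because they push to the Hugging Face Hub.

```python
from typing import Optional, Sequence


def build_command(
    editions: Sequence[str] = (),
    automaton: Optional[str] = None,
    force: bool = False,
    with_metrics: bool = False,
) -> list[str]:
    """Assemble an argument list for ./build-dwdsmor from the documented options."""
    cmd = ["./build-dwdsmor"]
    if automaton is not None:
        cmd += ["--automaton", automaton]  # restrict the build to one automaton type
    if force:
        cmd.append("--force")  # rebuild targets that are already current
    if with_metrics:
        cmd.append("--with-metrics")  # measure UD/de-hdt coverage
    cmd.extend(editions)  # no editions means "all by default"
    return cmd


# "open" and "lemma" are assumed values, not confirmed by the help output.
print(build_command(["open"], automaton="lemma", force=True))
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)` from the repository root.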
|
||
The installed DWDSmor transducers are: | ||
The build result can be found in `build/` with one subdirectory per | ||
edition. Each edition contains several automata types in standard and | ||
compact format: | ||
|
||
* `lib/dwdsmor.{a,ca}`: transducer with inflection and word-formation | ||
components, for lemmatisation and morphological analysis of word forms in | ||
terms of grammatical categories | ||
* `lib/dwdsmor-morph.{a,ca}`: transducer with inflection and word-formation | ||
components, for the generation of morphologically segmented word forms | ||
* `lib/dwdsmor-finite.{a,ca}`: transducer with an inflection component and a | ||
|
||
* `lemma.{a,ca}`: transducer with inflection and word-formation | ||
components, for lemmatisation and morphological analysis of word | ||
forms in terms of grammatical categories | ||
* `morph.{a,ca}`: transducer with inflection and word-formation | ||
components, for the generation of morphologically segmented word | ||
forms | ||
* `finite.{a,ca}`: transducer with an inflection component and a | ||
finite word-formation component, for testing purposes | ||
* `lib/dwdsmor-root.{a,ca}`: transducer with inflection and word-formation | ||
components, for lexical analysis of word forms in terms of root lemmas (i.e., | ||
lemmas of ultimate word-formation bases), word-formation process, | ||
word-formation means, and grammatical categories in term of the | ||
Pattern-and-Restriction Theory of word formation (Nolda 2022) | ||
* `lib/dwdsmor-index.{a,ca}`: transducer with an inflection component only with | ||
* `root.{a,ca}`: transducer with inflection and word-formation | ||
components, for lexical analysis of word forms in terms of root | ||
lemmas (i.e., lemmas of ultimate word-formation bases), | ||
word-formation process, word-formation means, and grammatical | ||
categories in terms of the Pattern-and-Restriction Theory of word | ||
formation (Nolda 2022) | ||
* `index.{a,ca}`: transducer with an inflection component only with | ||
DWDS homographic lemma indices, for paradigm generation | ||
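To check that a build produced every automaton, the layout described above can be turned into a list of expected paths. This is a sketch under stated assumptions: the helper names are hypothetical, the edition subdirectory name `open` is a guess, and the mapping of `.a` to the standard format and `.ca` to the compact format follows the usual SFST convention rather than anything stated here.

```python
from pathlib import Path

# Automaton types as listed above.
AUTOMATON_TYPES = ("lemma", "morph", "finite", "root", "index")
# Assumption: ".a" is the standard format and ".ca" the compact one.
FORMATS = ("a", "ca")


def expected_automata(build_dir: str, edition: str) -> list[Path]:
    """List the automaton files an edition's build directory should contain."""
    edition_dir = Path(build_dir) / edition
    return [
        edition_dir / f"{name}.{ext}"
        for name in AUTOMATON_TYPES
        for ext in FORMATS
    ]


def missing_automata(build_dir: str, edition: str) -> list[Path]:
    """Return the expected automaton files that do not exist on disk."""
    return [p for p in expected_automata(build_dir, edition) if not p.is_file()]


# "open" is a hypothetical edition directory name.
for path in missing_automata("build", "open"):
    print("missing:", path)
```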
|
||
|
||
## Testing DWDSmor | ||
### Testing | ||
|
||
Run | ||
To test basic transducer usage and check for potential regressions, run | ||
|
||
pytest | ||
|
||
in order to test basic transducer usage and for potential regressions. | ||
|
||
## Contact | ||
## License | ||
|
||
Feel free to contact [Andreas Nolda](mailto:[email protected]) for | ||
questions regarding the lexicon or the grammar and | ||
[Gregor Middell](mailto:[email protected]) for question related | ||
to the integration of DWDSmor into your corpus-annotation pipeline. | ||
Like the original SMOR and SMORLemma grammars, the DWDSmor grammar and | ||
Python library are licensed under the GNU General Public License | ||
v2.0. The same applies to the open edition of the DWDSmor lexicon. | ||
|
||
For the DWDS edition based on the complete DWDS dictionary, all rights | ||
are reserved and individual license terms apply. If you are interested | ||
in the DWDS edition, please contact us. | ||
|
||
## License | ||
## Contact | ||
|
||
As the original SMOR and SMORLemma grammars, the DWDSmor grammar is | ||
licensed under the GNU General Public Licence v2.0. The same applies | ||
to the rest of this project. | ||
Feel free to contact [Andreas Nolda](mailto:[email protected]) for any | ||
question about this project. | ||
|
||
## Credits | ||
|
||
|
@@ -202,12 +250,11 @@ DWDSmor is based on the following software and datasets: | |
(Fitschen 2004) as the lexical data source for German words, their grammatical | ||
categories, and their morphological properties. | ||
|
||
## Bibliography | ||
## References | ||
|
||
* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.). | ||
DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur | ||
deutschen Sprache in Geschichte und Gegenwart. | ||
https://www.dwds.de | ||
deutschen Sprache in Geschichte und Gegenwart. [Online](https://www.dwds.de/) | ||
* Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes | ||
System. Ph.D. thesis, Universität Stuttgart. | ||
[PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf) | ||
|