Update README
gremid committed Jan 20, 2025
1 parent 056290e commit 99dca42
Showing 1 changed file with 162 additions and 115 deletions.
277 changes: 162 additions & 115 deletions README.md
@@ -1,4 +1,4 @@
# DWDSmor – German morphology
# DWDSmor – German Morphology

![PyPI - Version](https://img.shields.io/pypi/v/dwdsmor)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dwdsmor)
@@ -20,8 +20,34 @@ The automata are compiled and traversed via
library and toolbox for finite-state transducers (FSTs). Their
coverage of the German language depends on

1. the DWDSmor grammar, defining the rules by which word formation happens, and
1. a lexicon, assigning inflection classes to lexical words.
1. the DWDSmor grammar, defining the rules by which word formation
happens, and
1. a lexicon, declaring inflection classes and other morphological
properties for covered lexical words.

The grammar, derived from
[SMORLemma](https://github.com/rsennrich/SMORLemma) and providing the
morphology for building automata from lexica, is common to all DWDSmor
installations and published as open source. In contrast, we provide
**multiple lexica** resulting in different editions of DWDSmor:

1. the **Open Edition** is based on a subset of the [DWDS
dictionary](https://www.dwds.de/), covering the most common word
forms and released freely with the grammar for general use and
experiments;
1. the **DWDS Edition**, which is derived from the complete lexical
dataset of the DWDS and available upon request for research
purposes.

Depending on the edition and word class, coverage ranges from 70% to
100%, with the notable exceptions of foreign-language words and named
entities: both classes are generally not part of the underlying DWDS
dictionary and are thus barely covered. Current overall coverage measured
against the [German Universal Dependencies
treebank](https://universaldependencies.org/treebanks/de_hdt/index.html)
is documented on the respective [Hugging Face Hub
page](https://huggingface.co/zentrum-lexikographie) of each edition.
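
As a rough illustration of what a coverage figure expresses, the
following sketch computes the share of tokens that receive at least one
analysis. This definition, the function `coverage` and the toy lookup
table are assumptions made for illustration; they are not the
measurement procedure used for the published numbers.

```python
from typing import Callable, Iterable


def coverage(tokens: Iterable[str], analyze: Callable[[str], list]) -> float:
    """Return the fraction of tokens for which `analyze` yields an analysis."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if analyze(token))
    return covered / len(tokens)


# Hypothetical stand-in for a real analyzer: a tiny lookup table.
toy_lexicon = {"getestet": ["testen", "getestet"], "testen": ["testen"]}


def toy_analyze(token: str) -> list:
    """Look the token up in the toy lexicon; unknown tokens get no analysis."""
    return toy_lexicon.get(token, [])


print(coverage(["getestet", "testen", "Berlin", "software"], toy_analyze))
```

In this toy setup, a named entity and a foreign-language word go
unanalyzed, which mirrors the exceptions mentioned above.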


## Usage

@@ -31,7 +57,7 @@ DWDSmor as a Python library is available via the package index PyPI:
pip install dwdsmor
```

For lemmatisation:
The library can be used for lemmatisation:

``` python-console
>>> import dwdsmor
@@ -40,152 +66,174 @@ For lemmatisation:
>>> assert lemmatizer("getestet", pos={"+ADJ"}) == "getestet"
```

Next to the Python API, the package provides a simple command line
interface named `dwdsmor`. To analyze a word form, pass it as an
argument:

```plaintext
$ dwdsmor getestet
| Wordform | Lemma | Analysis | POS | Degree | Function | Nonfinite | Tense | Auxiliary |
|------------|----------|-------------------------------------|-------|----------|------------|-------------|---------|-------------|
| getestet | getestet | ge<~>test<~>et<+ADJ><Pos><Pred/Adv> | +ADJ | Pos | Pred/Adv | | | |
| getestet | testen | test<~>en<+V><Part><Perf><haben> | +V | | | Part | Perf | haben |
```
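
The strings in the `Analysis` column can be post-processed with a few
lines of Python. The sketch below splits such a string into its surface
segments and its tags. It assumes that `<~>` marks a boundary between
segments and that every other `<…>` item is a tag; `parse_analysis` is
a hypothetical helper, not part of the DWDSmor API.

```python
import re

# Either a tag in angle brackets or a run of plain characters.
TOKEN = re.compile(r"<([^<>]+)>|([^<>]+)")


def parse_analysis(analysis: str):
    """Split an analysis string into (segments, tags).

    Assumption: "<~>" separates segments; all other "<...>" items are tags.
    """
    segments, tags = [], []
    for tag, text in TOKEN.findall(analysis):
        if text:
            segments.append(text)
        elif tag != "~":
            tags.append(tag)
    return segments, tags


print(parse_analysis("test<~>en<+V><Part><Perf><haben>"))
print(parse_analysis("ge<~>test<~>et<+ADJ><Pos><Pred/Adv>"))
```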

To generate all word forms for a lexical word, pass it (or a form
which can be analyzed as the lexical word) as an argument together
with the option `-g`:

``` plaintext
$ dwdsmor -g getestet
[…]
| Wordform | Lemma | Analysis | POS | Subcategory | Degree | Function | Person | Gender | Case | Number | Nonfinite | Tense | Mood | Auxiliary | Inflection |
|------------|----------|-------------------------------------------------------------|-------|---------------|----------|------------|----------|----------|--------|----------|-------------|---------|--------|-------------|--------------|
| getestete | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Acc | Sg | | | | | St |
| getestete | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Acc | Sg | | | | | Wk |
| getesteter | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Dat | Sg | | | | | St |
| getesteten | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Dat | Sg | | | | | Wk |
| getesteter | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><St> | +ADJ | | Pos | Attr/Subst | | Fem | Gen | Sg | | | | | St |
| getesteten | getestet | ge<~>test<~>et<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><Wk> | +ADJ | | Pos | Attr/Subst | | Fem | Gen | Sg | | | | | Wk |
[…]
| testeten | testen | test<~>en<+V><1><Pl><Past><Ind> | +V | | | | 1 | | | Pl | | Past | Ind | | |
| testeten | testen | test<~>en<+V><1><Pl><Past><Subj> | +V | | | | 1 | | | Pl | | Past | Subj | | |
| testen | testen | test<~>en<+V><1><Pl><Pres><Ind> | +V | | | | 1 | | | Pl | | Pres | Ind | | |
| testen | testen | test<~>en<+V><1><Pl><Pres><Subj> | +V | | | | 1 | | | Pl | | Pres | Subj | | |
| testete | testen | test<~>en<+V><1><Sg><Past><Ind> | +V | | | | 1 | | | Sg | | Past | Ind | | |
| testete | testen | test<~>en<+V><1><Sg><Past><Subj> | +V | | | | 1 | | | Sg | | Past | Subj | | |
| teste | testen | test<~>en<+V><1><Sg><Pres><Ind> | +V | | | | 1 | | | Sg | | Pres | Ind | | |
| teste | testen | test<~>en<+V><1><Sg><Pres><Subj> | +V | | | | 1 | | | Sg | | Pres | Subj | | |
| testetet | testen | test<~>en<+V><2><Pl><Past><Ind> | +V | | | | 2 | | | Pl | | Past | Ind | | |
[…]
```
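
Since the command line interface prints pipe-delimited tables, its
output can be read back into Python for further processing. The
following sketch parses such a table into dictionaries and groups the
distinct word forms by lemma. `parse_table` and `forms_by_lemma` are
hypothetical helpers, and the sample is an abridged table with only
three of the columns shown above.

```python
def parse_table(text: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited table into one dict per data row."""
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip prompts, blank lines and elision markers.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip the separator row that consists only of dashes.
        if all(cell and set(cell) <= {"-", ":"} for cell in cells):
            continue
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def forms_by_lemma(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Group distinct word forms under their lemma, keeping first-seen order."""
    grouped: dict[str, list[str]] = {}
    for row in rows:
        forms = grouped.setdefault(row["Lemma"], [])
        if row["Wordform"] not in forms:
            forms.append(row["Wordform"])
    return grouped


# Abridged sample in the same layout as the CLI output.
sample = """
| Wordform   | Lemma    | POS  |
|------------|----------|------|
| getestete  | getestet | +ADJ |
| getesteter | getestet | +ADJ |
| testeten   | testen   | +V   |
| testeten   | testen   | +V   |
"""

print(forms_by_lemma(parse_table(sample)))
```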

## Development

This repository provides source code for building DWDSmor lexica and transducers
as well as for using DWDSmor transducers for morphological analysis and paradigm
generation:

* `dwdsmor/` contains Python packages for using DWDSmor, including
scripts for morphological analysis and for paradigm generation by
means of DWDSmor transducers.
* `share/` contains XSLT stylesheets for extracting lexical entries in SMORLemma
format from XML sources of DWDS articles.
* `lexicon/dwds/` contains scripts for building DWDSmor lexica by means of the
XSLT stylesheets in `share/` and DWDS sources in `lexicon/dwds/wb/`, which are
not part of this repository.
* `lexicon/sample/` contains scripts for building sample DWDSmor lexica by means
of the XSLT stylesheets in `share/` and the sample lexicon in
`lexicon/sample/wb/`.
* `grammar/` contains an FST grammar derived from SMORLemma, providing the
morphology for building DWDSmor automata from DWDSmor lexica.
* `test/` implements a test suite for the DWDSmor transducers.

DWDSmor is in active development. In its current stage, DWDSmor supports most
inflection classes and some productive word-formation patterns of written
German. Note that the sample lexicon in `lexicon/sample/wb/` only covers a
sketchy subset of the German vocabulary, and so do the DWDSmor automata compiled
from it.


## Prerequisites

[GNU/Linux](https://www.debian.org/)
: Development, builds and tests of DWDSmor are performed
on [Debian GNU/Linux](https://debian.org/). While other UNIX-like operating
systems such as MacOS should work, too, they are not actively supported.

[Python >= v3.9](https://www.python.org/)
: DWDSmor targets Python as its primary runtime environment. The DWDSmor
transducers can be used via SFST's commandline tools, queried in Python
applications via language-specific
[bindings](https://github.com/gremid/sfst-transduce), or used by the Python
scripts `dwdsmor.py` and `paradigm.py` for morphological analysis and for
paradigm generation.

[Saxon-HE](https://www.saxonica.com/)
: The extraction of lexical entries from XML sources of DWDS articles is
implemented in XSLT 2, for which Saxon-HE is used as the runtime environment.

[Java (JDK) >= v8](https://openjdk.java.net/)
: Saxon requires a Java runtime.

[SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/)
: a C++ library and toolbox for finite-state transducers (FSTs); please take a
look at its homepage for installation and usage instructions.

On a Debian-based distribution, install the following packages:

```sh
apt install python3 default-jdk libsaxonhe-java sfst
DWDSmor is in active development. In its current stage, it supports
most inflection classes and some productive word-formation patterns of
written German.


### Prerequisites

* [GNU/Linux](https://www.debian.org/): Development, builds and tests
of DWDSmor are performed on [Debian
GNU/Linux](https://debian.org/). While other UNIX-like operating
systems such as MacOS should work, too, they are not actively
supported.
* [SFST](https://www.cis.uni-muenchen.de/~schmid/tools/SFST/): a C++
library and toolbox for finite-state transducers (FSTs); please take
a look at its homepage for installation and usage instructions.
* [Python >= v3.9](https://www.python.org/): DWDSmor targets Python as
its primary runtime environment. The DWDSmor transducers can be used
via SFST's commandline tools, queried in Python applications via
language-specific
[bindings](https://github.com/gremid/sfst-transduce), or used by the
Python scripts `dwdsmor.py` and `paradigm.py` for morphological
analysis and for paradigm generation.
* [Saxon-HE](https://www.saxonica.com/): The extraction of lexical
entries from XML sources of DWDS articles is implemented in XSLT 2,
for which Saxon-HE is used as the runtime environment. Saxon
requires [Java](https://openjdk.java.net/) as a runtime
environment.

On a Debian-based distribution, the following command installs the
required software:

```plaintext
apt-get install python3 default-jdk libsaxonhe-java sfst
```

Set up a virtual environment for project builds, for example via Python's `venv`:
### Project setup

```sh
Optionally, set up a Python virtual environment for project builds,
e.g. via Python's `venv`:

```plaintext
python3 -m venv .venv
source .venv/bin/activate
```

Then run the DWDSmor setup routine in order to install Python dependencies:
Then install DWDSmor, including development dependencies:

```sh
pip install -e .[dev]
```plaintext
pip install -U pip setuptools && pip install -e '.[dev]'
```


## Building DWDSmor lexica and transducers
### Building lexica and automata

For building DWDSmor lexica and transducers, run:
The script `build-dwdsmor` builds the different editions:

```sh
make all
```

Alternatively, you can run:
```plaintext
$ ./build-dwdsmor --help
usage: cli.py [-h] [--automaton AUTOMATON] [--force] [--with-metrics] [--release] [--tag]
[editions ...]
```sh
make dwds && make dwds-install && make dwdsmor
```

Note that these commands require DWDS sources in `lexicon/dwds/wb/`, which are
not part of this repository.
Build DWDSmor.
Alternatively, you can build sample DWDSmor lexica and transducers from the
sample lexicon in `lexicon/sample/wb/` by running:
positional arguments:
editions Editions to build (all by default)
```sh
make sample && make sample-install && make dwdsmor
options:
-h, --help show this help message and exit
--automaton AUTOMATON
Automaton type to build (all by default)
--force Force building (also current targets)
--with-metrics Measure UD/de-hdt coverage
--release Push automata to HF hub
--tag Tag HF hub release with current version
```

After building DWDSmor transducers, install them into `lib/`, where the
Python scripts `dwdsmor` and `dwdsmor-paradigm` expect them by default:
To build all editions available in the current git checkout, run:

```sh
make install
```plaintext
./build-dwdsmor
```

The installed DWDSmor transducers are:
The build result can be found in `build/` with one subdirectory per
edition. Each edition contains several automata types in standard and
compact format:

* `lib/dwdsmor.{a,ca}`: transducer with inflection and word-formation
components, for lemmatisation and morphological analysis of word forms in
terms of grammatical categories
* `lib/dwdsmor-morph.{a,ca}`: transducer with inflection and word-formation
components, for the generation of morphologically segmented word forms
* `lib/dwdsmor-finite.{a,ca}`: transducer with an inflection component and a

* `lemma.{a,ca}`: transducer with inflection and word-formation
components, for lemmatisation and morphological analysis of word
forms in terms of grammatical categories
* `morph.{a,ca}`: transducer with inflection and word-formation
components, for the generation of morphologically segmented word
forms
* `finite.{a,ca}`: transducer with an inflection component and a
finite word-formation component, for testing purposes
* `lib/dwdsmor-root.{a,ca}`: transducer with inflection and word-formation
components, for lexical analysis of word forms in terms of root lemmas (i.e.,
lemmas of ultimate word-formation bases), word-formation process,
word-formation means, and grammatical categories in terms of the
Pattern-and-Restriction Theory of word formation (Nolda 2022)
* `lib/dwdsmor-index.{a,ca}`: transducer with an inflection component only with
* `root.{a,ca}`: transducer with inflection and word-formation
components, for lexical analysis of word forms in terms of root
lemmas (i.e., lemmas of ultimate word-formation bases),
word-formation process, word-formation means, and grammatical
categories in terms of the Pattern-and-Restriction Theory of word
formation (Nolda 2022)
* `index.{a,ca}`: transducer with an inflection component only with
DWDS homographic lemma indices, for paradigm generation
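
To check which automata a build produced, the layout described above
(one subdirectory per edition, files named `<type>.a` and `<type>.ca`)
can be scanned with `pathlib`. This is a minimal sketch:
`list_automata` is a hypothetical helper, and the exact directory
names depend on the editions present in your checkout.

```python
from pathlib import Path

# Automaton types named in the list above.
AUTOMATON_TYPES = ("lemma", "morph", "finite", "root", "index")


def list_automata(build_dir: Path) -> dict[str, dict[str, list[str]]]:
    """Map edition name -> automaton type -> available formats ("a", "ca")."""
    result: dict[str, dict[str, list[str]]] = {}
    if not build_dir.is_dir():
        return result
    for edition_dir in sorted(p for p in build_dir.iterdir() if p.is_dir()):
        found = {}
        for name in AUTOMATON_TYPES:
            formats = [
                ext for ext in ("a", "ca")
                if (edition_dir / f"{name}.{ext}").is_file()
            ]
            if formats:
                found[name] = formats
        result[edition_dir.name] = found
    return result


# Prints an empty dict if no build directory exists yet.
print(list_automata(Path("build")))
```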


## Testing DWDSmor
### Testing

Run
To test basic transducer usage and check for potential regressions, run

pytest

in order to test basic transducer usage and for potential regressions.

## Contact
## License

Feel free to contact [Andreas Nolda](mailto:[email protected]) for
questions regarding the lexicon or the grammar and
[Gregor Middell](mailto:[email protected]) for questions related
to the integration of DWDSmor into your corpus-annotation pipeline.
Like the original SMOR and SMORLemma grammars, the DWDSmor grammar and
Python library are licensed under the GNU General Public License
v2.0. The same applies to the open edition of the DWDSmor lexicon.

For the DWDS edition based on the complete DWDS dictionary, all rights
are reserved and individual license terms apply. If you are interested
in the DWDS edition, please contact us.

## License
## Contact

As the original SMOR and SMORLemma grammars, the DWDSmor grammar is
licensed under the GNU General Public Licence v2.0. The same applies
to the rest of this project.
Feel free to contact [Andreas Nolda](mailto:[email protected]) for any
question about this project.

## Credits

@@ -202,12 +250,11 @@ DWDSmor is based on the following software and datasets:
(Fitschen 2004) as the lexical data source for German words, their grammatical
categories, and their morphological properties.

## Bibliography
## References

* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.).
DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur
deutschen Sprache in Geschichte und Gegenwart.
https://www.dwds.de
deutschen Sprache in Geschichte und Gegenwart. [Online](https://www.dwds.de/)
* Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes
System. Ph.D. thesis, Universität Stuttgart.
[PDF](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/IMSLex/fitschendiss.pdf)