Merge pull request #3 from chartes/update-tooling-and-pathing
Update tooling and pathing
PonteIneptique authored Apr 28, 2023
2 parents 8e8bd98 + 0e2c873 commit 431ef41
Showing 79 changed files with 1,458 additions and 84 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
*.tar
*.csv
*.logs
164 changes: 163 additions & 1 deletion README.md
@@ -11,7 +11,7 @@
- \[DocLing\]: Gleßgen, Martin Dietrich (dir.), et al., _Les plus anciens documents linguistiques de la France_, 2016, [http://www.rose.uzh.ch/docling/](http://www.rose.uzh.ch/docling/), 3e édition.
- \[Geste\]: Camps, Jean-Baptiste (dir.), _Geste: un corpus de chansons de geste_, 2016-… (v02), École nationale des chartes, Paris, 2019, [http://doi.org/10.5281/zenodo.2630574](http://doi.org/10.5281/zenodo.2630574), textes du domaine public, développements CC-BY-SA.
- \[Lancelot\]: Ing, Lucence, _Disparitions lexicales en diachronie: traitements automatiques sur le Lancelot en prose_, thèse de doct. en préparation, dir. F. Duval, codir. J.B. Camps, École nationale des chartes, Université PSL, Paris.
- \[WauchierSConf\] Pinche, Ariane, _Édition nativement numérique du recueil hagiographique ‘Li Seint Confessor’ de Wauchier de Denain d’après le manuscrit fr. 412 de la Bibliothèque nationale de France_, thèse de doctorat dir. C. Pierreville et B. Bureau, Université de Lyon, Lyon, 2021.


The \[Varia\] are short excerpts taken from the work of students at the École des chartes, annotated in 2020 as part of the evaluation of the course _initiation à la philologie romane: introduction au moyen français_, given by Lucence Ing and Jean-Baptiste Camps (thematic dossier on the plague and medicine, during the first lockdown of the COVID-19 pandemic in 2020).
@@ -25,3 +25,165 @@ From the ed. by Nicaise, Edouard (1890) p. 167 ff
- Poésies de Gilles li Muisis, published for the first time, according to the manuscript of Lord Ashburnham by baron Kervyn de Lettenhove, Louvain, 1882, https://archive.org/details/posiesdegilles01lemuuoft/page/78/mode/2up,


## Statistics (2023-04-26)


### Token, Lemma and POS counts

| Category | Different | Total | Values with 1 occurrence only |
|------------|-------------|-----------|---------------------------------|
| Forms | 47,661 | 1,183,960 | 23,851 |
| Lemma | 11,295 | 1,183,960 | 3,852 |
| POS | 66 | 1,183,960 | 6 |

### Morphology counts

*Non-x* values are those where the category actually applies to the token. A verb, for example, always has a DEGRE annotation of x, because verbs cannot carry DEGRE.

| Category | Different | Total | Non-x values |
|------------|-------------|---------|----------------|
| Mode | 6 | 478,657 | 60,740 |
| Temps | 5 | 478,657 | 57,367 |
| Personne | 5 | 478,657 | 106,566 |
| Nombre | 3 | 478,657 | 290,326 |
| Genre | 4 | 478,657 | 226,996 |
| Cas | 4 | 478,657 | 229,586 |
| Degre | 5 | 478,657 | 42,949 |
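
The "Non-x values" column can be recomputed from the per-category tables in this section. The following is a minimal sketch in Python, assuming the counts are available as a plain mapping; the helper name `summarize` is hypothetical and not part of the repository tooling. The figures are the ones given in the Degre table.

```python
# Hypothetical helper: derive one row of the "Morphology counts" table
# from a value -> count mapping for a single category.
def summarize(counts):
    different = len(counts)                      # number of distinct values
    total = sum(counts.values())                 # all annotated tokens
    non_x = sum(n for value, n in counts.items()
                if not value.endswith("=x"))     # category really applies
    return different, total, non_x


# Counts copied from the Degre table of this README.
DEGRE = {
    "DEGRE=x": 435_708,
    "DEGRE=-": 24_947,
    "DEGRE=p": 16_622,
    "DEGRE=c": 910,
    "DEGRE=s": 470,
}

print(summarize(DEGRE))
```

The result matches the Degre row above: 5 distinct values, 478,657 tokens in total, 42,949 of them non-x.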

### POS

| Value | Count |
|---------------|---------|
| NOMcom | 160,410 |
| VERcjg | 156,630 |
| PROper | 96,533 |
| PRE | 91,586 |
| PONfbl | 79,784 |
| ADVgen | 79,578 |
| CONcoo | 66,658 |
| DETdef | 57,655 |
| PONfrt | 42,489 |
| CONsub | 40,120 |
| VERppe | 35,647 |
| ADJqua | 31,675 |
| VERinf | 28,218 |
| NOMpro | 27,872 |
| ADVneg | 25,947 |
| PROrel | 25,542 |
| DETpos | 22,367 |
| PROadv | 15,003 |
| PRE.DETdef | 14,836 |
| PROdem | 14,327 |
| PROind | 11,661 |
| DETind | 10,985 |
| PONpga | 7,707 |
| DETndf | 7,076 |
| DETdem | 6,057 |
| PONpdr | 4,842 |
| DETcar | 3,229 |
| VERppa | 2,784 |
| ADJind | 2,575 |
| PROimp | 2,036 |
| PROcar | 1,855 |
| ADJcar | 1,277 |
| ADJpos | 1,049 |
| PROint | 1,014 |
| PONpxx | 1,012 |
| ADVneg.PROper | 952 |
| PROpos | 669 |
| ADJord | 636 |
| ADVsub | 592 |
| INJ | 549 |
| ADVint | 506 |
| DETrel | 448 |
| PROord | 327 |
| PROper.PROper | 311 |
| ADVgen.PROper | 271 |
| DETint | 225 |
| PRE.PROdem | 151 |
| DETcom | 52 |
| PRE.PROper | 47 |
| PROrel.PROper | 46 |
| RED | 34 |
| ETR | 33 |
| CONsub.PROper | 18 |
| ADVgen.CONsub | 16 |
| PRE.DETcom | 12 |
| DETord | 8 |
| ADJqua.NOMcom | 7 |
| PRE.PROrel | 4 |
| ADVing | 2 |
| ADVneg.PROadv | 2 |
| PROint.PROper | 1 |
| CONsubs | 1 |
| ADVgen.PROadv | 1 |
| NomPro | 1 |
| PRE.DETrel | 1 |
| CONsub.DETdef | 1 |

### Mode

| Value | Count |
|-----------|---------|
| MODE=x | 417,917 |
| MODE=ind | 51,951 |
| MODE=sub | 5,416 |
| MODE=imp | 2,061 |
| MODE=con | 1,311 |
| MODE=cond | 1 |

### Temps

| Value | Count |
|-----------|---------|
| TEMPS=x | 421,290 |
| TEMPS=pst | 29,150 |
| TEMPS=psp | 14,882 |
| TEMPS=ipf | 9,012 |
| TEMPS=fut | 4,323 |

### Personne

| Value | Count |
|---------|---------|
| PERS.=x | 372,091 |
| PERS.=3 | 76,497 |
| PERS.=1 | 18,377 |
| PERS.=2 | 11,455 |
| PERS.=0 | 237 |

### Nombre

| Value | Count |
|---------|---------|
| NOMB.=s | 218,952 |
| NOMB.=x | 188,331 |
| NOMB.=p | 71,374 |

### Genre

| Value | Count |
|---------|---------|
| GENRE=x | 251,661 |
| GENRE=m | 155,955 |
| GENRE=f | 63,962 |
| GENRE=n | 7,079 |

### Cas

| Value | Count |
|---------|---------|
| CAS=x | 249,071 |
| CAS=r | 145,693 |
| CAS=n | 75,652 |
| CAS=i | 8,241 |

### Degre

| Value | Count |
|---------|---------|
| DEGRE=x | 435,708 |
| DEGRE=- | 24,947 |
| DEGRE=p | 16,622 |
| DEGRE=c | 910 |
| DEGRE=s | 470 |
3 changes: 3 additions & 0 deletions tooling/.gitignore
@@ -0,0 +1,3 @@
env
output-*
*memory.csv
2 changes: 2 additions & 0 deletions tooling/00-install.sh
@@ -0,0 +1,2 @@
virtualenv env -p python3
env/bin/pip install -r requirements.txt
5 changes: 5 additions & 0 deletions tooling/01-build.sh
@@ -0,0 +1,5 @@
rm -r output-*
env/bin/protogenie build config-lemma-pos.xml --output output-lemma-pos -t .98 -d .02 -e 0 --verbose
env/bin/protogenie concat config-lemma-pos.xml output-lemma-pos
env/bin/protogenie build config-morph.xml --output output-morph -t .98 -d .02 -e 0 --verbose
env/bin/protogenie concat config-morph.xml output-morph
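
A note on the `build` flags: the sketch below assumes that `-t`, `-d` and `-e` are the train, dev and test ratios (here 0.98, 0.02 and 0), which this diff does not itself confirm. Under that assumption, it shows how many passages would land in each set. `split_counts` is a hypothetical helper, not protogenie code, and the real allocation may differ (for example, it may be randomized per file).

```python
# Hypothetical illustration of a 98/2/0 split; NOT protogenie's implementation.
def split_counts(n_passages, train=0.98, dev=0.02, test=0.0):
    if abs(train + dev + test - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    n_dev = round(n_passages * dev)
    n_test = round(n_passages * test)
    # Whatever is left after dev and test goes to train.
    n_train = n_passages - n_dev - n_test
    return n_train, n_dev, n_test


print(split_counts(1000))
```

With a test ratio of 0, no held-out test set is produced by this step; `02-build-test.sh` builds the test data separately from `config-test.xml`.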
2 changes: 2 additions & 0 deletions tooling/02-build-test.sh
@@ -0,0 +1,2 @@
rm -r output-test
env/bin/protogenie build config-test.xml --output output-test -n --verbose
44 changes: 44 additions & 0 deletions tooling/config-lemma-pos.xml
@@ -0,0 +1,44 @@
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://hipster-philology.github.io/protogenie/protogenie/schema.rng"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<config xmlns:xi="http://www.w3.org/2001/XInclude">
<default-header>
<header type="explicit">
<key map-to="token">form</key>
<key>lemma</key>
<key>POS</key>
</header>
</default-header>
<memory path="lemma-pos.memory.csv"/>
<output column_marker="TAB">
<header name="order">
<key>token</key>
<key>lemma</key>
<key>POS</key>
</header>
</output>
<postprocessing>
<skip matchPattern="token" source="token"/>
<skip matchPattern="OUT" source="POS"/>
<toolbox name="RomanNumeral">
<applyTo source="token">
<target>lemma</target>
<target>token</target>
</applyTo>
</toolbox>
<replacement matchPattern="[01]" replacementPattern="1">
<applyTo source="token">
<target>lemma</target>
<target>token</target>
</applyTo>
</replacement>
<replacement matchPattern="[2-9]|\d\d+" replacementPattern="2">
<applyTo source="token">
<target>lemma</target>
<target>token</target>
</applyTo>
</replacement>
</postprocessing>
<xi:include href="./corpora/without-morph.xml" parse="xml" />
<xi:include href="./corpora/with-morph.xml" parse="xml" />
</config>
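
To make the two `<replacement>` rules above concrete, here is a minimal Python sketch. It assumes the rules are applied in document order and behave like `re.sub` on the token; how protogenie actually applies `matchPattern` is not shown in this diff, so treat the function `normalize_digits` as a hypothetical reading, not the tool's behaviour.

```python
import re

# Patterns copied from the <replacement> elements, in document order.
RULES = [
    (r"[01]", "1"),
    (r"[2-9]|\d\d+", "2"),
]


def normalize_digits(token):
    # Assumption: each rule is applied to the result of the previous one.
    for pattern, replacement in RULES:
        token = re.sub(pattern, replacement, token)
    return token


for sample in ["0", "1", "7", "10", "25"]:
    print(sample, "->", normalize_digits(sample))
```

Under this reading, `0` and `1` collapse to `1`, while `7` and `10` collapse to `2`. A token such as `25` becomes `22` rather than `2`, because `[2-9]` matches each digit before `\d\d+` is tried.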
60 changes: 60 additions & 0 deletions tooling/config-morph.xml
@@ -0,0 +1,60 @@
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://hipster-philology.github.io/protogenie/protogenie/schema.rng"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<config xmlns:xi="http://www.w3.org/2001/XInclude">
<default-header>
<header type="explicit">
<key map-to="token">form</key>
<key>lemma</key>
<key>POS</key>
<key>morph</key>
</header>
</default-header>
<memory path="morph.memory.csv"/>
<output column_marker="TAB">
<header name="order">
<key>token</key>
<key>lemma</key>
<key>POS</key>
<key>MODE</key>
<key>TEMPS</key>
<key>PERS</key>
<key>NOMB</key>
<key>GENRE</key>
<key>CAS</key>
<key>DEGRE</key>
<key>SPEC</key>
</header>
</output>
<postprocessing>
<skip matchPattern="token" source="token"/>
<skip matchPattern="OUT" source="POS"/>
<toolbox name="RomanNumeral">
<applyTo source="token">
<target>lemma</target>
<target>token</target>
</applyTo>
</toolbox>
<replacement matchPattern="[01]" replacementPattern="1">
<applyTo source="token">
<target>lemma</target>
<target>token</target>
</applyTo>
</replacement>
<replacement matchPattern="[2-9]|\d\d+" replacementPattern="2">
<applyTo source="token">
<target>lemma</target>
<target>token</target>
</applyTo>
</replacement>
<disambiguation matchPattern="(MODE\=[\w-]+)\|?" new-column="MODE" source="morph" default="MODE=x" />
<disambiguation matchPattern="(TEMPS\=[\w-]+)\|?" new-column="TEMPS" source="morph" default="TEMPS=x" />
<disambiguation matchPattern="(PERS\.\=[\w-]+)\|?" new-column="PERS" source="morph" default="PERS.=x" />
<disambiguation matchPattern="(NOMB\.\=[\w-]+)\|?" new-column="NOMB" source="morph" default="NOMB.=x" />
<disambiguation matchPattern="(GENRE\=[\w-]+)\|?" new-column="GENRE" source="morph" default="GENRE=x" />
<disambiguation matchPattern="(CAS\=[\w-]+)\|?" new-column="CAS" source="morph" default="CAS=x" />
<disambiguation matchPattern="(DEGRE\=[\w-]+)\|?" new-column="DEGRE" source="morph" default="DEGRE=x" />
<disambiguation matchPattern="(SPEC\=[\w-]+)\|?" new-column="SPEC" source="morph" default="SPEC=x" />
</postprocessing>
<xi:include href="./corpora/with-morph.xml" parse="xml" />
</config>
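
The `<disambiguation>` rules above each pull one feature out of the `morph` column into its own column, falling back to an `=x` default. A minimal Python sketch of that idea follows. The input string is an illustrative example, `split_morph` is a hypothetical name, and whether protogenie also strips the matched text from `morph` is not covered here.

```python
import re

# (new column, matchPattern, default) copied from the <disambiguation> elements.
RULES = [
    ("MODE", r"(MODE\=[\w-]+)\|?", "MODE=x"),
    ("TEMPS", r"(TEMPS\=[\w-]+)\|?", "TEMPS=x"),
    ("PERS", r"(PERS\.\=[\w-]+)\|?", "PERS.=x"),
    ("NOMB", r"(NOMB\.\=[\w-]+)\|?", "NOMB.=x"),
    ("GENRE", r"(GENRE\=[\w-]+)\|?", "GENRE=x"),
    ("CAS", r"(CAS\=[\w-]+)\|?", "CAS=x"),
    ("DEGRE", r"(DEGRE\=[\w-]+)\|?", "DEGRE=x"),
    ("SPEC", r"(SPEC\=[\w-]+)\|?", "SPEC=x"),
]


def split_morph(morph):
    row = {}
    for column, pattern, default in RULES:
        match = re.search(pattern, morph)
        # Keep the captured "KEY=value" pair, or the default when absent.
        row[column] = match.group(1) if match else default
    return row


# Illustrative morph value for a conjugated verb.
print(split_morph("MODE=ind|TEMPS=pst|PERS.=3|NOMB.=s"))
```

Features missing from the input (here GENRE, CAS, DEGRE and SPEC) come out as `GENRE=x`, `CAS=x` and so on, which is why the statistics in the README count `x` values for every category.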
68 changes: 68 additions & 0 deletions tooling/config-test.xml
@@ -0,0 +1,68 @@
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="https://hipster-philology.github.io/protogenie/protogenie/schema.rng"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<config xmlns:xi="http://www.w3.org/2001/XInclude">
<default-header>
<header type="explicit">
<key map-to="token">form</key>
<key>lemma</key>
<key>POS</key>
<key>morph</key>
</header>
</default-header>
<memory path="morph.memory.csv"/>
<output column_marker="TAB">
<header name="order">
<key>token</key>
<key>lemma</key>
<key>POS</key>
<key>MODE</key>
<key>TEMPS</key>
<key>PERS</key>
<key>NOMB</key>
<key>GENRE</key>
<key>CAS</key>
<key>DEGRE</key>
<key>SPEC</key>
</header>
</output>
<postprocessing>
<skip matchPattern="token" source="token"/>
<skip matchPattern="OUT" source="POS"/>
<toolbox name="RomanNumeral">
<applyTo source="token">
<target>lemma</target>
<target>token</target>
</applyTo>
</toolbox>
<replacement matchPattern="[01]" replacementPattern="1">
<applyTo source="token">
<target>lemma</target>
<target>token</target>
</applyTo>
</replacement>
<replacement matchPattern="[2-9]|\d\d+" replacementPattern="2">
<applyTo source="token">
<target>lemma</target>
<target>token</target>
</applyTo>
</replacement>
<disambiguation matchPattern="(MODE\=[\w-]+)\|?" new-column="MODE" source="morph" default="MODE=x" />
<disambiguation matchPattern="(TEMPS\=[\w-]+)\|?" new-column="TEMPS" source="morph" default="TEMPS=x" />
<disambiguation matchPattern="(PERS\.\=[\w-]+)\|?" new-column="PERS" source="morph" default="PERS.=x" />
<disambiguation matchPattern="(NOMB\.\=[\w-]+)\|?" new-column="NOMB" source="morph" default="NOMB.=x" />
<disambiguation matchPattern="(GENRE\=[\w-]+)\|?" new-column="GENRE" source="morph" default="GENRE=x" />
<disambiguation matchPattern="(CAS\=[\w-]+)\|?" new-column="CAS" source="morph" default="CAS=x" />
<disambiguation matchPattern="(DEGRE\=[\w-]+)\|?" new-column="DEGRE" source="morph" default="DEGRE=x" />
<disambiguation matchPattern="(SPEC\=[\w-]+)\|?" new-column="SPEC" source="morph" default="SPEC=x" />
</postprocessing>
<corpora>
<corpus path="../test/*.tsv" column_marker="TAB">
<splitter name="regexp">
<option matchPattern="PONfrt" source="POS"/>
<option matchPattern="Ref\." source="lemma"/>
</splitter>
<header type="default" />
</corpus>
</corpora>
</config>
13 changes: 13 additions & 0 deletions tooling/corpora/with-morph.xml
@@ -0,0 +1,13 @@
<corpora>
<corpus path="../tsv/LemmaPosMorph/PONfrt/*.tsv" column_marker="TAB">
<splitter name="regexp">
<option matchPattern="PONfrt" source="POS"/>
<option matchPattern="Ref\." source="lemma"/>
</splitter>
<header type="default" />
</corpus>
<corpus path="../tsv/LemmaPosMorph/EmptyLine/*.tsv" column_marker="TAB">
<splitter name="empty_line" />
<header type="default" />
</corpus>
</corpora>
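
The `regexp` splitter configured above cuts a corpus into passages wherever the POS matches `PONfrt` or the lemma matches `Ref\.`. The Python sketch below illustrates that behaviour on a few made-up rows. The function name `split_on_regexp` is hypothetical, and keeping the matching row at the end of its passage is an assumption, not something this diff specifies.

```python
import re

# (source column, matchPattern) pairs copied from the <option> elements.
OPTIONS = [("POS", r"PONfrt"), ("lemma", r"Ref\.")]


def split_on_regexp(rows, options=OPTIONS):
    passages, current = [], []
    for row in rows:
        current.append(row)
        # Assumption: the matching row closes the current passage.
        if any(re.search(pattern, row[column]) for column, pattern in options):
            passages.append(current)
            current = []
    if current:  # trailing rows without a closing match
        passages.append(current)
    return passages


# Made-up rows for illustration only.
ROWS = [
    {"token": "Li", "lemma": "le", "POS": "DETdef"},
    {"token": "rois", "lemma": "roi", "POS": "NOMcom"},
    {"token": ".", "lemma": ".", "POS": "PONfrt"},
    {"token": "Et", "lemma": "et", "POS": "CONcoo"},
]

print([len(p) for p in split_on_regexp(ROWS)])
```

The `empty_line` splitter used for the second corpus would instead close a passage at each blank line of the TSV file.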
9 changes: 9 additions & 0 deletions tooling/corpora/without-morph.xml
@@ -0,0 +1,9 @@
<corpora>
<corpus path="../tsv/LemmaPos/*.tsv" column_marker="TAB">
<splitter name="regexp">
<option matchPattern="PONfrt" source="POS"/>
<option matchPattern="Ref\." source="lemma"/>
</splitter>
<header type="default" />
</corpus>
</corpora>