This repository has been archived by the owner on Dec 4, 2022. It is now read-only.

Backport RDRPOSTagger from http://rdrpostagger.sourceforge.net/. This backport includes the pull-request #17 : datquocnguyen/RDRPOSTagger#17
Cyrille Savelief committed Aug 31, 2020
1 parent 083ed20 commit d201d7a
Showing 932 changed files with 7,840,056 additions and 44 deletions.
27 changes: 10 additions & 17 deletions README.md
@@ -343,6 +343,7 @@ The [Languages](src/com/computablefacts/nona/helpers/Languages.java) class
contains helpers to :

- Perform [language identification](https://en.wikipedia.org/wiki/Language_identification),
+ - Perform [POS-tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging),
- Perform [stemming](https://en.wikipedia.org/wiki/Stemming),
- Load lists of [stopwords](https://en.wikipedia.org/wiki/Stop_word),

@@ -380,6 +381,8 @@ Note that all libraries used are business-friendly :
identification algorithm is licenced under the Apache 2 Licence.
- The [Snowball](https://snowballstem.org/license.html) stemmers are licenced
under the 3-clause BSD Licence.
+ - The [RDRPOSTagger](http://rdrpostagger.sourceforge.net/) POS-tagger is licenced
+ under GPL-3 Licence.
- The [Solr](https://lucene.apache.org) lists of stopwords are licenced under
the Apache 2 Licence.

@@ -404,24 +407,14 @@ String stemHa = stemmer.getCurrent(); // "ha"
stemmer.setCurrent(words.get(1));
String stemBisogno = stemmer.getCurrent(); // "bisogn"

- stemmer.setCurrent(words.get(2));
- String stemDi = stemmer.getCurrent(); // "di"

- stemmer.setCurrent(words.get(3));
- String stemUna = stemmer.getCurrent(); // "una"

- stemmer.setCurrent(words.get(4));
- String stemTazza = stemmer.getCurrent(); // "tazz"

- stemmer.setCurrent(words.get(5));
- String stemDi = stemmer.getCurrent(); // "di"
+ ...

- stemmer.setCurrent(words.get(6));
- String stemZucchero = stemmer.getCurrent(); // "zuccher"
+ // Load stopwords for italian
+ Set<String> stopwords = Languages.stopwords(Languages.eLanguage.ITALIAN);

- stemmer.setCurrent(words.get(7));
- String stemDot = stemmer.getCurrent(); // "."
+ // Execute POS tagger
+ List<Map.Entry<String, String>> tags = Languages.tag(Languages.eLanguage.ITALIAN, sentence);

- // Load stopwords for italian
- Set<String> stopwords = Languages.stopwords(Languages.eLanguage.ITALIAN)
+ // Here, tags = [{Ha,V}, {bisogno,S}, {di,E}, {una,RI}, {tazza,S}, {di,E}, {zucchero,S}, {.,FS}]
+ // See http://medialab.di.unipi.it/wiki/Tanl_POS_Tagset for tags meanings
```
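
As a hedged illustration of how the list returned by `Languages.tag` might be consumed, the sketch below keeps only the tokens that carry a given tag. It does not call the tagger: `sampleTags()` hard-codes the output quoted in the README comment above, and `TagFilterSketch`, `tokensWithTag` and `sampleTags` are hypothetical names that are not part of the repository.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the repository.
final class TagFilterSketch {

  /** Returns the tokens whose tag equals the given tag, in sentence order. */
  static List<String> tokensWithTag(List<Map.Entry<String, String>> tags, String tag) {
    List<String> tokens = new ArrayList<>();
    for (Map.Entry<String, String> entry : tags) {
      if (tag.equals(entry.getValue())) {
        tokens.add(entry.getKey());
      }
    }
    return tokens;
  }

  /** Hard-coded copy of the (token, tag) pairs quoted in the README; no tagger is invoked. */
  static List<Map.Entry<String, String>> sampleTags() {
    List<Map.Entry<String, String>> tags = new ArrayList<>();
    tags.add(new SimpleEntry<>("Ha", "V"));
    tags.add(new SimpleEntry<>("bisogno", "S"));
    tags.add(new SimpleEntry<>("di", "E"));
    tags.add(new SimpleEntry<>("una", "RI"));
    tags.add(new SimpleEntry<>("tazza", "S"));
    tags.add(new SimpleEntry<>("di", "E"));
    tags.add(new SimpleEntry<>("zucchero", "S"));
    tags.add(new SimpleEntry<>(".", "FS"));
    return tags;
  }

  public static void main(String[] args) {
    // Keep only the tokens tagged "S" in the sample sentence.
    System.out.println(tokensWithTag(sampleTags(), "S")); // prints [bisogno, tazza, zucchero]
  }
}
```

Because the pairs are plain `Map.Entry` values, the same loop works for any tag listed in the Tanl tagset page referenced above.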
2 changes: 1 addition & 1 deletion pom.xml
@@ -186,7 +186,7 @@
<arg>-Xep:Var:ERROR</arg>
<arg>-Xep:WildcardImport</arg>
<arg>-XepDisableWarningsInGeneratedCode</arg>
- <arg>-XepExcludedPaths:.*/org/tartarus/snowball/.*</arg>
+ <arg>-XepExcludedPaths:.*/org/tartarus/snowball/.*|.*/RDRPOSTagger/jSCRDRtagger/.*</arg>
</compilerArgs>
<forceJavacCompilerUse>true</forceJavacCompilerUse>
<!-- maven-compiler-plugin defaults to targeting Java 5, but our javac only supports >=6 -->
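
The widened `-XepExcludedPaths` value is a single regular expression with two alternatives. The following is a minimal sketch of which paths it accepts, assuming the pattern is tested against the whole source path (emulated here with `Matcher.matches()`). The class name `ExcludedPathsSketch` and the three example paths are made up for illustration and do not come from the repository.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the repository or of Error Prone.
final class ExcludedPathsSketch {

  // The regular expression copied from the new -XepExcludedPaths argument.
  static final Pattern EXCLUDED =
      Pattern.compile(".*/org/tartarus/snowball/.*|.*/RDRPOSTagger/jSCRDRtagger/.*");

  /** Returns true when the whole path matches one of the two alternatives. */
  static boolean isExcluded(String path) {
    return EXCLUDED.matcher(path).matches();
  }

  public static void main(String[] args) {
    // Illustrative paths only.
    System.out.println(isExcluded("/repo/src/org/tartarus/snowball/SnowballStemmer.java")); // true
    System.out.println(isExcluded("/repo/src/RDRPOSTagger/jSCRDRtagger/Tagger.java")); // true
    System.out.println(isExcluded("/repo/src/com/computablefacts/nona/helpers/Languages.java")); // false
  }
}
```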