A set of NiFi processors implementing Apache OpenNLP engine tools.
The project was generated using the Maven archetype
org.apache.nifi:nifi-processor-bundle-archetype:1.8.0
It is a Java 8 project, built with Maven 3.3+ and following the standard Maven layout conventions.
A docker-compose setup is provided to run NiFi locally with a predefined workflow that serves as a usage example.
You can either build the project and reuse the resulting nar file in your own NiFi installation, or boot a ready-to-use Docker container.
To build the project with Maven, run
mvn clean package
This runs the tests locally and produces a nar file that you can drop into your existing NiFi installation, should you have one.
To boot the Docker container instead, run the docker-compose file using
docker-compose up
The build runs inside the container, as a separate Maven layer, so expect to wait a while on the first run as Maven downloads all of its dependencies.
The nar file is then copied into the NiFi lib/ folder and NiFi is started as a container, available on port 8080.
The configuration directory for NiFi ($NIFI_HOME/conf or /opt/nifi/nifi-current/conf) has been mapped to the local folder ./nifi-local-data/conf.
A new NiFi folder exists under $NIFI_HOME/models that contains the pre-trained models for the English language:
en-chunker.bin
en-doccat-tweets.bin
en-ner-date.bin
en-ner-location.bin
en-ner-money.bin
en-ner-organization.bin
en-ner-percentage.bin
en-ner-person.bin
en-ner-time.bin
en-parser-chunking.bin
en-pos-maxent.bin
en-pos-perceptron.bin
en-sent.bin
en-token.bin
langdetect-183.bin
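The file names above appear to follow a `<language>-<tool>[-<variant>].bin` pattern, with `langdetect-183.bin` as the exception. As a minimal sketch, assuming that pattern holds, a file name could be split into its parts as shown below. The `ModelFileName` class is a hypothetical helper written for illustration and is not part of this project.

```java
import java.util.Optional;

/** Hypothetical helper: splits a name such as "en-ner-person.bin" into its parts. */
final class ModelFileName {
    final String language; // e.g. "en"
    final String tool;     // e.g. "ner"
    final String variant;  // e.g. "person", or "" when the name has no variant

    private ModelFileName(String language, String tool, String variant) {
        this.language = language;
        this.tool = tool;
        this.variant = variant;
    }

    /**
     * Returns the parsed parts, or empty when the name does not match the
     * assumed "<language>-<tool>[-<variant>].bin" pattern.
     */
    static Optional<ModelFileName> parse(String fileName) {
        String suffix = ".bin";
        if (!fileName.endsWith(suffix)) {
            return Optional.empty();
        }
        String base = fileName.substring(0, fileName.length() - suffix.length());
        // Split into at most three parts so a variant may itself contain dashes.
        String[] parts = base.split("-", 3);
        // Assumption: the language prefix is a two-letter lowercase code.
        if (parts.length < 2 || !parts[0].matches("[a-z]{2}")) {
            return Optional.empty();
        }
        String variant = parts.length == 3 ? parts[2] : "";
        return Optional.of(new ModelFileName(parts[0], parts[1], variant));
    }

    public static void main(String[] args) {
        parse("en-ner-person.bin").ifPresent(m ->
                System.out.println(m.language + " / " + m.tool + " / " + m.variant));
    }
}
```

Names that do not fit the assumed pattern, such as `langdetect-183.bin`, yield an empty result rather than a guess.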
A new NiFi folder exists under $NIFI_HOME/training that contains tweets.txt, an example of training data for sentiment analysis on tweets (see Document Categorizer), taken from this discussion on StackOverflow.
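OpenNLP's Document Categorizer is trained from plain text with one sample per line: a category label, whitespace, then the document text. The sketch below splits such a line into its label and text. It assumes tweets.txt follows that format, which is not confirmed here, and `TrainingLine` is a hypothetical name introduced for illustration only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: parses one "<category> <text>" training line. */
final class TrainingLine {

    /** Returns (category, text), or empty for blank lines and lines without text. */
    static Optional<Map.Entry<String, String>> parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        // The first whitespace run separates the category from the text.
        String[] parts = trimmed.split("\\s+", 2);
        if (parts.length < 2) {
            return Optional.empty();
        }
        Map.Entry<String, String> sample = new SimpleEntry<>(parts[0], parts[1]);
        return Optional.of(sample);
    }

    public static void main(String[] args) {
        // The sample line is made up; it is not taken from tweets.txt.
        parse("1 What a lovely morning").ifPresent(s ->
                System.out.println(s.getKey() + " -> " + s.getValue()));
    }
}
```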
Another new folder, $NIFI_HOME/model-store, is present and will hold the trained models for the processors.
The rationale is that processors can be trained from model files, training files, or training data, so the input types differ, but every path ends in a model file that can be stored and reused by the processors. The training/evaluation lifecycle of the processors is explained later.
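The "train once, then reuse the stored model file" idea can be sketched as follows. This is an illustration under stated assumptions and not the processors' actual implementation: `ModelStoreSketch`, `loadOrTrain`, and the `.bin` naming are hypothetical, and the trainer is a plain `Supplier<byte[]>` standing in for whatever produces the serialized model.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

/** Hypothetical sketch of a model store backed by a directory of model files. */
final class ModelStoreSketch {
    private final Path storeDir;

    ModelStoreSketch(Path storeDir) {
        this.storeDir = storeDir;
    }

    /**
     * Returns the stored model bytes when the file already exists; otherwise
     * runs the trainer, writes its output to the store, and returns it.
     */
    byte[] loadOrTrain(String modelName, Supplier<byte[]> trainer) {
        Path modelFile = storeDir.resolve(modelName + ".bin");
        try {
            if (Files.exists(modelFile)) {
                return Files.readAllBytes(modelFile); // reuse the stored model
            }
            byte[] model = trainer.get();             // train from any input type
            Files.createDirectories(storeDir);
            Files.write(modelFile, model);            // store for later reuse
            return model;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whatever the input type, the caller only sees a model name and the resulting bytes, so the trainer runs at most once per stored model.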
Importing from Jitpack
Add the Jitpack repository to your Maven project:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
and the Maven dependency on the GitHub project:
<dependency>
    <groupId>com.github.rdlopes</groupId>
    <artifactId>nifi-open-nlp</artifactId>
    <version>${nifi-open-nlp.version}</version>
</dependency>
Importing from GitHub Package Registry
This feature is temporarily disabled; I'm waiting for GitHub feedback on a few issues.
The following tools listed in the OpenNLP developer documentation are implemented:
- Language Detector
- Sentence Detector
- Tokenizer
- Name Finder
- Document Categorizer
- Part-of-Speech Tagger
- Lemmatizer
- Chunker
- Parser
For further documentation, please refer to the processors usage page.