This repository has been archived by the owner on Sep 15, 2022. It is now read-only.
A set of NiFi processors to implement NLP flows using Apache OpenNLP

nifi-open-nlp

A set of NiFi processors implementing Apache OpenNLP engine tools.

Project structure

The project was generated from the Maven archetype org.apache.nifi:nifi-processor-bundle-archetype:1.8.0.

It is a Java 8 project, built with Maven 3.3+, that follows the standard Maven layout conventions.

A docker-compose setup is included to run NiFi locally with a predefined workflow that serves as a usage example.

Building & running

You can either build the project and reuse the resulting nar file in your own NiFi installation, or boot a ready-to-use Docker container.

From sources

Build the project with Maven:

mvn clean package

This runs the tests locally and produces a nar file that you can drop into an existing NiFi installation, should you have one.

Inside Docker container

Run the docker-compose file with

docker-compose up

The build runs inside the container, as a separate Maven layer, so expect to wait while Maven downloads the project's dependencies.

The nar file is then copied into the NiFi lib/ folder and NiFi is started as a container, available on port 8080.

The configuration directory for NiFi ($NIFI_HOME/conf or /opt/nifi/nifi-current/conf) has been mapped to the local folder ./nifi-local-data/conf.

NLP models folder

A new folder, $NIFI_HOME/models, contains the pre-trained models for the English language:

  • en-chunker.bin
  • en-doccat-tweets.bin
  • en-ner-date.bin
  • en-ner-location.bin
  • en-ner-money.bin
  • en-ner-organization.bin
  • en-ner-percentage.bin
  • en-ner-person.bin
  • en-ner-time.bin
  • en-parser-chunking.bin
  • en-pos-maxent.bin
  • en-pos-perceptron.bin
  • en-sent.bin
  • en-token.bin
  • langdetect-183.bin

NLP training

Another new folder, $NIFI_HOME/training, contains tweets.txt, an example of training data for sentiment analysis on tweets (see Document Categorizer), taken from a discussion on Stack Overflow.
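For readers unfamiliar with the input of the Document Categorizer: OpenNLP's training format puts one sample per line, with the category as the first token and the document text after it. The sketch below only illustrates that line layout using plain Java. The class name `DoccatLineParser` and the sample lines are hypothetical; they are not taken from this project or from tweets.txt.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only: not part of nifi-open-nlp or OpenNLP.
class DoccatLineParser {

    /**
     * Splits a training line of the form "<category> <text>" into its two parts.
     * The first whitespace-delimited token is the category, the rest is the text.
     */
    static Map.Entry<String, String> parse(String line) {
        String trimmed = line.trim();
        String[] parts = trimmed.split("\\s+", 2);
        if (trimmed.isEmpty() || parts.length < 2) {
            throw new IllegalArgumentException("Expected '<category> <text>' but got: " + line);
        }
        return new SimpleEntry<>(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        // Made-up sample lines; the real tweets.txt may use different category labels.
        List<String> lines = Arrays.asList(
                "positive I love this movie",
                "negative The service was terrible");
        for (String line : lines) {
            Map.Entry<String, String> sample = parse(line);
            System.out.println(sample.getKey() + " -> " + sample.getValue());
        }
    }
}
```

A file that passes this kind of check line by line has the layout the categorizer trainer expects; the category labels themselves are free-form.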

NLP model store

Another new folder, $NIFI_HOME/model-store, holds the trained models for the processors.

The rationale is that processors can be trained from model files, training files, or training data, so input types differ; in every case, however, the result is a model file that can be stored and reused by the processors. The training/evaluation lifecycle of the processors will be explained later.
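To check what a store folder currently holds, a small listing utility can help. The following is a minimal sketch in plain Java, assuming trained models are saved with the same .bin extension as the pre-trained ones. `ModelStoreListing` and the default path are hypothetical, and this is not how the processors themselves look up models.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only: not part of nifi-open-nlp.
class ModelStoreListing {

    /** Returns the sorted names of all regular *.bin files directly under the given folder. */
    static List<String> listModelFiles(Path modelStore) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(modelStore, "*.bin")) {
            for (Path file : stream) {
                if (Files.isRegularFile(file)) {
                    names.add(file.getFileName().toString());
                }
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default; pass the real model-store folder as the first argument.
        Path modelStore = Paths.get(args.length > 0 ? args[0] : "model-store");
        if (!Files.isDirectory(modelStore)) {
            System.err.println("Not a directory: " + modelStore);
            return;
        }
        for (String name : listModelFiles(modelStore)) {
            System.out.println(name);
        }
    }
}
```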

Importing from Jitpack

Add the Jitpack repository to your Maven project:

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

and the Maven dependency on the GitHub project:

<dependency>
    <groupId>com.github.rdlopes</groupId>
    <artifactId>nifi-open-nlp</artifactId>
    <version>${nifi-open-nlp.version}</version>
</dependency>

This feature is temporarily disabled; I'm waiting for GitHub feedback on a few issues.

Apache NLP tools

The processors implement tools listed in the OpenNLP developer documentation.

For further documentation, please refer to the processors usage page.