This is a Go microservice that exposes NLP functionality through a gRPC API, with the end goal of summarizing pieces of text.
The live gRPC API can be accessed via https://summarizer.lawrences.tech
- Segmentation - Split text into sentences and words. This is harder than it sounds because natural language is ambiguous. For example, the period character usually marks the end of a sentence, but it can also mark an abbreviation or the decimal point in a number. The segmentation service will use a combination of rules and machine learning to split text into sentences and words.
- Named Entity Recognition - Identify important people, places, and things in the text. For example, in the sentence "George Washington was the first president of the United States", the named entities are "George Washington" and "United States". The named entity recognition service will use machine learning to identify named entities in the text.
- Keyword Extraction - Identify the main topics of a piece of text. Here we use TF-IDF (term frequency-inverse document frequency) to identify the most important words in the text. TF-IDF is a statistical measure of how important a word is to a document within a collection of documents: it weighs how often the word appears in the document against how many documents in the collection contain the word, so words that are frequent in one document but rare elsewhere score highest.
- Summarization - Create a summary of the text. This can be done in two ways: extractive summarization and abstractive summarization. Extractive summarization involves selecting key sentences from the original text, while abstractive summarization involves generating new sentences that convey the main points.
To use the API, download the protobuf file (api/proto/summarizer.proto) and generate the gRPC client code for your language of choice. To generate the Go client code, run the script in this repository:
./scripts/genproto.sh
- A list of Go NLP libraries
- sentences - A Go library for sentence segmentation
- gse - An Efficient Go NLP library for text segmentation
- segment - A Go library for performing Unicode Text Segmentation
- prose - A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more. Not maintained.
- tokenizer - A tokenizer that can be used with pre-trained models. Inspired by huggingface/tokenizers.
- huggingface/tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production. Doesn't currently support Go.