Process large chunks of text into a node tree, which can then be traversed to grab phrases that match the given criteria.
To install:

```shell
npm install frequent-phrases
```
The workflow is generally:
- Construct FrequentPhrase instance
- Define custom config (Optional)
- Process text
- Output frequent phrases
```js
const FP = new FrequentPhrase();
```
Custom Config (more info HERE)
The default config object is as follows:

```js
const defaultConfig = {
  maxPhraseLength: 6,
  selectionAlgorithm: 'dropOff',
  selectionConfig: {
    dropOff: {
      threshold: 0.5
    }
  },
  scoringAlgorithm: 'default',
  parserConfig: {
    chunkSentences: true,
    removeTypedSentences: true
  },
  preProcessing: {
    trim: 3
  },
  postProcessing: {
    uniqueWordAtCutoffDepth: 1
  }
}
```
Access the config property to modify this after instantiation, or construct a new config object and pass it in.
```js
const FP = new FrequentPhrase();
FP.config = newConfigObject;
// or
FP.config.maxPhraseLength = 8; // etc.
```
The final two steps can be performed separately or in a single call.
```js
const speech = 'Five score years ago, a great American, in whose symbolic shadow'; // ... MLK's I Have A Dream speech
```
To process text and then extract phrases:
```js
await FP.process(speech);

// then get the frequent phrases
const res = await FP.getFrequentPhrases();
console.log(res);
```
To do both in one step, pass the text directly to getFrequentPhrases(). Note that this method overwrites previous tree data, so it works best when you instantiate a new FrequentPhrase each time.
```js
const res = await FP.getFrequentPhrases(speech);
console.log(res);
```
Both methods yield the same result:

```js
// console.log(res);
{
  ok: true,
  msg: '',
  frequentPhrases: [
    { phrase: '', score: 0 },
    { phrase: '', score: 0 },
    { phrase: '', score: 0 },
    // ...
  ],
  executionTime: '3.544ms'
}
```
To help you decide how best to adapt the library to a specific use case, here is how it works internally:
- Input corpus
- Pre-process potential candidates
- Select Candidates
- Score selected Candidates
- Post-process candidates
- Output
- trim - Trims the candidate pool so that candidates only originate from the top `trim` starter words. `trim` defaults to `0`, i.e. no trimming.
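A minimal sketch of what a starter-word trim might look like. This is illustrative only, not the library's implementation, and `trimCandidates` is a hypothetical helper:

```js
// Hypothetical sketch of the `trim` pre-processing step: keep only
// candidate phrases whose first word is among the `trim` most
// frequent starter words.
function trimCandidates(phrases, trim) {
  if (trim === 0) return phrases; // trim: 0 means no trimming

  // Count how often each starter word begins a candidate phrase
  const counts = {};
  for (const phrase of phrases) {
    const starter = phrase.split(' ')[0];
    counts[starter] = (counts[starter] || 0) + 1;
  }

  // Keep the top `trim` starter words by count
  const top = Object.keys(counts)
    .sort((a, b) => counts[b] - counts[a])
    .slice(0, trim);

  return phrases.filter((phrase) => top.includes(phrase.split(' ')[0]));
}

trimCandidates(['free at last', 'free at', 'dream today', 'justice rolls'], 1);
// → ['free at last', 'free at']
```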
- Selection Algorithm - The algorithm used to select candidate phrases. The default is a simple drop-off, which cuts off phrases based on the relative visit counts between child and parent nodes.
- Selection Config - Stores constants to modify how selection algorithms perform. See here.
- Scoring Algorithm - Defines which scoring algorithm is used. The default is based solely on averaged visits, meaning a higher average visit count yields a higher score.
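As a rough illustration of visits-averaged scoring (`scorePhrase` is a hypothetical helper, not the library's internals):

```js
// Hypothetical sketch: a phrase's score is the mean visit count of
// the tree nodes that make up the phrase.
function scorePhrase(nodes) {
  const total = nodes.reduce((sum, node) => sum + node.visits, 0);
  return total / nodes.length;
}

scorePhrase([{ visits: 10 }, { visits: 8 }, { visits: 6 }]); // → 8
```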
- uniqueWordAtCutoffDepth - Trims scored candidates so that the highest-scored phrase from each starter word is represented.
- chunkSentences - converts a string into an array of its contained sentences
- removeTypedSentences - finds the unique, longest sentence among a set of progressively typed copies of the same sentence.
- e.g.: We are only interested in the sentence 'How are you?' but we have:
- 'H'
- 'Ho'
- 'How'
- ...
- 'How are you?'
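The removeTypedSentences idea can be sketched as follows (an illustrative stand-in, not the library's actual code): keep only sentences that are not a prefix of a longer sentence in the set.

```js
// Hypothetical sketch: drop any sentence that is a strict prefix of
// another sentence, leaving only the fully typed versions.
function removeTypedSentences(sentences) {
  return sentences.filter(
    (s) => !sentences.some((other) => other !== s && other.startsWith(s))
  );
}

removeTypedSentences(['H', 'Ho', 'How', 'How are you?']);
// → ['How are you?']
```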
Returns frequent phrases from data that has already been processed.
Returns: Promise.<FP>
- Frequent phrases present in the text
| Param | Description |
| --- | --- |
| body | OPTIONAL - a string of text; if passed, it is processed and phrases are then extracted. If omitted, phrases are extracted from existing data. |
Process a string of sentences. Frequent phrases can only be extracted from processed text.
Returns: Promise<string[] | FPNode[]>
- [registry, rootNode]
| Param | Description |
| --- | --- |
| body | string of text to be processed into the sentence registry and node tree |
Cleans out the sentence registry and destroys the node tree.
Returns: Promise<string[] | FPNode[]>
- [registry, rootNode]