A simple text clustering algorithm in c#.
It will add the extension method ClusterBy on IEnumerable. You only need to specify which string property to use and some options.
.NET Standard 2.0
To get the latest version:
Install-Package TextClustering
Consider the following model:
public class Document
{
public string Content { get; set; }
}
How to invoke it:
using TextClustering;
// ...
var documents = new List<Document>();
// Fill list of documents.
var result = documents.ClusterBy(document => document.Content, options => options
.WithMinClusterSize(5) // The minimum cluster size (default value: 5, but you should change it)
.WithMinWordLength(5) // The minimum word length
.WithMaxPresencePercent(10) // The maximum overall presence in percent of one word among all text
.UseCaching(true) // (optional, true by default. Will use more ram, but prevent redoing the same calculation multiple times)
.WithMaxDegreeOfParallelism(Environment.ProcessorCount) // (optional, will use one thread by default)
.WithLanguages(Language.English, Language.French) // (optional, will use English stop words if not specified) This is used to eliminate words that are so commonly used that they carry very little useful information.
);
// result.Unclassified // List<Document>
// result.Clusters // List<List<Document>>
For more complete example, please see the project TextClusteringExample.
Code released under the MIT license.