A Study on Exact and Approximate Occurrence Counters
The challenge of counting events in parallel in a memory-efficient way is not new, but it remains under discussion because there is still considerable room for improvement. Most current solutions save memory by using probabilistic counters that estimate the total number of occurrences of each event instead of recording every occurrence.
This project applies two of the best-known approximate counters to estimate the most frequently used words in literary works by several authors, written in several languages, and compares their estimates with the results of an exact counter. The conclusions drawn from applying the study to the dataset are presented in the project report.
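To illustrate the difference between exact and probabilistic counting, here is a minimal sketch. It is not the code in /src: this section does not name the two approximate counters, so the fixed-probability counter below is an assumption chosen for illustration, and the names `exact_count` and `fixed_probability_count` are hypothetical.

```python
import random
from collections import Counter


def exact_count(words):
    """Exact counter: every occurrence increments the stored value."""
    return Counter(words)


def fixed_probability_count(words, p=0.5, rng=None):
    """Approximate counter (illustrative assumption, not the repository's code).

    Each occurrence increments the stored value only with probability p,
    so stored values are smaller on average. The estimate of the true
    count is the stored value divided by p.
    """
    rng = rng or random.Random()
    stored = Counter()
    for word in words:
        if rng.random() < p:
            stored[word] += 1
    # Scale the stored values back up to estimate the real counts.
    return {word: value / p for word, value in stored.items()}


if __name__ == "__main__":
    words = "the cat and the dog and the bird".split()
    print(exact_count(words).most_common(1))  # [('the', 3)]
    # The estimates vary from run to run and may miss rare words entirely.
    print(fixed_probability_count(words))
```

With such a short input the estimates are very noisy. The approach only becomes useful on large texts, where the relative error of frequent words shrinks as their counts grow.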
/datasets - literary works taken from Project Gutenberg used as input data
/report - documentation of the conducted study
/results - outputs produced by the implemented code
/src - source code of the algorithms
Counter estimates of each algorithm for the top 10 words.
Counter deviations of each algorithm for the top 50 words.
$ cd src
$ pip3 install -r requirements.txt
$ python3 WordOccurrenceCounting.py
The author of this repository is Filipe Pires. The project was developed for the Advanced Algorithms course of the Master's degree in Informatics Engineering at the University of Aveiro.
For further information, read the report or contact me at [email protected].