This project implements a lightweight full-text search engine in Python, featuring TF-IDF (Term Frequency-Inverse Document Frequency) ranking. It is an exercise to understand the fundamental concepts and algorithms that power commercial full-text search engines. Inspired by this blog post.
Each document is represented by a Document
dataclass, which encapsulates metadata such as the title, abstract, URL, and a computed term frequency map. The term frequency map is generated by analyzing the document’s full text, which combines the title and abstract. The analysis process involves tokenizing the text, lowercasing tokens, removing punctuation, filtering out stopwords, and applying stemming to normalize words. The processed tokens are then used to build an inverted index, a data structure that maps each token to the set of documents containing it. This inverted index enables fast retrieval of documents that match query terms.
For more refined results, TF-IDF ranking considers both how often a term appears in a document (term frequency) and how rare the term is across the entire corpus (inverse document frequency). Documents with higher TF-IDF scores are considered more relevant and are returned at the top of the search results.
Please note that you need to download the Wikipedia abstracts data and place it in the data directory. You can download the data that I used from here.
git clone https://github.com/adorabilis/minisearch.git && cd minisearch
Activate virtual environment and install dependencies.
# Using uv
uv venv && uv pip install -r requirements.txt
# Without uv
python -m venv .venv && source venv/bin/activate && pip install -r requirements.txt
Run main.py
.
# Using uv
uv tool run main.py
# Without uv
python main.py
Sample output:
Parsing XML...
Parsing XML took 8.376624822616577 seconds
Wikipedia: Open letter on artificial intelligence (2015) - 20.459650102717887
Wikipedia: Existential risk from artificial intelligence - 20.459650102717887
Wikipedia: March of the Machines - 13.639766735145258
Wikipedia: List of artificial intelligence films - 13.639766735145258
Wikipedia: GOFAI - 13.639766735145258
Wikipedia: One Voice Technologies - 13.226159094217817
Wikipedia: Visual computing - 10.023021230895223
Wikipedia: Constructive cooperative coevolution - 6.819883367572629
Wikipedia: Maja Petrić - 6.819883367572629
Wikipedia: Future Soldier 2030 Initiative - 6.819883367572629
Wikipedia: Digital labor - 6.819883367572629
Wikipedia: Daniel Weitzner - 6.819883367572629
- Weighted ranking (rank title matches higher)
- Better query parsing (support for AND, OR, NOT operators)
- Persistent index (to avoid having to re-index every run)