Information retrieval system built with Python and Django using a Service Oriented Architecture. It provides preprocessing and data cleaning, indexing, matching and ranking, and evaluation, along with word embedding, clustering, and topic modeling.

Hayan47/Information-Retrieval-System

Information Retrieval System

This is an information retrieval system built with Python Django using a Service Oriented Architecture. It uses PostgreSQL as the database backend and Tailwind CSS for styling.

Project Structure

The project is structured as follows:

  • IR contains the main Django project. This directory includes settings, URLs, and other configurations for the entire project. It serves as the central hub for managing the information retrieval system.

  • ir_controller contains the service responsible for controlling the main project and connecting to the remaining services. It handles communication between different components of the system and ensures smooth operation of the information retrieval system.

Additional directories represent individual services within the system. Each service is responsible for specific functionality and communicates with other services via APIs using Django REST Framework.

Services are designed to be modular and communicate with each other through RESTful APIs, which makes it easier to scale the system and to add new features or services.
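
To illustrate how a controller might chain services over REST, here is a minimal sketch using only the Python standard library. The service names, URLs, and JSON fields (`preprocess`, `rank`, `terms`, `results`) are hypothetical; this README does not document the project's actual routes or payloads.

```python
import json
from urllib import request

# Hypothetical service endpoints; the real URLs and routes are not given in this README.
SERVICES = {
    "preprocess": "http://localhost:8001/api/preprocess/",
    "rank": "http://localhost:8002/api/rank/",
}


def post_json(url, payload):
    """POST a JSON payload to a service and return the decoded JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def search(query, call=post_json):
    """Chain two services: clean the query text, then rank documents for it."""
    cleaned = call(SERVICES["preprocess"], {"text": query})
    ranked = call(SERVICES["rank"], {"terms": cleaned["terms"]})
    return ranked["results"]
```

Passing the transport in as `call` keeps the controller logic independent of the network, so it can be exercised with a stub in place of real services.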

Features

  • Service Oriented Architecture
  • PostgreSQL database
  • Tailwind CSS for styling
  • Word Embedding using Word2Vec
  • BM25 Ranking
  • LDA Model for Topic Detection

Techniques Used

Word Embedding using Word2Vec

Word2Vec is a popular technique for generating word embeddings: dense, fixed-length vector representations of words in a continuous vector space. The embeddings capture semantic similarity between words based on the contexts in which they appear in a large text corpus. In this project, Word2Vec converts words into fixed-length vectors, which are then used as features for tasks such as semantic similarity and information retrieval.
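
One common way to use word vectors for retrieval is to average the vectors of a text's tokens and compare texts by cosine similarity. The sketch below shows that idea with hand-written toy vectors standing in for a trained Word2Vec model; the vector values and the function names are illustrative assumptions, not the project's code.

```python
import math

# Toy 3-dimensional vectors standing in for a trained Word2Vec model (made-up values).
VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.4],
    "engine": [0.1, 0.8, 0.5],
}


def embed(tokens, vectors=VECTORS):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(query_tokens, docs, vectors=VECTORS):
    """Return (doc_id, score) pairs sorted by similarity to the query, best first."""
    query_vec = embed(query_tokens, vectors)
    scored = []
    for doc_id, tokens in docs.items():
        doc_vec = embed(tokens, vectors)
        score = cosine(query_vec, doc_vec) if query_vec and doc_vec else 0.0
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a real model, the `VECTORS` lookup would be replaced by the trained embedding table; the averaging and ranking steps stay the same.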

BM25 Ranking

BM25 (Best Matching 25) is a ranking function used for information retrieval. It refines the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme in two ways: term frequency saturates, so repeated occurrences of a term add progressively less to the score, and each document's length is normalized against the average document length in the corpus. As in TF-IDF, terms that appear in few documents receive higher weights because they discriminate better between documents. In this project, BM25 ranks documents by their relevance to user queries.
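
The scoring can be sketched in a few lines of plain Python. This is a minimal Okapi BM25 illustration, not the project's implementation: the parameter values `k1=1.5` and `b=0.75` are commonly used defaults assumed here, and the IDF uses the non-negative `log(... + 1)` variant.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with Okapi BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    # Average document length; fall back to 1.0 if every document is empty.
    avgdl = (sum(len(d) for d in docs) / n_docs) or 1.0
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores
```

Documents are passed as lists of tokens, so any preprocessing (lowercasing, stemming, stop-word removal) is expected to happen before scoring.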

LDA Model for Topic Detection

Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for topic modeling in text corpora. It represents documents as mixtures of topics, where each topic is characterized by a distribution over words. LDA is used to discover the underlying topics in a collection of documents and assign each document a distribution over these topics. In this project, LDA was used for topic detection to categorize documents into different topics and facilitate better organization and retrieval of information.
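
The "mixture of topics" idea can be made concrete with a small numeric sketch. The topic names and probabilities below are made up for illustration; a fitted LDA model would estimate them from the corpus, and this snippet does not perform any training.

```python
# Made-up topic-word distributions (each row sums to 1).
# A fitted LDA model would estimate these from the documents.
TOPIC_WORD = {
    "space": {"planet": 0.5, "orbit": 0.4, "protein": 0.05, "cell": 0.05},
    "biology": {"planet": 0.05, "orbit": 0.05, "protein": 0.5, "cell": 0.4},
}


def word_probability(word, doc_topics, topic_word=TOPIC_WORD):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )


def dominant_topic(doc_topics):
    """Return the topic with the largest weight in the document's mixture."""
    return max(doc_topics, key=doc_topics.get)
```

For a document whose mixture is 80% "space" and 20% "biology", `word_probability("planet", ...)` combines both topics' word distributions, and `dominant_topic` gives a simple way to file the document under a single category.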

Requirements

  • Python 3.x
  • PostgreSQL
  • Node.js and npm (for Tailwind CSS)

Installation

Clone the Repository

git clone https://github.com/Hayan47/Information-Retrieval-System
cd Information-Retrieval-System

Set Up Virtual Environment

python -m venv venv
venv\Scripts\activate

The activation command above is for Windows. On macOS or Linux, run source venv/bin/activate instead.

Install Dependencies

pip install -r requirements.txt

Create Database

Install PostgreSQL from this link. After installing it, create a database named mydatabase that the user postgres can access with the password admin, and make sure the server listens on port 5432 (the PostgreSQL default).
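
For reference, a Django database configuration matching those values would look roughly like the fragment below. The database name, user, password, and port come from the step above; the `HOST` value is an assumption for a local install, and the project's actual settings file may be organized differently.

```python
# settings.py fragment (sketch only; check the project's own settings file).
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydatabase",
        "USER": "postgres",
        "PASSWORD": "admin",
        "HOST": "localhost",  # assumption: PostgreSQL runs on the same machine
        "PORT": "5432",
    }
}
```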

Apply Migrations

python manage.py migrate

Run the Server

python manage.py runserver

Testing the Project

Download Dataset

First, download the dataset, queries, and qrels from the following links and place them in the IR/static/datasets directory:

  • antique: Dataset Link, Queries Link, Qrels Link

  • lotte science: Dataset Link, Queries Link, Qrels Link

Ensure that the directory structure looks like this:

IR/
└── static/
    └── datasets/
        ├── {dataset_name}.tsv
        ├── {dataset_name}_queries.csv
        └── {dataset_name}_qrels.csv
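
A quick way to confirm the files are in place is a small check like the one below. It is a sketch that only follows the naming pattern from the directory layout; the helper name `missing_dataset_files` and the dataset name passed to it are illustrative, so adjust them to the file names you actually downloaded.

```python
from pathlib import Path


def missing_dataset_files(dataset_name, base_dir="IR/static/datasets"):
    """Return the expected dataset files that are not present in base_dir."""
    base = Path(base_dir)
    expected = [
        base / f"{dataset_name}.tsv",
        base / f"{dataset_name}_queries.csv",
        base / f"{dataset_name}_qrels.csv",
    ]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # "antique" is used as an example name; run from the repository root.
    for path in missing_dataset_files("antique"):
        print(f"Missing: {path}")
```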

Contributing

Feel free to submit issues, fork the repository and send pull requests!
