Information Retrieval System

This is an information retrieval system built with Python Django using a Service Oriented Architecture. It uses PostgreSQL as the database backend and Tailwind CSS for styling.

Project Structure

The project is structured as follows:

IR contains the main Django project. This directory includes settings, URLs, and other configurations for the entire project. It serves as the central hub for managing the information retrieval system.
ir_controller contains the service that controls the main project and connects it to the remaining services. It handles communication between the different components of the system.

Additional directories represent individual services within the system. Each service is responsible for specific functionality and communicates with other services via APIs using Django REST Framework.

Services are designed to be modular and to communicate with each other through RESTful APIs, which makes it easier to scale the system and to integrate new features or services.
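As a rough illustration of one service calling another over REST, the sketch below builds a JSON POST request with the Python standard library. The base URL, the /preprocessing/ path, and the payload shape are hypothetical and are not taken from this repository; the real endpoints are defined by each service's Django REST Framework views.

```python
import json
from urllib import request


def build_service_request(base_url, path, payload):
    """Build (but do not send) a JSON POST request to another service."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        url=f"{base_url.rstrip('/')}/{path.lstrip('/')}",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


# Hypothetical endpoint and payload, for illustration only.
req = build_service_request(
    "http://localhost:8000", "/preprocessing/", {"text": "What is BM25?"}
)

# Sending the request requires the target service to be running:
# with request.urlopen(req) as resp:
#     result = json.load(resp)
```

Keeping request construction separate from sending makes the calling code easy to check without a running server.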

Features

Service Oriented Architecture
PostgreSQL database
Tailwind CSS for styling
Word Embedding using Word2Vec
BM25 Ranking
LDA Model for Topic Detection

Techniques Used

Word Embedding using Word2Vec

Word2Vec is a popular technique for generating word embeddings, which are dense vector representations of words in a high-dimensional space. These embeddings capture semantic similarities between words based on their context in a large corpus of text. In this project, Word2Vec was used to convert words into fixed-length vectors, which were then used as features for various tasks such as semantic similarity, information retrieval, and natural language processing.
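The sketch below shows one common way to use fixed-length word vectors as features: average the vectors of a text's words and compare texts by cosine similarity. It does not train Word2Vec. The three-dimensional vectors are made up for illustration, and the function names are hypothetical; a trained model would supply the real vectors, and the project's own code may differ.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
# A trained Word2Vec model would supply vectors with many more dimensions.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "dog": [0.7, 0.3, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}


def text_vector(tokens, embeddings=EMBEDDINGS):
    """Average the vectors of known tokens; return None if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


query = text_vector(["kitten"])
pets = text_vector(["cat", "dog"])
finance = text_vector(["stock", "market"])

# With these toy vectors, the query is closer to the pet text than to the finance text.
closer_to_pets = cosine(query, pets) > cosine(query, finance)
```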

BM25 Ranking

BM25 (Best Matching 25) is a ranking function used for information retrieval. It is an improved version of the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme that takes into account the length of the document and the average length of documents in the corpus. BM25 assigns higher weights to terms that appear infrequently in the corpus and have high discriminative power for a given query. In this project, BM25 was used to rank documents based on their relevance to user queries in the information retrieval system.
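To make the scoring concrete, here is a minimal pure-Python sketch of a common BM25 formulation. The function name, the toy corpus, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration; the project may use a library implementation with different defaults.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one BM25 score per tokenized document in corpus_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "information retrieval ranks documents for a query".split(),
]
scores = bm25_scores("information retrieval".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)  # index of the top document
```

Only the third document contains the query terms, so it is the only one with a non-zero score.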

LDA Model for Topic Detection

Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for topic modeling in text corpora. It represents documents as mixtures of topics, where each topic is characterized by a distribution over words. LDA is used to discover the underlying topics in a collection of documents and assign each document a distribution over these topics. In this project, LDA was used for topic detection to categorize documents into different topics and facilitate better organization and retrieval of information.
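The idea of assigning each document a distribution over topics can be sketched with a small collapsed Gibbs sampler, one standard way to fit LDA. This is a teaching sketch under assumed hyperparameters (alpha, beta, the iteration count) and a toy corpus, not the project's implementation, which may rely on a library instead.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic distributions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]              # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                            # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed topic proportions for each document.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


docs = [
    "cat dog pet cat".split(),
    "dog pet cat dog".split(),
    "stock market trade stock".split(),
    "market trade stock market".split(),
]
theta = lda_gibbs(docs, n_topics=2)
# The dominant topic of each document can serve as its category.
dominant = [max(range(2), key=row.__getitem__) for row in theta]
```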

Requirements

Python 3.x
PostgreSQL
Node.js and npm (for Tailwind CSS)

Installation

Clone the Repository

git clone https://github.com/Hayan47/Information-Retrieval-System
cd Information-Retrieval-System

Set Up Virtual Environment

python -m venv venv
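venv\Scripts\activate

The activation command above is for Windows. On Linux or macOS, run this instead:

source venv/bin/activate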

Install Dependencies

pip install -r requirements.txt

Create Database

Install PostgreSQL from this link. After installing PostgreSQL, create a database named mydatabase that is accessible to the user postgres with the password admin, with the server listening on port 5432.
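For reference, a Django settings entry matching those values would look roughly like the following. This is a hedged sketch, not a copy of the project's settings file: the HOST value is an assumption (a local PostgreSQL server), and the actual configuration in the IR directory may differ.

```python
# Hypothetical excerpt of a Django settings module, using the values named above.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydatabase",
        "USER": "postgres",
        "PASSWORD": "admin",
        "HOST": "localhost",  # assumption: PostgreSQL runs on the same machine
        "PORT": "5432",
    }
}
```

If you choose different credentials, update the settings to match before applying migrations.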

Apply Migrations

python manage.py migrate

Run the Server

python manage.py runserver

Testing the Project

Download Dataset

First, download the dataset, queries, and qrels from the following links and place them in the IR/static/datasets directory:

antique: Dataset Link, Queries Link, Qrels Link
lotte science: Dataset Link, Queries Link, Qrels Link

Ensure that the directory structure looks like this:

IR/
└── static/
    └── datasets/
        ├── {dataset_name}.tsv
        ├── {dataset_name}_queries.csv
        └── {dataset_name}_qrels.csv
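A quick way to confirm the layout before running anything is to check that the three files exist for a dataset. The helper names below are hypothetical, and the script assumes it is run from the repository root and that the dataset name used in the file names is antique; adjust both to match your setup.

```python
from pathlib import Path


def expected_dataset_files(dataset_name, base_dir="IR/static/datasets"):
    """Return the three paths the layout above implies for one dataset."""
    base = Path(base_dir)
    return [
        base / f"{dataset_name}.tsv",
        base / f"{dataset_name}_queries.csv",
        base / f"{dataset_name}_qrels.csv",
    ]


def missing_dataset_files(dataset_name, base_dir="IR/static/datasets"):
    """Return the expected paths that do not exist as files."""
    return [
        path
        for path in expected_dataset_files(dataset_name, base_dir)
        if not path.is_file()
    ]


# Assumed dataset name; prints one line per file that is not in place yet.
for path in missing_dataset_files("antique"):
    print(f"missing: {path}")
```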

Contributing

Feel free to submit issues, fork the repository and send pull requests!
