Semantica is a cutting-edge semantic search engine built using Streamlit, Elasticsearch, and Sentence Transformers. It allows users to perform context-aware searches by leveraging vector embeddings and k-Nearest Neighbors (kNN) search capabilities. Designed for developers and researchers, Semantica provides a user-friendly interface for dataset ingestion, embedding generation, and semantic search.
- Dataset Upload: Upload CSV files with text and metadata for processing.
- Model Selection: Choose from pre-trained Sentence Transformer models for embedding generation.
- Vector Embedding: Transform text data into dense vector embeddings to capture semantic meaning.
- Indexing: Efficiently store data in Elasticsearch for fast retrieval.
- Semantic Search: Perform context-aware queries using vector similarity.
- Result Export: Export search results in CSV, Excel, or PDF formats.
- Customizable UI: Dark-themed, modern UI with user-friendly controls.
Frontend:
- Built with Streamlit for interactivity.
- Enables dataset upload, model selection, and query submission.
Backend:
- Elasticsearch for data storage and kNN-based similarity search.
Machine Learning:
- Sentence Transformers for generating vector embeddings from text data.
Data Handling:
- Pandas for preprocessing and managing uploaded datasets.
Upload Dataset:
- Upload a CSV file containing text data and metadata.
- Preview the uploaded dataset in the app.
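A minimal sketch of this step, assuming the standard Streamlit and pandas APIs; the widget labels and the `text_column` variable are illustrative, not taken from `app.py`:

```python
# Illustrative sketch: read an uploaded CSV with pandas and preview it in Streamlit.
import pandas as pd
import streamlit as st

uploaded_file = st.file_uploader("Upload a CSV file", type=["csv"])
if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)   # parse the uploaded file into a DataFrame
    st.write(f"Loaded {len(df)} rows.")
    st.dataframe(df.head(10))         # preview the first rows in the app
    text_column = st.selectbox("Text column to embed", df.columns)
```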
Model Selection:
- Select a Sentence Transformer model (e.g., `paraphrase-MiniLM-L6-v2` or `all-mpnet-base-v2`).
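A hedged sketch of how model selection could be wired up; the model list, the caching decorator, and the `load_model` helper are assumptions, not necessarily what `app.py` does:

```python
# Illustrative sketch: let the user pick a pre-trained model and load it once.
import streamlit as st
from sentence_transformers import SentenceTransformer

MODEL_OPTIONS = ["paraphrase-MiniLM-L6-v2", "all-mpnet-base-v2"]

@st.cache_resource  # cache the loaded model so it is not reloaded on every rerun
def load_model(name: str) -> SentenceTransformer:
    return SentenceTransformer(name)

model_name = st.selectbox("Sentence Transformer model", MODEL_OPTIONS)
model = load_model(model_name)
st.caption(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
```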
Embedding and Indexing:
- Generate embeddings for the selected text column.
- Index the data into Elasticsearch, including metadata and dense vectors.
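A sketch of this step, assuming an Elasticsearch 8.x cluster and an index named `semantica` with `text` and `embedding` fields; the actual index name, field names, and connection details used by `app.py` may differ:

```python
# Illustrative sketch: encode a text column and bulk-index it with a dense_vector field.
import pandas as pd
from elasticsearch import Elasticsearch, helpers
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")  # real clusters may require auth/TLS
model = SentenceTransformer("all-mpnet-base-v2")
df = pd.DataFrame({"text": ["hello world", "semantic search with vectors"]})

index_name = "semantica"
dims = model.get_sentence_embedding_dimension()

# Mapping with a dense_vector field so Elasticsearch can run kNN search on it.
if not es.indices.exists(index=index_name):
    es.indices.create(
        index=index_name,
        mappings={
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        },
    )

embeddings = model.encode(df["text"].tolist(), show_progress_bar=True)

# Bulk-index each row together with its embedding.
actions = (
    {"_index": index_name, "_source": {"text": text, "embedding": vec.tolist()}}
    for text, vec in zip(df["text"], embeddings)
)
helpers.bulk(es, actions)
```

Cosine similarity is chosen here so that scores line up with the kNN query sketched in the next step.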
Search and Retrieve:
- Enter a search query, which is transformed into a vector.
- Retrieve the top results based on vector similarity using kNN search.
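A sketch of the retrieval step, reusing the assumed `semantica` index and `embedding` field from the indexing sketch above:

```python
# Illustrative sketch: embed the query and run a kNN search against Elasticsearch.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("all-mpnet-base-v2")

query = "vector similarity search"
query_vector = model.encode(query).tolist()  # same vector space as the indexed documents

response = es.search(
    index="semantica",
    knn={
        "field": "embedding",        # the dense_vector field created at indexing time
        "query_vector": query_vector,
        "k": 10,                     # number of nearest neighbours to return
        "num_candidates": 100,       # candidates examined per shard
    },
    source=["text"],
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])
```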
Export Results:
- Download search results in CSV, Excel, or PDF formats.
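A sketch of CSV and Excel export using Streamlit's download buttons; the PDF path handled by `report_generator.py` is omitted, and the result columns shown are placeholders:

```python
# Illustrative sketch: offer search results as CSV and Excel downloads.
import io
import pandas as pd
import streamlit as st

results = pd.DataFrame({"score": [0.92, 0.87], "text": ["first hit", "second hit"]})

# CSV download
st.download_button(
    "Download CSV",
    data=results.to_csv(index=False).encode("utf-8"),
    file_name="results.csv",
    mime="text/csv",
)

# Excel download (requires an Excel writer such as openpyxl)
excel_buffer = io.BytesIO()
results.to_excel(excel_buffer, index=False)
st.download_button(
    "Download Excel",
    data=excel_buffer.getvalue(),
    file_name="results.xlsx",
    mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
)
```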
Semantica/
├── app.py # Main application script for Streamlit.
├── custom.css # Custom CSS file for styling the app.
├── report_generator.py # Script for exporting results in various formats.
├── requirements.txt # List of dependencies for the project.
├── .env # Environment variables (optional, for local use).
└── elasticsearch.yml # Elasticsearch configuration (if self-hosted).
- Python 3.7 or higher
- Elasticsearch (local or cloud-hosted)
- Virtual Environment (recommended)
git clone https://github.com/Milind-Palaria/Semantica---A-Semantic-Search-Engine.git
cd Semantica---A-Semantic-Search-Engine
python -m venv semanticaVENV
source semanticaVENV/bin/activate # For Linux/Mac
semanticaVENV\Scripts\activate # For Windows
pip install -r requirements.txt
Option 1: Elastic Cloud
- Sign up at Elastic Cloud.
- Obtain the `ES_ENDPOINT`, `ES_USERNAME`, and `ES_PASSWORD` credentials.
Option 2: Local Elasticsearch
- Install and run Elasticsearch locally, for example with Docker:
docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:8.10.2
- Create a `.env` file:
ES_ENDPOINT=https://your-elasticsearch-endpoint
ES_USERNAME=elastic
ES_PASSWORD=your-password
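A sketch of how the app might read these values and verify the connection, assuming `python-dotenv` and the official Elasticsearch Python client; the exact logic in `app.py` may differ:

```python
# Illustrative sketch: load credentials from .env and check the Elasticsearch connection.
import os
from dotenv import load_dotenv
from elasticsearch import Elasticsearch

load_dotenv()  # reads ES_ENDPOINT, ES_USERNAME, ES_PASSWORD from the .env file

es = Elasticsearch(
    os.environ["ES_ENDPOINT"],
    basic_auth=(os.environ["ES_USERNAME"], os.environ["ES_PASSWORD"]),
)

if es.ping():
    print("Connected to Elasticsearch")
else:
    print("Could not reach Elasticsearch - check the .env values")
```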
streamlit run app.py
- Access the app locally at http://localhost:8501.
- Ensure Elasticsearch is running and accessible.
- The app will notify you if the connection is successful.
- Upload a CSV file with at least one text column.
- Preview the dataset and select the relevant columns for processing.
- Select a pre-trained Sentence Transformer model.
- Click the "Process and Index Dataset" button to generate embeddings and store the data.
- Enter a search query in plain text.
- View the top results ranked by semantic similarity.
- Download the search results as CSV, Excel, or PDF files.
Contributions are welcome! To contribute:
- Fork the repository.
- Create a feature branch.
- Commit your changes.
- Submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
- Author: Milind Palaria
- Email: [email protected]
- GitHub: https://github.com/Milind-Palaria