A research paper recommendation engine that uses text embeddings and Gaussian Process Regression (GPR) to suggest relevant papers based on your preferences.
Paper Recommender helps researchers discover relevant papers by:
- Fetching recent papers from arXiv
- Using text embeddings to represent paper content
- Learning your preferences through an onboarding process
- Recommending papers using a variance-based Gaussian Process Regression model
- Balancing exploration and exploitation to improve recommendations over time
Rather than predicting ratings directly, the GPR model learns the expected variance between ratings as a function of similarity. This yields more robust recommendations with better-calibrated uncertainty estimates.
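As a rough illustration of the idea (not the project's actual implementation), inverse-variance weighting of neighbor ratings might look like this; `predict_rating` and `variance_fn` are hypothetical names, with `variance_fn` standing in for the function the GP learns:

```python
import numpy as np

def predict_rating(similarities, ratings, variance_fn):
    # variance_fn maps a similarity score to the expected variance
    # between ratings at that similarity (what the GP learns).
    variances = np.array([variance_fn(s) for s in similarities])
    weights = 1.0 / np.maximum(variances, 1e-8)  # trust low-variance neighbors more
    weights = weights / weights.sum()
    mean = float(weights @ np.asarray(ratings))
    # Uncertainty of the weighted estimate, assuming independent noise
    std = float(np.sqrt(weights**2 @ variances))
    return mean, std

# Toy variance model: more similar papers disagree less in rating
toy_variance = lambda s: 1.0 - 0.9 * s
mean, std = predict_rating([0.9, 0.5], [5.0, 2.0], toy_variance)
```

Here the highly similar neighbor (similarity 0.9) dominates the prediction because its expected rating variance is low.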
- Python 3.8 or higher
- Ollama for text embeddings
- Clone the repository:

  ```
  git clone https://github.com/yourusername/paper-recommender.git
  cd paper-recommender
  ```

- Install the package:

  ```
  pip install -e .
  ```
The paper recommender uses Ollama with the nomic-embed-text model for generating text embeddings.
- Install Ollama by following the instructions at [ollama.ai](https://ollama.ai)

- Pull the nomic-embed-text model:

  ```
  ollama pull nomic-embed-text
  ```

- Verify the installation (nomic-embed-text should appear in the list):

  ```
  ollama list
  ```

Ensure the Ollama server is running when using Paper Recommender:

```
ollama serve
```
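To sanity-check the setup from Python, a minimal sketch against Ollama's local embeddings endpoint might look like the following; the URL and payload shape follow Ollama's default `/api/embeddings` route, and the helper names are our own:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default port

def build_payload(text, model="nomic-embed-text"):
    # The embeddings endpoint expects a JSON body with "model" and "prompt"
    return json.dumps({"model": model, "prompt": text}).encode("utf-8")

def embed(text, model="nomic-embed-text"):
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(text, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The response carries the vector under the "embedding" key
        return json.load(resp)["embedding"]

# embed("Hello, world!") returns a list of floats once `ollama serve` is up
```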
Paper Recommender uses a configuration file located at `~/.paper_recommender/config.json`. The default configuration is created on first run, but you can customize it:

```json
{
  "chroma_db_path": "~/.paper_recommender/chroma_db",
  "model_path": "~/.paper_recommender/gp_model.pkl",
  "embedding_cache_path": "~/.paper_recommender/embedding_cache.pkl",
  "exploration_weight": 1.0,
  "max_samples": 1000,
  "period_hours": 48,
  "random_sample_size": 5,
  "diverse_sample_size": 5,
  "num_recommendations": 5,
  "gp_num_samples": 100,
  "n_nearest_embeddings": 10
}
```
Key configuration parameters:

- `exploration_weight`: Controls the balance between exploration and exploitation (higher values favor exploration)
- `period_hours`: Time window for fetching recent papers from arXiv
- `random_sample_size` and `diverse_sample_size`: Number of papers to show during onboarding
- `gp_num_samples`: Number of GP samples for uncertainty estimation
- `n_nearest_embeddings`: Number of nearest embeddings to use for prediction
On first run, the system will automatically start the onboarding process:
```
paper-recommender
```
The onboarding process helps the system learn your preferences by asking you to rate papers:
```
paper-recommender --onboard
```
During onboarding:
- You'll be presented with papers selected using different strategies (random and diverse)
- You'll rate each paper on a scale of 1-5
- The system will use these ratings to build a recommendation model
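The README doesn't specify how the diverse papers are chosen; one common approach is greedy farthest-point sampling over the embeddings, sketched here under that assumption (`diverse_sample` is a hypothetical name):

```python
import numpy as np

def diverse_sample(embeddings, k):
    # Greedy farthest-point selection: start from paper 0, then repeatedly
    # add the paper farthest from everything selected so far.
    embeddings = np.asarray(embeddings, dtype=float)
    chosen = [0]
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        # Each paper's distance to the selected set shrinks as the set grows
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen
```

This spreads the sample across the embedding space, so your ratings cover dissimilar topics instead of near-duplicates.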
After onboarding, you can get paper recommendations:
```
paper-recommender --recommend
```
The recommendation process:
- Fetches recent papers from arXiv
- Uses the trained model to predict your ratings
- Presents papers with the highest predicted ratings
- Allows you to rate the recommended papers to improve future recommendations
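A simplified view of how an exploration weight trades predicted rating against uncertainty is an upper-confidence-bound style score; the real model's scoring may differ, and `rank_papers` is a name invented for this sketch:

```python
def rank_papers(predictions, exploration_weight=1.0, top_k=5):
    # predictions: list of (title, predicted_mean, predicted_std) tuples.
    # Score = mean + exploration_weight * std, so uncertain papers get a
    # boost when exploration_weight is high.
    scored = sorted(
        ((mean + exploration_weight * std, title) for title, mean, std in predictions),
        reverse=True,
    )
    return [title for _, title in scored[:top_k]]

papers = [("A", 4.0, 0.1), ("B", 3.5, 1.0), ("C", 4.2, 0.0)]
```

With `exploration_weight=1.0`, the uncertain paper B outranks the safer bets; with `0.0`, the ranking follows predicted means alone.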
If you want to retrain the recommendation model with your latest ratings:
```
paper-recommender --bootstrap
```
```
usage: paper-recommender [-h] [--onboard] [--recommend] [--bootstrap]
                         [--config CONFIG] [--chroma-db-path CHROMA_DB_PATH]
                         [--model-path MODEL_PATH]
                         [--embedding-cache-path EMBEDDING_CACHE_PATH]
                         [--exploration-weight EXPLORATION_WEIGHT]
                         [--max-samples MAX_SAMPLES]
                         [--period-hours PERIOD_HOURS]
                         [--random-sample-size RANDOM_SAMPLE_SIZE]
                         [--diverse-sample-size DIVERSE_SAMPLE_SIZE]
                         [--num-recommendations NUM_RECOMMENDATIONS]

Paper Recommender

optional arguments:
  -h, --help            show this help message and exit
  --onboard             Run onboarding process
  --recommend           Run recommendation process
  --bootstrap           Bootstrap the recommendation model
  --config CONFIG       Path to custom config file
  --chroma-db-path CHROMA_DB_PATH
                        Path to ChromaDB directory
  --model-path MODEL_PATH
                        Path to model pickle file
  --embedding-cache-path EMBEDDING_CACHE_PATH
                        Path to embedding cache file
  --exploration-weight EXPLORATION_WEIGHT
                        Exploration weight for recommendations
  --max-samples MAX_SAMPLES
                        Maximum number of samples for similarity search
  --period-hours PERIOD_HOURS
                        Time period in hours for paper retrieval
  --random-sample-size RANDOM_SAMPLE_SIZE
                        Number of random papers to select during onboarding
  --diverse-sample-size DIVERSE_SAMPLE_SIZE
                        Number of diverse papers to select during onboarding
  --num-recommendations NUM_RECOMMENDATIONS
                        Number of recommendations to show
```
Paper Recommender fetches recent papers from arXiv based on the configured time period. It uses the arXiv API to retrieve paper titles, abstracts, and links.
The system uses the nomic-embed-text model through Ollama to generate vector embeddings for paper content. These embeddings capture the semantic meaning of papers, allowing the system to find similar papers.
Embeddings and ratings are stored in a ChromaDB vector database, enabling efficient similarity search.
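Conceptually, the similarity search boils down to cosine similarity between the query embedding and the stored ones; ChromaDB handles the indexing and storage, so this sketch (with a made-up `nearest` helper) shows only the underlying math:

```python
import numpy as np

def nearest(query, stored, k=3):
    # Normalize so the dot product equals cosine similarity
    q = np.asarray(query, dtype=float)
    m = np.asarray(stored, dtype=float)
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ q
    # Indices of the k most similar stored embeddings, best first
    return np.argsort(-sims)[:k].tolist()
```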
The recommendation system uses a Gaussian Process Regression (GPR) model that:
- Learns to predict the variance between ratings as a function of similarity
- Uses this variance to weight similar papers when making predictions
- Samples the GP model to estimate uncertainty
- Uses N-sigma confidence intervals to balance exploration and exploitation
This approach provides more robust recommendations with better uncertainty estimates compared to directly predicting ratings.
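"Sampling the GP model" to estimate uncertainty can be illustrated with a toy posterior: draw many samples from a multivariate normal and read each paper's uncertainty off the empirical standard deviation. The numbers below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy posterior over two papers' ratings: means and a diagonal covariance
mean = np.array([4.2, 3.1])
cov = np.diag([0.04, 0.25])  # true stds of 0.2 and 0.5

# Draw many samples and estimate each paper's uncertainty empirically
samples = rng.multivariate_normal(mean, cov, size=5000)
empirical_std = samples.std(axis=0)

# An N-sigma confidence interval for paper 0 (here N = 2):
interval = (mean[0] - 2 * empirical_std[0], mean[0] + 2 * empirical_std[0])
```

The exploration weight then determines how far up such an interval the recommender reaches when scoring candidates.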
- Ollama Connection Error: Ensure the Ollama server is running with `ollama serve`
- Missing Model: If you get an error about the nomic-embed-text model, run `ollama pull nomic-embed-text`
- No Recommendations: You may need to rate more papers during onboarding before recommendations can be generated
If you encounter any bugs or have feature requests, please open an issue on the GitHub repository.
This project is licensed under the terms of the LICENSE file included in the repository.
Contributions are welcome! Please feel free to submit a Pull Request.