Bachelor of Engineering thesis by Paweł Rzepiński and Ryszard Szymański, supervised by Agnieszka Jastrzębska, Ph.D. Eng. The objective was to develop a book recommender system based on a novel dataset. Both collaborative-filtering and content-based approaches were considered. The implemented recommendation models are accessible through a web application that lets users explore the dataset and compare the results of both approaches: a "Similar books to X" panel presenting items similar to the selected book, and a "You may also like X, Y, Z" panel containing recommendations based on the books rated by the selected user.
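For intuition, here is a minimal sketch of the collaborative-filtering side ("You may also like ..."), using the Surprise library credited below. It assumes the goodbooks-10k `ratings.csv` layout (`user_id`, `book_id`, `rating` columns) and is an illustration, not the thesis implementation:

```python
import pandas as pd
from surprise import SVD, Dataset, Reader

# goodbooks-10k ratings: one row per (user_id, book_id, rating) triple
ratings = pd.read_csv("data/raw/ratings.csv")

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[["user_id", "book_id", "rating"]], reader)

# Train a matrix-factorization model on all available ratings
algo = SVD()
algo.fit(data.build_full_trainset())

# Recommend: score every book the user has not rated, keep the top 3
user_id = 1
seen = set(ratings.loc[ratings.user_id == user_id, "book_id"])
candidates = [b for b in ratings.book_id.unique() if b not in seen]
top3 = sorted(candidates, key=lambda b: algo.predict(user_id, b).est, reverse=True)[:3]
print("You may also like:", top3)
```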
A full showcase video is available on Google Drive.
The `thesis` folder contains both the thesis and its abstract.
Documentation of the recommendation module can be found in the `docs` folder. The main page is located at `docs/_build/html/index.html`.
├── Makefile             <- Makefile with commands like `make data`, `make models`, `make scores`.
├── README.md            <- The top-level README for developers using this project.
├── data
│   ├── external         <- Data from third-party sources.
│   ├── interim          <- Intermediate data that has been transformed.
│   ├── processed        <- The final, canonical data sets for modeling.
│   └── raw              <- The original, immutable data dump.
│
├── docs                 <- Codebase documentation.
│
├── models               <- Trained and serialized models, model predictions.
│
├── notebooks            <- Jupyter notebooks. Naming convention is a number (for ordering),
│                           the creator's initials, and a short `-`-delimited description, e.g.
│                           `1-rzepinskip-initial-data-exploration`.
│
├── references           <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports              <- Generated analysis as PDF files.
│   └── figures          <- Generated graphics and figures used in reporting.
│
├── requirements.txt     <- The requirements file for the web application.
├── requirements-dev.txt <- The requirements file for reproducing the analysis environment.
│
├── setup.py             <- Makes the project installable with pip.
└── booksuggest          <- Source code of the recommendation module.
    ├── data             <- Scripts to download or generate data.
    │
    ├── features         <- Scripts to turn raw data into features for modeling.
    │
    ├── models           <- Scripts to train models and then use trained models to make
    │                       predictions.
    │
    └── evaluation       <- Scripts to evaluate scores and validate results against ground-truth data.
All commands mentioned below should be run from the project's root folder. Run `make help` to display information about the available commands.

Prerequisites:

- a UNIX-based system
- GNU Make
- Python 3.7
- pip
To run the web application:

- Create a virtual environment: `make create_environment`
- Activate the virtual environment: `source rs-venv/bin/activate`
- Install the packages required by the web application: `make app_requirements`
- Run the app: `make app`
- Open the address displayed in the console; the web application should be accessible at http://127.0.0.1:8050/.
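The default address (port 8050) is characteristic of Dash. Purely as an illustration of what such a "Similar books to X" panel can look like, here is a hypothetical minimal Dash sketch with hard-coded data, not the project's actual code:

```python
from dash import Dash, Input, Output, dcc, html

# Hypothetical precomputed results: selected title -> similar titles
SIMILAR = {"Dune": ["Hyperion", "Foundation"], "Emma": ["Persuasion"]}

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id="book",
                 options=[{"label": t, "value": t} for t in sorted(SIMILAR)]),
    html.H3("Similar books"),
    html.Ul(id="similar"),  # filled in by the callback below
])

@app.callback(Output("similar", "children"), Input("book", "value"))
def show_similar(title):
    return [html.Li(t) for t in SIMILAR.get(title, [])]

if __name__ == "__main__":
    app.run_server(debug=True)  # serves on http://127.0.0.1:8050/ by default
```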
To reproduce the analysis:

- Create a virtual environment: `make create_environment`
- Activate the virtual environment: `source rs-venv/bin/activate`
- Install the packages required for development: `make requirements`
- Download the raw data: `make data`
- Train the models: `make models`
- Evaluate the models: `make scores`
Comments:

- When using the whole dataset, the `make models` command takes about 20 minutes and `make scores` takes more than 12 hours.
- To check the pipeline on a small subset of the data, add the `TEST_RUN=1` parameter when running make commands; the whole process should then take about 5 minutes. Example: `make scores TEST_RUN=1`
- To utilize make's parallelization, use the `-j <n_jobs>` parameter, where `<n_jobs>` specifies the number of parallel jobs to run. Most often, `n_jobs` should equal the number of processor cores, although there are also some RAM requirements when using the whole dataset. Example: `make scores -j 2`
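For a feel of what the content-based side ("Similar books to X") computes, here is a minimal sketch using TF-IDF and cosine similarity over book descriptions; the sample data is invented, and the exact features used in the thesis may differ:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical input: one row per book with a free-text description
books = pd.DataFrame({
    "title": ["Dune", "Hyperion", "Pride and Prejudice"],
    "description": [
        "Desert planet, spice, political intrigue, messianic hero.",
        "Pilgrims travel to the Time Tombs on a far-future world.",
        "Manners, marriage and misunderstandings in Regency England.",
    ],
})

# Turn each description into a TF-IDF vector
tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(books["description"])

# Cosine similarity between every pair of book descriptions
sim = cosine_similarity(vectors)

# "Similar books to X": rank the other books by similarity to row 0 (Dune)
order = sim[0].argsort()[::-1]
print([books["title"].iloc[i] for i in order if i != 0])
```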
- Dataset used in the project: goodbooks-10k.
- Recommendation methods come mostly from the Surprise library.
- Project structure based on the Cookiecutter Data Science project template.
- Thesis based on the latex-mimosis template by Bastian Rieck.