This is a Python-based web scraping and data visualization project built with the Scrapy, pymongo, and Dash libraries. The goal is to scrape data from books.toscrape.com, store it in a MongoDB database, and provide a dashboard to search and visualize the scraped book data.
- Installation
- Usage
- Project Structure
- Scraping
- Dashboard
- Automatic execution with anacron
- Contributing
- Clone the repository:
  git clone git@github.com:kstarkiller/simplon_brief08_web_scrapping.git
- Navigate to the project directory:
  cd simplon_brief08_web_scraping (or whatever you named this project)
- Install the required dependencies:
  pip install -r dash/requirements.txt
To scrape data from books.toscrape.com, you first have to install MongoDB and create a user (if not already done); without this step you won't be able to run the scraping process. See the instructions for installing MongoDB and creating a user below.
In spider.py and main.py, replace the MONGO_USR and MONGO_PWD variables with your MongoDB username and password (see lines 14 and 15 of spider.py and line 23 of main.py).
Then you can run the following command: scrapy crawl myspider
This command will initiate the scraping process and save the scraped data in the 'books' collection of the 'scraped_books' database.
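For reference, storing scraped items in MongoDB with pymongo typically follows the pattern sketched below. This is only an illustration of the general approach, not the project's actual pipelines.py or spider.py: the database and collection names ('scraped_books', 'books') come from the description above, while the localhost URI and the literal placeholder credentials are assumptions you would adapt to your own setup.

```python
# Illustrative sketch only -- the project's real logic lives in scraping/myproject/.
import pymongo

MONGO_USR = "username"  # replace with your MongoDB username (placeholder)
MONGO_PWD = "password"  # replace with your MongoDB password (placeholder)

class MongoPipeline:
    """Store each scraped book item in the 'books' collection of 'scraped_books'."""

    def open_spider(self, spider):
        # Connect once when the spider starts; assumes a local MongoDB instance
        # whose user was created in the admin database.
        self.client = pymongo.MongoClient(
            f"mongodb://{MONGO_USR}:{MONGO_PWD}@localhost:27017/?authSource=admin"
        )
        self.collection = self.client["scraped_books"]["books"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert the item as a plain dict and pass it along unchanged.
        self.collection.insert_one(dict(item))
        return item
```

A pipeline written this way would be enabled through the ITEM_PIPELINES setting in settings.py; the project itself may organize the MongoDB write differently (for example directly in spider.py).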
To launch the dashboard, run the following command: python dash/main.py
Visit http://localhost:8050/ in your web browser to interact with the dashboard. You can search for books and visualize some data points.
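If you are curious how a Dash entry point with a search box over the scraped data can be structured, a stripped-down sketch is shown below. It is not the project's main.py: the connection string, the title field name, and the layout are assumptions for illustration only.

```python
# Minimal sketch of a Dash app backed by the scraped books -- not the project's actual main.py.
import pymongo
from dash import Dash, Input, Output, dcc, html

# Placeholder connection details; adapt to your own MongoDB user, password and host.
client = pymongo.MongoClient("mongodb://username:password@localhost:27017/?authSource=admin")
books = client["scraped_books"]["books"]

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id="search", type="text", placeholder="Search by title...", debounce=True),
    html.Ul(id="results"),
])

@app.callback(Output("results", "children"), Input("search", "value"))
def update_results(query):
    # Case-insensitive substring match on the title field (field name is an assumption).
    cursor = books.find({"title": {"$regex": query or "", "$options": "i"}}).limit(20)
    return [html.Li(doc.get("title", "?")) for doc in cursor]

if __name__ == "__main__":
    app.run(debug=True)  # serves on http://localhost:8050/ by default (use run_server on older Dash versions)
```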
To crawl the books.toscrape.com website automatically every day from Monday to Friday, open /etc/anacrontab and add this line at the end of the file:
1 15 spider [1-5] "~/path/to/your/file"
What does this line mean?
- 1: The period in days, so the command is scheduled every day.
- 15: The delay in minutes before the job is started.
- spider: The identifier for the job.
- [1-5]: The range of days from Monday (1) to Friday (5), so the command will only be executed on those days.
- "~/path/to/your/file": The path to the program or script that will be executed (a sketch of such a script is shown after this list).
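The file that anacron points at simply needs to change into the Scrapy project and start the crawl. If you prefer to keep that wrapper in Python, a hypothetical version could look like the sketch below; the project path is a placeholder to replace with your own, and the file must be made executable (chmod +x) for anacron to run it.

```python
#!/usr/bin/env python3
# Hypothetical wrapper script for the anacron job -- adapt the project path to your machine.
import subprocess
from pathlib import Path

# Placeholder path: point this at the folder that contains scrapy.cfg.
PROJECT_DIR = Path.home() / "path/to/simplon_brief08_web_scraping/scraping"

if __name__ == "__main__":
    # Run the spider from inside the Scrapy project so scrapy.cfg is picked up.
    subprocess.run(["scrapy", "crawl", "myspider"], cwd=PROJECT_DIR, check=True)
```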
simplon_brief08_web_scraping/
├── scraping/
│ ├── myproject/
│ │ ├── spiders/
│ │ │ ├── __init__.py
│ │ │ └── spider.py
│ │ ├── __init__.py
│ │ ├── items.py
│ │ ├── middlewares.py
│ │ ├── pipelines.py
│ │ └── settings.py
│ └── scrapy.cfg
├── dash/
│ ├── assets/
│ │ ├── icons/
│ │ │ ├── logo.svg
│ │ │ └── search_icon.svg
│ │ └── style.css
│ ├── main.py
│ └── requirements.txt
├── api/
│ └── traduction_api.py
├── .gitignore
└── README.md
- scraping/: Contains the Scrapy spider for scraping book data.
- spider.py: Script to scrape data and store it in a MongoDB database.
- main.py: Script to create and launch the Dash dashboard.
- requirements.txt: Lists the required Python libraries and versions.
- README.md: Project documentation.
Feel free to contribute to this project by opening issues or submitting pull requests. Your input is highly appreciated.
This guide will walk you through the process of installing MongoDB on Ubuntu and creating a user with read and write permissions.
Update the package index and install MongoDB:
sudo apt update
sudo apt install -y mongodb
Start the MongoDB service and enable it at boot:
sudo systemctl start mongod
sudo systemctl enable mongod
Note: depending on your Ubuntu release and package source, the service may be named mongodb rather than mongod; the official mongodb-org packages provide the mongod service and the mongosh shell used below.
Open the MongoDB shell and switch to the admin database:
mongosh
use admin
Create a user with read and write permissions, replacing username and password with your desired values:
db.createUser({
  user: "username",
  pwd: "password",
  roles: [
    { role: "readWrite", db: "your_database" }
  ]
})
Make sure to replace your_database with the name of the database you want to grant read and write permissions to.
Exit the shell:
exit
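If you would rather create the user from Python than from mongosh, the same command can be issued through pymongo. The snippet below is only an equivalent illustration of the db.createUser call above; it assumes you can already connect to the local instance with administrative rights, and the username, password, and database name are placeholders.

```python
# Sketch: create the MongoDB user via pymongo instead of mongosh (placeholder values).
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")  # connect as an administrator if auth is already enabled
client.admin.command(
    "createUser",
    "username",                                         # desired username (placeholder)
    pwd="password",                                      # desired password (placeholder)
    roles=[{"role": "readWrite", "db": "your_database"}],
)
```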
Verify the new user by reconnecting with its credentials:
mongosh -u username -p password --authenticationDatabase admin
Replace username and password with your specified values; the authentication database is admin because that is where the user was created.
use your_database
db.getCollectionNames()
This should list the collections in your database (it may be empty until the spider has run), confirming that the user can authenticate and read the database.
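You can also check the new credentials from Python, which is closer to how spider.py and main.py will use them. The snippet below is a quick sanity check with placeholder values; it writes and then deletes a throwaway document to exercise both read and write permissions.

```python
# Quick sanity check of the new MongoDB user from Python (placeholder credentials and database name).
import pymongo

client = pymongo.MongoClient(
    "mongodb://username:password@localhost:27017/?authSource=admin"  # the user was created in 'admin'
)
db = client["your_database"]
db["books"].insert_one({"ping": "ok"})  # exercises write permission
print(db.list_collection_names())       # exercises read permission
db["books"].delete_one({"ping": "ok"})  # clean up the test document
```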