Andryas/example-gcp-scrapy

Quick tutorial for web scraping on GCP with no database

Long story short, the idea is to schedule a VM that starts, runs the web scraping, and shuts down when it finishes. To avoid needing a database, we write each page we crawl into a JSON file, and once the crawl is done we send that file to a Cloud Storage bucket.
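
For illustration, here is a minimal sketch of that flow. It is not the repository's actual code: the file name, the function names, and the use of the google-cloud-storage client library are assumptions.

```python
# Sketch only: gather crawled pages in a local JSON file, then push the
# file to the bucket so the VM can shut down without a database.
# Assumes google-cloud-storage is installed and BUCKET is set in the env.
import json
import os

from google.cloud import storage

OUTPUT = "output.json"  # hypothetical local file name


def save_pages(pages):
    # Write every crawled page into a single JSON file instead of a database.
    with open(OUTPUT, "w") as f:
        json.dump(pages, f)


def upload_to_bucket():
    # Send the finished JSON to Cloud Storage before the machine dies.
    client = storage.Client()
    bucket = client.bucket(os.environ["BUCKET"])
    bucket.blob(OUTPUT).upload_from_filename(OUTPUT)
```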

  • create a VM
  • create a start schedule for the VM (instance schedules use cron syntax)
  • create a bucket and note its name for the next step
  • create a .env file in this repository and set BUCKET=bucket-name (see the sketch after this list)
  • copy this repository to the VM, then:
    • run sudo bash setup.sh
    • configure crontab, for instance: 15 12 * * MON bash /home/wavrzenczak/scraping/crawl.sh test
  • Note that the schedule that starts the machine needs to be close to the crontab entry inside the VM; I'd leave about 15 minutes between them just in case.
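
As a sketch of the .env step, this is how the BUCKET variable might be loaded on the VM, assuming python-dotenv is used; the repository may read it differently.

```python
# Sketch only: load BUCKET from the .env file created in the step above.
import os

from dotenv import load_dotenv

load_dotenv()                  # reads the .env file in the project root
BUCKET = os.environ["BUCKET"]  # e.g. "bucket-name" from the step above
```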

Now, take a look at how src/spiders/test.py and pipeline.py are written, create another spider following the same structure, and be happy.
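
As a rough example of what a new spider could look like (the class name, spider name, URL, and selectors below are placeholders, not taken from this repository):

```python
# Sketch of a new spider following the same idea as src/spiders/test.py.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # hypothetical spider name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per crawled element; the pipeline gathers them
        # into the JSON file that is later sent to the bucket.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "url": response.url,
            }
```

The argument passed to crawl.sh in the crontab example above (test) presumably selects which spider to run, so a new spider named quotes would be scheduled as bash /home/wavrzenczak/scraping/crawl.sh quotes.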
