Andryas/example-gcp-scrapy

Quick tutorial for web scraping on GCP with no database

Long story short, the idea is to schedule a VM that starts, runs the web scraping, and shuts down when it finishes. To avoid needing a database, we write each page we crawl into a JSON file, and once the crawl is done we send that file to a Cloud Storage bucket.
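
For illustration, here is a minimal sketch of that flow. It is not the repository's actual code: the file name, the function names, and the use of the google-cloud-storage client library are assumptions.

```python
# Sketch only: gather crawled pages in a local JSON file, then push the
# file to the bucket so the VM can shut down without a database.
# Assumes google-cloud-storage is installed and BUCKET is set in the env.
import json
import os

from google.cloud import storage

OUTPUT = "output.json"  # hypothetical local file name


def save_pages(pages):
    # Write every crawled page into a single JSON file instead of a database.
    with open(OUTPUT, "w") as f:
        json.dump(pages, f)


def upload_to_bucket():
    # Send the finished JSON to Cloud Storage before the machine dies.
    client = storage.Client()
    bucket = client.bucket(os.environ["BUCKET"])
    bucket.blob(OUTPUT).upload_from_filename(OUTPUT)
```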

  • create a VM
  • create a start schedule for the VM (instance schedules use cron syntax)
  • create a bucket and note its name for the next step
  • create a .env file in this repository and set BUCKET=bucket-name (see the sketch after this list)
  • copy this repository to the VM, then:
    • run sudo bash setup.sh
    • configure crontab, for instance: 15 12 * * MON bash /home/wavrzenczak/scraping/crawl.sh test
  • Note that the schedule that starts the machine needs to be close to the crontab entry inside the VM; I'd leave about 15 minutes between them just in case.
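
As a sketch of the .env step, this is how the BUCKET variable might be loaded on the VM, assuming python-dotenv is used; the repository may read it differently.

```python
# Sketch only: load BUCKET from the .env file created in the step above.
import os

from dotenv import load_dotenv

load_dotenv()                  # reads the .env file in the project root
BUCKET = os.environ["BUCKET"]  # e.g. "bucket-name" from the step above
```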

Now, take a look at how src/spiders/test.py and pipeline.py are written, create another spider following the same structure, and be happy.
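
As a rough example of what a new spider could look like (the class name, spider name, URL, and selectors below are placeholders, not taken from this repository):

```python
# Sketch of a new spider following the same idea as src/spiders/test.py.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # hypothetical spider name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per crawled element; the pipeline gathers them
        # into the JSON file that is later sent to the bucket.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "url": response.url,
            }
```

The argument passed to crawl.sh in the crontab example above (test) presumably selects which spider to run, so a new spider named quotes would be scheduled as bash /home/wavrzenczak/scraping/crawl.sh quotes.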
