title: Webscraping with Ruby
theme: sudodoki/reveal-cleaver-theme
author:
  name: Jaya Wijono
  twitter: jayzz55
  url: https://github.com/Jayzz55
output: present_ruby.html
controls: true
--
--
--
Grabbing the data you want from the internet
--
--
--
--
Finding grocery specials information that matters!
--
There is http://salefinder.com.au/
But this is not good enough!
--
--
require 'nokogiri'
require 'open-uri'
--
url = "http://salefinder.com.au/Woolworths-catalogue"
# URI.open comes from open-uri; plain Kernel#open no longer fetches URLs on Ruby 3+
page = Nokogiri::HTML(URI.open(url))
p item = page.css("span.item-details h1 a").text
--
SEE THIS IN ACTION
--
--
BUT the scraped data is not from Melbourne :(
--
--
require 'mechanize'
--
url = "http://salefinder.com.au/Woolworths-catalogue"
agent = Mechanize.new
page = agent.get(url)
--
To find the form on the retrieved page (here the second form, index 1) using the Mechanize::Page instance's parser method:
agent.page.parser.css('form')[1]
However, the Mechanize gem gives us a handy shortcut:
agent.page.forms[1]
--
To set the field using the Mechanize agent instance
agent.page.forms[1]["locationSearch"] = "Melbourne, 3000"
To submit the form
agent.page.forms[1].submit
--
SEE THIS IN ACTION
--
Because of JAVASCRIPT!!
--
--
require 'capybara'
require 'capybara/poltergeist'
include Capybara::DSL
Capybara.default_driver = :poltergeist
--
Visiting the url
visit "http://salefinder.com.au/Woolworths-catalogue"
Setting location through Capybara::Session#execute_script
page.execute_script("$.cookie('postcodeId', 5188)")
page.execute_script("$.cookie('regionName', 'MELBOURNE, 3000')")
--
To parse the page:
description = page.find('span#header-region').text
--
SEE THIS IN ACTION
--
--
--
Create delayed jobs to do scraping work (in app/jobs)
class CheckCataloguesJob < ActiveJob::Base
end
Calling the jobs
CheckCataloguesJob.perform_later
--
Create a rake task (in lib/tasks) to schedule and manage the jobs on Heroku or with the Whenever gem
namespace :scraper do
  desc "test running jobs"
  task check_catalogues: :environment do
    require './app/jobs/check_catalogues_job.rb'
    CheckCataloguesJob.perform_later
  end
end
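With the Whenever gem, the schedule lives in `config/schedule.rb` and is turned into cron entries by `whenever --update-crontab`. A minimal sketch of scheduling the rake task above; the daily 4:30 am slot is an arbitrary assumption, not taken from the project.

```ruby
# config/schedule.rb
# Assumed schedule: run the scraper task once a day.
every 1.day, at: '4:30 am' do
  rake 'scraper:check_catalogues'
end
```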
--
Testing:
Use RSpec to test that the job's algorithm returns the expected value
BUT how do you test that the web scraper goes out and scrapes the data as expected?
--
Puffing Billy to the rescue!
A rewriting web proxy for testing interactions between your browser and external sites. Works with ruby + rspec.
Puffing Billy is like webmock or VCR, but for your browser.
--
WAIT A MINUTE! The scraping job runs on Poltergeist, and the spec that tests this job also runs on Poltergeist.
So running a Poltergeist on top of another Poltergeist???
--
Selenium saves the day for testing!
require 'rails_helper'
require 'spec_helper'
require 'billy/rspec'
require './app/jobs/check_catalogues_job.rb'
feature CheckCataloguesJob do
  before do
    @original_driver = Capybara.default_driver
    Capybara.default_driver = :selenium_chrome_billy
  end
  after do
    Capybara.default_driver = @original_driver
  end
end
--
Testing that the web scraping job behaves as expected:
scenario 'CheckCataloguesJob scrapes the expected data' do
  expect(CheckCataloguesJob.new.scrape_published_catalogue_nums).to eq(["8390", "8451", "8368", "8437", "8356"])
end
--
Testing that the web scraping job behaves as expected:
SEE THIS IN ACTION
--
Check it out:
http://savvy-mom.herokuapp.com/
https://github.com/Jayzz55/savvy_mom
--
--
Contact: