Add scraped page archive #2

ondenman · 2016-11-18T09:55:07Z

What does this do?

Uses scraped-page-archive to archive all pages scraped.

Why is this needed?

The country recently had an election (2016-09-04). I had hoped that we might be able to archive the previous term before it disappeared, but it looks like the site now lists the current term. As I had already begun to add the scraper, archiving the current term was only a trivial step, and at least it's now archived for the future.

Archiving it now gives us the chance to go back and re-scrape later even if it disappears.

Relevant Issue(s):

everypolitician/everypolitician-data#20544

Checklists:

Scraper change:

scraper is on Morph.io under the "everypolitician-scrapers" group
scraper's GitHub "Website" link points at morph.io page
scraper is set to auto-run
scraper is configured for archiving

Adding Archiving:

scraper uses scraped-page-archive gem directly or via a suitable strategy — uses it directly
MORPH_SCRAPER_CACHE_GITHUB_REPO_URL is configured
pages are being archived in new branch of correct scraper repo (yay!)
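
For anyone repeating the archiving checklist, here is a minimal sketch of a pre-flight check that the archive repo variable is set before the scraper runs. Only the variable name MORPH_SCRAPER_CACHE_GITHUB_REPO_URL comes from this PR; the helper `archive_repo_url` and the example URL are hypothetical, and nothing here is part of the scraped-page-archive gem itself.

```ruby
# Hypothetical pre-flight check, not part of scraped-page-archive.
# Returns the configured archive repo URL, or raises if it is missing or blank.
# `env` defaults to the process environment; pass a Hash to try it out.
def archive_repo_url(env = ENV)
  url = env['MORPH_SCRAPER_CACHE_GITHUB_REPO_URL'].to_s.strip
  raise ArgumentError, 'MORPH_SCRAPER_CACHE_GITHUB_REPO_URL is not set' if url.empty?

  url
end

# Example with a made-up repo URL:
#   archive_repo_url({ 'MORPH_SCRAPER_CACHE_GITHUB_REPO_URL' => 'https://github.com/example/example-scraper' })
```

Calling this at the top of the scraper would make a missing setting fail loudly, rather than leaving it to be noticed later when no archive branch appears.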

Gemfile change:

all links are secure
links to GitHub use github: protocol, not simply git:
formatting is consistent with our normal Rubocop setup

Source was pointing to :git not :github source as defined above
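
To illustrate the `git:` versus `github:` point, here is a rough sketch of a check that flags GitHub gems declared with `git:`. It assumes the "as defined above" refers to a `git_source(:github)` block near the top of the Gemfile, which this page does not show. The helper `github_gems_using_git`, the sample Gemfile text, and the repo path in it are all illustrative, not copied from this PR.

```ruby
# Hypothetical lint helper, not part of Bundler or of this scraper.
# Returns the (stripped) Gemfile lines that declare a GitHub-hosted gem
# via the `git:` option instead of the `github:` option.
def github_gems_using_git(gemfile_text)
  gemfile_text.lines.map(&:strip).select do |line|
    line.start_with?('gem ') && line.match?(/\bgit:\s*['"][^'"]*github\.com/)
  end
end

# Illustrative Gemfile content; the quoted heredoc tag disables interpolation
# so the "#{repo_name}" inside the git_source block stays literal.
SAMPLE_GEMFILE = <<~'GEMFILE'
  source 'https://rubygems.org'
  git_source(:github) { |repo_name| "https://github.com/#{repo_name}.git" }
  gem 'scraped_page_archive', git: 'git://github.com/everypolitician/scraped_page_archive.git'
  gem 'scraped_page_archive', github: 'everypolitician/scraped_page_archive'
GEMFILE

flagged = github_gems_using_git(SAMPLE_GEMFILE)
puts flagged
```

Only the line using `git:` is flagged. The `github:` form passes because it is resolved through the `git_source(:github)` block, which in this sample builds an https URL.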

davewhiteland · 2016-12-28T17:47:56Z

👍 to checklist, but @tmtmtmtm indicated to me the pages were the wrong ones

tmtmtmtm

As @davewhiteland mentions, this isn't actually archiving the correct pages, as the layout of the site has changed.

I've brought this repo up to our normal set-up separately, which means some of the changes from this PR can be dropped. However, as 72c2b46 combines updating the Gemfile (though not the Gemfile.lock) with adding the require to the scraper itself, it's not just a matter of dropping the unnecessary commits.

ondenman · 2017-01-04T10:03:43Z

Closing in favour of #3

ondenman added 2 commits November 18, 2016 09:32

Bundle install

15466c1

Require scraped page archive

72c2b46

ondenman assigned davewhiteland Nov 18, 2016

Tidy Gemfile for Rubocop

d134041

ondenman force-pushed the add-scraped-page-archive branch from adc2395 to d134041 November 21, 2016 11:53

ondenman added 2 commits November 21, 2016 12:35

Fix Github gem source

0914b24

Source was pointing to :git not :github source as defined above

Update Gemfile.lock with bundle install

abacf09

davewhiteland assigned tmtmtmtm and unassigned davewhiteland Dec 28, 2016

tmtmtmtm suggested changes Jan 3, 2017

tmtmtmtm assigned ondenman and unassigned tmtmtmtm Jan 3, 2017

ondenman closed this Jan 4, 2017
