Add scraped page archive #2

Closed
ondenman wants to merge 5 commits from the add-scraped-page-archive branch

ondenman (Contributor) commented Nov 18, 2016

What does this do?

Uses scraped-page-archive to archive all pages scraped.

Why is this needed?

The country recently had an election (2016-09-04). I had hoped we could archive the previous term before it disappeared, but it looks like the site now lists the current term. As I had already begun work on the scraper, archiving the current term was only a trivial extra step; at least it is now archived for the future.

Archiving the pages now gives us the chance to go back and re-scrape them later, even if they disappear from the site.

Relevant Issue(s):

everypolitician/everypolitician-data#20544

Checklists:

Scraper change:

  • scraper is on Morph.io under the "everypolitician-scrapers" group
  • scraper's GitHub "Website" link points at morph.io page
  • scraper is set to auto-run
  • scraper is configured for archiving

Adding Archiving:

  • scraper uses scraped-page-archive gem directly or via a suitable strategy (here: directly)
  • MORPH_SCRAPER_CACHE_GITHUB_REPO_URL is configured
  • pages are being archived in new branch of correct scraper repo (yay!)

Gemfile change:

  • all links are secure
  • links to GitHub use github: protocol, not simply git:
  • formatting is consistent with our normal Rubocop setup
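
The first two Gemfile checks above can be automated. The following is only a rough sketch: `gemfile_source_problems` is a hypothetical helper that is not part of this PR or of any gem, and the gem names and repo paths in the sample Gemfile are illustrative assumptions. It flags `git://` or `http://` URLs, and GitHub sources declared with `git:` instead of `github:`.

```ruby
# Hypothetical helper (not part of the PR): reports Gemfile lines that
# break the "secure links" and "github: not git:" checklist items.
def gemfile_source_problems(gemfile_text)
  problems = []
  gemfile_text.each_line.with_index(1) do |line, number|
    next if line.strip.start_with?('#') # skip whole-line comments

    # Check 1: URLs must not use the unencrypted git:// or http:// schemes.
    problems << "line #{number}: insecure URL" if line.match?(%r{\b(?:git|http)://})

    # Check 2: a git: (or :git =>) option pointing at github.com should be
    # written with the github: shorthand instead.
    if line.match?(/(?:\bgit:|:git\s*=>)\s*['"][^'"]*github\.com/)
      problems << "line #{number}: use github: instead of git:"
    end
  end
  problems
end

# Illustrative Gemfile content; the repo paths are assumptions.
sample = <<~GEMFILE
  source 'https://rubygems.org'
  gem 'scraped_page_archive', git: 'git://github.com/everypolitician/scraped_page_archive'
  gem 'scraperwiki', github: 'openaustralia/scraperwiki-ruby', branch: 'morph_defaults'
GEMFILE

puts gemfile_source_problems(sample)
```

The check is line-based and regex-only, so it will miss sources split across several lines; the third checklist item (formatting) is left to Rubocop.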

ondenman force-pushed the add-scraped-page-archive branch from adc2395 to d134041 on November 21, 2016 11:53
Source was pointing to :git not :github source as defined above
davewhiteland commented

👍 to the checklist, but @tmtmtmtm told me the archived pages were the wrong ones

tmtmtmtm (Contributor) left a comment

As @davewhiteland mentions, this isn't actually archiving the correct pages, as the layout of the site has changed.

I've brought this repo up to our normal set-up separately, which means some of the changes from this PR can be dropped. However, as 72c2b46 combines updating the Gemfile (though not the Gemfile.lock) with adding the require to the scraper itself, it's not just a matter of dropping the unnecessary commits.

tmtmtmtm assigned ondenman and unassigned tmtmtmtm on Jan 3, 2017
ondenman (Contributor, Author) commented Jan 4, 2017

Closing in favour of #3

ondenman closed this on Jan 4, 2017
3 participants