
Add scraped page archive #1

Merged 11 commits into master on Nov 23, 2016

Conversation


@ondenman ondenman commented Nov 15, 2016

What does this do?

Uses scraped-page-archive to archive all pages scraped.
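A sketch of what this typically looks like in the scraper (this assumes the gem's open-uri integration is what's being used; the snippet is illustrative, not the actual diff):

```ruby
# Sketch only: assumes scraped-page-archive's open-uri integration.
# Once required, every page the scraper fetches via open() is also
# committed to the git repository configured in
# MORPH_SCRAPER_CACHE_GITHUB_REPO_URL, as a side effect of the fetch.
require 'open-uri'
require 'scraped_page_archive/open-uri'

page = open('http://www.assemblee-nationale.ga/34-deputes/').read
# ...parse `page` as before; archiving needs no further code.
```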

Why is this needed?

There's an election coming up (1 December 2016), and it's likely that the data on the official site will disappear, meaning any data we're not already picking up will be lost. Archiving it now gives us the chance to go back and re-scrape later, even if the site disappears.

Relevant Issue(s):

everypolitician/everypolitician-data#20544

Checklists:

Scraper change:

  • scraper is on Morph.io under the "everypolitician-scrapers" group — no, it's still at https://morph.io/tmtmtmtm/gabon-deputes, but there's little point in moving it this close to the election.
  • scraper's GitHub "Website" link points at morph.io page
  • scraper is set to auto-run — not yet: @tmtmtmtm, this needs to be set on Morph.
  • scraper is configured for archiving — that's what this PR is doing!

Adding Archiving:

  • scraper uses scraped-page-archive gem directly or via a suitable strategy — uses it directly
  • MORPH_SCRAPER_CACHE_GITHUB_REPO_URL is configured — not yet: @tmtmtmtm, this needs to be configured on Morph.
  • pages are being archived in new branch of correct scraper repo (yay!)
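For reference, that variable is set as a secret environment variable on the scraper's morph.io settings page (names prefixed MORPH_ are exposed to the running scraper). The value below is a hypothetical placeholder, not this scraper's actual configuration:

```shell
# Set under "Secret environment variables" in the scraper's
# morph.io settings. The URL format (credentials embedded in an
# https git URL) is an assumption for illustration.
MORPH_SCRAPER_CACHE_GITHUB_REPO_URL="https://USERNAME:TOKEN@github.com/tmtmtmtm/gabon-deputes.git"
```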

Gemfile change:

  • all links are secure
  • links to GitHub use the github: source option, not plain git:
  • formatting is consistent with our normal Rubocop setup (See commit message.)
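As a concrete illustration of the github:-vs-git: point, a Gemfile entry along these lines (a sketch, not the repo's actual Gemfile):

```ruby
source 'https://rubygems.org'

# Avoid: the git:// protocol is unauthenticated and unencrypted.
# gem 'scraped_page_archive', git: 'git://github.com/everypolitician/scraped_page_archive.git'

# Prefer the github: shorthand. Note that older Bundler versions
# expanded github: to an insecure git:// URL, so check that your
# Bundler resolves it to https.
gem 'scraped_page_archive', github: 'everypolitician/scraped_page_archive'
```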


@tmtmtmtm tmtmtmtm left a comment


@ondenman ondenman force-pushed the add-scraped-page-archive branch from 4066bf8 to 737f9ac Compare November 18, 2016 10:45
@tmtmtmtm tmtmtmtm commented

@ondenman the rubocop tidying only needs to be for the Gemfile, not the scraper.rb. As the autofix actually changes the logic in the scraper (by hoisting the guard clause), I'm a little hesitant to include this here. I'd suggest dropping 737f9ac, and rerunning separately against only the Gemfile.

(NB you don't necessarily need to actually include Rubocop+config in these sort of changes, but there's no harm in doing so, so now that it's already here it's OK to keep it)
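To illustrate the hazard (with made-up code, not the actual scraper.rb): hoisting a conditional into a guard clause is only behaviour-preserving when the conditional wraps the whole remainder of the method or block.

```ruby
# Hypothetical example of how an automatic guard-clause rewrite can
# change logic when the `if` did not guard everything after it.

# Original: skip processing for blank rows, but count every row.
def process_original(rows)
  count = 0
  rows.each do |row|
    unless row.empty?
      # ...process row...
    end
    count += 1
  end
  count
end

# After a naive "hoist the guard" rewrite using `next`:
def process_hoisted(rows)
  count = 0
  rows.each do |row|
    next if row.empty?
    # ...process row...
    count += 1 # now skipped for empty rows, so the count changes
  end
  count
end

puts process_original(['a', '', 'b']) # => 3
puts process_hoisted(['a', '', 'b'])  # => 2
```

This is why an autofix that looks purely cosmetic still deserves a separate, reviewable commit.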

@tmtmtmtm tmtmtmtm removed their assignment Nov 21, 2016
@ondenman ondenman force-pushed the add-scraped-page-archive branch from 737f9ac to cd62b98 Compare November 21, 2016 10:45
@ondenman ondenman assigned tmtmtmtm and unassigned ondenman Nov 21, 2016
ondenman and others added 4 commits November 21, 2016 11:06
Source was pointing to a :git source, not the :github source defined above
The only two relevant pages I could find from the outgoing legislature
were the Commissions and the list of female deputies.
@tmtmtmtm tmtmtmtm commented

The PR description here is a bit misleading — this isn't really archiving any pages, only a PDF file, which presumably is never going to change. As such, this change probably isn't that useful beyond a single run to archive that PDF once.

However, I suspect we should also archive (though not process) http://www.assemblee-nationale.ga/34-deputes/168-bureaux-des-commissions/ and http://www.assemblee-nationale.ga/34-deputes/153-les-femmes-deputes/

I've added an extra commit to pick up those two pages.
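That extra commit plausibly amounts to fetching the pages without parsing them; a sketch (assuming the archive hook is already wired into open-uri, as earlier in this PR):

```ruby
# Sketch: once scraped_page_archive/open-uri is required, simply
# fetching a page is enough to archive it; no parsing needed.
%w[
  http://www.assemblee-nationale.ga/34-deputes/168-bureaux-des-commissions/
  http://www.assemblee-nationale.ga/34-deputes/153-les-femmes-deputes/
].each { |url| open(url).read }
```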

@tmtmtmtm tmtmtmtm commented

I've configured Morph, though I haven't set it to run every day; I doubt the site is actually going to change again between now and the election.

@tmtmtmtm tmtmtmtm merged commit 18fed99 into master Nov 23, 2016