Automatically create annotations in database from old analyst spreadsheets #141
Comments
Couple questions:
Ah! This might need some explanation. Changes aren’t meant to be writable, and the only information they store is a denormalized version of what’s in the annotations that are attached to them (i.e. it’s just a shortcut for easier access or database indexing). We do expect changes to have multiple annotations. Changes have a They also have
Actually, I think that info should be API-visible. An annotation is meant to be any old pile of JSON object (with the exception that For example, the older annotations from 2017 have totally different fields and formats, and a UI for exploring our annotations/changes would want to know how best to display a given annotation or how to present it for editing. A given field name might be best displayed with a dropdown or radio button list, so it might be helpful to have something like the type or version of the annotation so the UI knows how to treat it.
This adds a script that will read a CSV matching the format of our analysts’ current sheets and create annotations in the database for each row. Run it like:
    scripts/annotations_import <PATH_TO_CSV>
Add the `--is_important_changes` option if the sheet represents “important” changes. This mainly affects how the `significance` field is calculated.
This is a component of edgi-govdata-archiving/web-monitoring#141
While the web-monitoring-db project has the ability to store annotations (mostly free-form information from a human or bot about what exactly has changed between two versions of a page), the analyst team doesn’t currently make use of it. We’d like to start surfacing annotation information in the UI, and one way to do so is to import the annotations they currently make in spreadsheets into the database.
(There was some previous discussion on this in edgi-govdata-archiving/web-monitoring-db#61, but I’ve made this issue to be a bit fresher and more concise.)
Analysts currently have spreadsheets formatted with the following columns:
`significance` in the database annotation, which is a number between 0 and 1. There’s a lot of room for interpretation here, but I’m thinking `low = 0.5, medium = 0.75, high = 1.0`. (This column is only in the important changes sheet, hence starting at `0.5`: even a low importance is still somewhat significant just by virtue of being in this particular spreadsheet.)
The biggest thing to note here is that, because the way we calculate the value for a lot of fields has changed over time, you should probably use either the “This Period - Side by Side” or “Latest to Base - Side by Side” columns to determine the page and version IDs to annotate. They will always be in the format:
Where `ABC` is the Page’s UUID and `DEF..GHI` is the change ID. To break that down a bit more, a change ID `DEF..GHI` indicates the change between the version with UUID `DEF` and the version with UUID `GHI`. Sometimes `DEF` will be missing, so a change ID could be `..GHI`. That means the change between the version immediately preceding `GHI` and the version with UUID `GHI`.
Over on the database/API side of things, we have:
We want to take each row from an analyst spreadsheet, look up the relevant Versions (or really, the relevant Change), and create an annotation with all the data from the last several rows of the sheet.
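As a rough illustration of the change ID convention described above (`DEF..GHI`, or `..GHI` when the first version is omitted), here is a minimal Python sketch for splitting a change ID into its two version UUIDs. The names `ChangeId` and `parse_change_id` are hypothetical and not part of web-monitoring-db or web-monitoring-processing; the sketch also assumes the change ID has already been extracted from the spreadsheet cell.

```python
from typing import NamedTuple, Optional


class ChangeId(NamedTuple):
    # None means "the version immediately preceding to_version".
    from_version: Optional[str]
    to_version: str


def parse_change_id(change_id: str) -> ChangeId:
    """Split a change ID like 'DEF..GHI' or '..GHI' into its version UUIDs.

    Hypothetical helper for illustration only.
    """
    from_version, separator, to_version = change_id.strip().partition("..")
    if not separator or not to_version:
        raise ValueError(f"Not a valid change ID: {change_id!r}")
    return ChangeId(from_version or None, to_version)
```

A missing first UUID is returned as `None`, so the caller can decide how to ask the database or API for “the version immediately preceding” the second one.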
I can see this written either as a Rake task inside web-monitoring-db or as a Python script using the tools in web-monitoring-processing’s db.py module:
If written inside web-monitoring-db, it has direct access to the database. You can look up a change with `Change.find_by_api_id('DEF..GHI')`. It’ll throw a `RecordNotFound` error if there is a problem with the IDs.
Then you can create the annotation with:
If written as a Python script, you’ll have to use the public API to create the annotation:
Or using the Python DB wrapper:
Other notes and caveats:
`annotation_version` field or something in the annotation’s data so tools reading the data back out later know how to treat it.
`significance` field in the annotation will have to be treated differently for individual analyst sheets vs. the “important changes” sheet:
`author` for the annotation, but that’s a bit complicated so we should skip it for the moment.
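To make the `significance` and `annotation_version` notes a little more concrete, here is a minimal Python sketch. It covers only the “important changes” sheet, using the `low = 0.5, medium = 0.75, high = 1.0` mapping floated earlier in this issue; how individual analyst sheets should be scored is not settled here, so the sketch leaves that out. The function names, the `extra_fields` parameter, and the `"hypothetical-v1"` version string are all assumptions made for illustration.

```python
from typing import Optional

# Mapping proposed in this issue for the "important changes" sheet only.
IMPORTANT_SHEET_SIGNIFICANCE = {"low": 0.5, "medium": 0.75, "high": 1.0}


def significance_from_importance(label: str) -> Optional[float]:
    """Convert an importance label to a 0-1 significance, or None if unrecognized."""
    return IMPORTANT_SHEET_SIGNIFICANCE.get(label.strip().lower())


def build_annotation(importance_label: str,
                     extra_fields: Optional[dict] = None,
                     annotation_version: str = "hypothetical-v1") -> dict:
    """Assemble an annotation dict for one row of the important changes sheet.

    `extra_fields` stands in for whatever other columns get copied over.
    """
    annotation = dict(extra_fields or {})
    # Tag the data so tools reading it back later know how to treat it.
    annotation["annotation_version"] = annotation_version
    significance = significance_from_importance(importance_label)
    if significance is not None:
        annotation["significance"] = significance
    return annotation
```

Unrecognized labels simply omit `significance` rather than guessing a value, which seems like the safer default until the mapping is agreed on.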