Consult the harvesting infrastructure diagram for an illustration of the key components. Ask Mark Redar for access to them. Note that you will need to log onto the blackstar machine to run commands, using these PuTTY connection instructions (on SharePoint):
- Collection Registry
- ingest front machine (stage - harvest-stg.cdlib.org) and ingest front machine (production - harvest-prd.cdlib.org), for proxy access to:
- RQ Dashboard
- CouchDB stage
- CouchDB production
- Solr stage
- Solr production
- Elastic Beanstalk
- AWS CloudWatch
As of February 2016, the process to publish a collection to production is as follows:
- Create collection, add harvest URL & mapping/enrichment chain
- Select "Queue harvest for collection on normal queue" on the registry page for the collection
- Check that there is a worker listening on the queue; if not, start one (see Stage Worker)
- Wait until the harvest job finishes, hopefully without error. Now the collection has been harvested to the stage CouchDB.
- The first round of QA can be performed there, in CouchDB (see CouchDB stage)
- Push the new CouchDB docs into the stage Solr index: select "Queue sync solr index for collection(s) on normal-stage" on the registry page for the collection (see Updating Solr)
- QA the stage Solr index in the public interface (see Solr stage)
- When ready to publish to production, edit Collection in the registry and check the "Ready for publication" box and save.
- Select the "Queue sync to production couchdb for collection" action (see Syncing CouchDB)
- Check that there is a worker in the production environment listening on the normal prod queue; if not, start one (see Production Worker)
- Wait until the sync job finishes. Now the collection has been harvested to the production CouchDB.
- Sync the new docs to the production Solr by starting the sync from the registry for the new collections. At this point the collection is in the new, candidate Calisphere Solr index.
- Once QA is done on the candidate index and it is ready to push to Calisphere, push the index to S3.
- Clone the existing Solr API Elastic Beanstalk and point it to the packaged index on S3.
- Swap the URLs of the old Solr API Elastic Beanstalk and the new Elastic Beanstalk.
Preliminary steps: add collection to the Collection Registry and define harvesting endpoint
- 1. Managing workers to process harvesting jobs
- 1.1. Start stage workers
- 1.2. Checking the status of a worker
- 1.3. Stop or terminate stage worker instances
- 2. Run harvest jobs: non-Nuxeo sources
- 2.1. New harvest or re-harvest?
- 2.2. Harvest metadata to CouchDB stage
- 2.3. Harvest preview and thumbnail images
- 3. Run harvest jobs: Nuxeo
- 3.1. New harvest or re-harvest?
- 3.2. Harvest and process access files from Nuxeo ("deep harvesting")
- 3.3. Harvest metadata to CouchDB stage
- 3.4. Harvest preview image, also used for thumbnails
- 4. QA check collection in CouchDB stage
- 4.1. Check the number of records in CouchDB
- 4.2. Additional QA checking
- 5. Sync CouchDB stage to Solr stage
- 6. QA check collection in Solr stage
- 7. QA check in Calisphere stage UI
Moving a harvest to production
- 8. Manage workers to process harvesting jobs
- 9. Sync the collection from CouchDB stage to CouchDB production
- 10. Sync the collection from CouchDB production to Solr production
- 11. QA check candidate Solr index in Calisphere UI
- 12. Generate and review QA report for candidate Solr index
Updating Elastic Beanstalk with candidate Solr index
Removing items or collections (takedown requests)
Restoring collections from production
- Running long processes
- Picking up new harvester or ingest code
- Recreating the Solr Index from scratch
- How to find a CouchDB source document for an item in Calisphere
- Creating/Harvesting with High Stage Workers
[Addendum: Creating new AMI images - Developers only]
Pull the ucldc/ingest_deploy project.
Get the ansible vault password from Mark. It's easiest if you create a file (perhaps ~/.vault-password-file) to store it in and alias ansible-playbook to ansible-playbook --vault-password-file=~/.vault-password-file. Set the file's mode to 600.
Create an htdigest entry by running:
htdigest -c tmp.pswd ingest <username>
This will prompt for a password, which is easy to generate with pwgen. Copy the resulting line from tmp.pswd.
Then run:
ansible-vault edit --vault-password-file=~/.vault-password-file ingest_deploy/ansible/roles/ingest_front/vars/digest_auth_users.yml
Entries in this file are htdigest lines, each preceded by a - to make a YAML list, e.g.:
---
digest_auth_users:
- "u1:ingest:435srrr3db7b180366ce7e653493ca39"
- "u1:ingest:rrrr756e5aacde0262130e79a888888c"
- "u2:ingest:rrrr1cd0cd7rrr7a7839a5c1450bb8bc"
From a machine that can already access the ingest front machine with ssh run:
ansible-playbook -i hosts --vault-password-file=~/.vault_pass_ingest provision_front.yml
This will install the users.digest to allow access for the monitoring user.
Add your public ssh key to the keys file at https://github.com/ucldc/appstrap/tree/master/cdl/ucldc-operator-keys.txt
From a machine that can already access the ingest front machine with ssh run:
ansible-playbook -i hosts --vault-password-file=~/.vault_pass_ingest provision_front.yml
This will add your public key to the ~/.ssh/authorized_keys for the ec2-user on the ingest front machine.
The first step in the harvesting process is to add the collection(s) for harvesting into the Collection Registry. This process is described further in Section 8 of our OAC/Calisphere Operations and Maintenance Procedures.
When establishing the entries, you'll need to determine the harvesting endpoint: Nuxeo, OAC, or an external source.
We use "transient" Redis Queue-managed (RQ) worker instances to process harvesting jobs in either a staging or production environment. They can be created as needed and then deleted after use. Once the workers have been created and provisioned, they will automatically look for jobs in the queue and run the full harvester code for those jobs.
1.1. Start stage workers
- Log onto blackstar and run
sudo su - hrv-stg
- To start some worker machines (bare ec2 spot instances), run:
ansible-playbook ~/code/ansible/start_ami.yml --extra-vars="count=1"
- For on-demand instances, run:
snsatnow ansible-playbook ~/code/ansible/start_ami_ondemand.yml --extra-vars="count=1"
- For an extra large (and costly!) on-demand instance (e.g., m4.2xlarge, m4.4xlarge), run:
ansible-playbook ~/code/ansible/start_ami_ondemand.yml --extra-vars="worker_instance_type=m4.2xlarge"
If you create an extra large instance, make sure you terminate it after the harvesting job is completed!
The count=## parameter sets the number of instances to create. For harvesting one small collection you can set this to count=1; to re-harvest all collections, you can set it to count=20. For anything in between, use your judgment.
The default instance creation will attempt to get instances from the "spot" market so that it is cheaper to run the workers. Sometimes the spot market price can get very high and the spot instances won't work. You can check the pricing by issuing the following command on blackstar, hrv-stg user:
aws ec2 describe-spot-price-history --instance-types m3.large --availability-zone us-west-2c --product-description "Linux/UNIX (Amazon VPC)" --max-items 2
Our spot bid price is set to $0.133, which is the current (as of 2016-08-03) on-demand price. If the history of spot prices is greater than that, or if you see large fluctuations in the pricing, you can request an on-demand instance instead by running the on-demand playbook (NOTE: if you wrap the command with snsatnow, the backslash \ before each quote is required, as in the examples above):
ansible-playbook ~/code/ansible/start_ami_ondemand.yml --extra-vars="count=3"
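If you just want the most recent spot price as a single number, a jq one-liner along these lines can help (a sketch; it assumes the standard AWS CLI JSON output):
# print only the latest m3.large spot price
aws ec2 describe-spot-price-history --instance-types m3.large --availability-zone us-west-2c --product-description "Linux/UNIX (Amazon VPC)" --max-items 1 | jq -r '.SpotPriceHistory[0].SpotPrice'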
Sometimes the status of the worker instances is unclear.
To check the processing status for a given worker, log into Blackstar and SSH to the particular stage or prod machine.
cd to /var/local/rqworker and locate the worker.log file.
Run tail -f worker.log to view the logs.
You can also use the ec2.py dynamic ansible inventory script with jq to parse the json to find info about the state of the worker instances.
First, refresh the cache for the dynamic inventory:
~/code/ec2.py --refresh-cache
To see the current info for the workers:
get_worker_info.sh
This will report the running or not state, the IPs, ec2 IDs & the size of workers.
You can then see the state of the instance by using jq to filter on the IP:
~/code/ec2.py | jq '._meta.hostvars["<ip address for instance>"].ec2_state'
This will tell you if it is running or not.
To get more information about the instance, just do less filtering:
~/code/ec2.py | jq -C '._meta.hostvars["<ip address for instance>"]' | less -R
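For a quick one-line-per-host summary instead of browsing the full JSON, you can pipe the inventory through jq. This is a sketch; it assumes ec2.py exposes the usual ec2_state and ec2_instance_type hostvars:
# list each host in the dynamic inventory with its state and instance type
~/code/ec2.py | jq -r '._meta.hostvars | to_entries[] | "\(.key)  \(.value.ec2_state)  \(.value.ec2_instance_type)"'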
Once harvesting jobs are completed (see steps below), terminate the worker instances.
- Log into blackstar and run
sudo su - hrv-stg
- To just stop instances (so they can be restarted later), run the stop_workers.yml playbook shown below.
- To terminate instances, run:
ansible-playbook -i ~/code/ec2.py ~/code/ansible/terminate_workers.yml <--limit=10.60.?.?>
You can use the --limit parameter to specify a range of IP addresses for deletion.
- To force terminate an instance, append --tags=terminate-instances
- You'll receive a prompt to confirm that you want to spin down the instance; hit Return to confirm.
We should now leave one instance in a "stopped" state. Terminate all but one of the instances then run:
ansible-playbook -i ~/code/ec2.py ~/code/ansible/stop_workers.yml
This will stop the instance so it can be brought up easily. get_worker_info.sh should report the instance as "stopping" or "stopped".
Before initiating a harvest, confirm whether the collection has previously been harvested or whether it's a new collection.
If the collection has previously been harvested and is viewable in the Calisphere stage UI (http://calisphere-data.cdlib.org/), then delete the collection from CouchDB stage and Solr stage:
- Log into the Collection Registry and look up the collection
- Run "Queue deletion of documents from CouchDB stage".
- Then run "Queue deletion of documents from Solr stage".
- You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.
If you need more control of the process (e.g., to put the jobs on a different queue), you can use the following commands on the dsc-blackstar role account:
./bin/delete_couchdb_collection.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26275
./bin/queue_delete_solr_collection.py [email protected] high-stage 26275
This process will harvest metadata from the target system into a resulting CouchDB record.
- From the Collection Registry, select "Queue harvest to CouchDB stage"
- You should then get a feedback message verifying that the collections have been queued
- You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.
If you need more control of the process (i.e. to put on a different queue), you can use the following command syntax on the dsc-blackstar role account:
queue_harvest.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26943
This process will hit the URL referenced in isShownAt in the CouchDB record to derive a small preview image (used for the object landing page); that preview image is also used for thumbnails in search/browse and related item results.
- From the Collection Registry, select "Queue image harvest"
- You should then get a feedback message verifying that the collections have been queued
- You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.
If you need more control of the process (e.g., to put the job on a different queue), you can use the following command syntax on the dsc-blackstar role account:
queue_image_harvest.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26943
Before initiating a harvest, confirm whether the collection has previously been harvested or whether it's a new collection.
If the collection has previously been harvested and is viewable in the Calisphere stage UI (http://calisphere-data.cdlib.org/), then delete the collection from CouchDB stage and Solr stage:
- Log into the Collection Registry and look up the collection
- Run "Queue deletion of documents from CouchDB stage".
- Then run "Queue deletion of documents from Solr stage".
- You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.
If you need more control of the process (e.g., to put the jobs on a different queue), you can use the following commands on the dsc-blackstar role account:
./bin/delete_couchdb_collection.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26275
./bin/queue_delete_solr_collection.py [email protected] high-stage 26275
The process pulls files from the "Main Content File" section in Nuxeo, and formats them into access files for display in Calisphere. If you only need to pick up metadata changes in Nuxeo, skip this step. Here's what the process does:
- It stashes a high quality copy of any associated media or text files on S3. These files appear on the object landing page, for interactive viewing:
- If image, creates a zoomable jp2000 version and stashes it on S3 for use with our IIIF-compatible Loris server. Tools used to convert the image include ImageMagick and Kakadu.
- If audio, stashes mp3 on s3.
- If file (i.e. PDF), stashes on s3
- If video, stashes mp4 on s3
- Creates a small preview image (used for the object landing page) and complex object component thumbnails, and stashes them on S3. For these particular formats, it does the following:
- If video, creates a thumbnail and stashes it on S3. The thumbnail is created by capturing the middle frame of the video using the ffmpeg tool.
- If PDF, creates a thumbnail and stashes it on S3. The thumbnail is created from an image of the first page of the PDF, using ImageMagick.
- Compiles full metadata and structural information (such as component order) for all complex objects, in the form of a media.json file. To view the media.json for a given object, use this URL syntax (where <UID> is the Nuxeo unique identifier, e.g., 70d7f57a-db0b-4a1a-b089-cce1cc289c9e): https://s3.amazonaws.com/static.ucldc.cdlib.org/media_json/<UID>-media.json
To run the "deep harvest" process:
- Log into the Collection Registry and look up the collection
- Run "Queue Nuxeo deep harvest" from the drop-down.
- You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.
If you need more control of the process (e.g., to put the job on a different queue), you can use the following command syntax on the dsc-blackstar role account:
queue_deep_harvest.py [email protected] high-stage 26959
If there are problems with individual items, you can do a deep harvest for just one object by its Nuxeo path. You need to log onto dsc-blackstar and sudo to the hrv-stg role account. Then:
queue_deep_harvest_single_object.py "<path to asset, wrapped in quotes>"
e.g.
queue_deep_harvest_single_object.py "/asset-library/UCR/Manuscript Collections/Godoi/box_01/curivsc_003_001_005.pdf"
This will run 4 jobs: one for grabbing files, one for creating the jp2000 files for access & IIIF, one to create thumbnails, and finally a job to produce the media.json file.
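If several objects in a collection need reprocessing, one approach is to loop over a file of Nuxeo asset paths. A sketch, where paths.txt is a hypothetical file with one asset path per line:
# run a single-object deep harvest for every path listed in paths.txt (hypothetical file)
while IFS= read -r path; do
  queue_deep_harvest_single_object.py "$path"
done < paths.txt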
This process will harvest metadata from Nuxeo into a resulting CouchDB record.
- From the Collection Registry, select "Queue harvest to CouchDB stage"
- You should then get a feedback message verifying that the collections have been queued
- You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.
If you need more control of the process (i.e. to put on a different queue), you can use the following command syntax on the dsc-blackstar role account:
queue_harvest.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26943
This process will hit the URL referenced in isShownBy in the CouchDB record to derive a small preview image (used for the object landing page); that preview image is also used for thumbnails in search/browse and related item results.
- From the Collection Registry, select "Queue image harvest"
- You should then get a feedback message verifying that the collections have been queued
- You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.
If you need more control of the process (e.g., to put the job on a different queue), you can use the following command syntax on the dsc-blackstar role account:
queue_image_harvest.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26943
If there are problems with individual items, you can run the process on a specific object (or multiple objects) by referencing the harvest ID. You need to log onto dsc-blackstar and sudo to the hrv-stg role account. Then:
python ~/bin/queue_image_harvest_for_doc_ids.py [email protected] normal-stage 23065--http://ark.cdlib.org/ark:/13030/k600073n
For multiple items, separate the harvest IDs with commas:
python ~/bin/queue_image_harvest_for_doc_ids.py [email protected] normal-stage 23065--http://ark.cdlib.org/ark:/13030/k600073n,23065--http://ark.cdlib.org/ark:/13030/k6057mxb
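If you have more than a handful of harvest IDs, you can keep them in a file and build the comma-separated argument on the command line. A sketch, assuming a hypothetical ids.txt with one harvest ID per line:
# join the IDs in ids.txt with commas and pass them as a single argument
python ~/bin/queue_image_harvest_for_doc_ids.py [email protected] normal-stage "$(paste -sd, ids.txt)"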
- Query CouchDB stage using this URL syntax. Replace the key parameter with the key for the collection:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/by_provider_name_count?key="26189"
- Results in the "value" parameter indicate the total number of metadata records harvested; this should align with the expected results (see the command-line example after this list).
- If you have results, continue with QA checking the collection in CouchDB stage and Solr stage.
- If there are no results, you will need to troubleshoot and re-harvest. See What to do when harvests fail section for details.
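You can also pull the count from the command line with curl and jq. A sketch; depending on where you run it, you may need to pass the digest-auth credentials used for proxy access to CouchDB stage (e.g., curl --digest -u <user>):
# print the record count for collection 26189 (0 if the collection has no rows)
curl -s 'https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/by_provider_name_count?key="26189"' | jq '.rows[0].value // 0'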
The objective of this part of the QA process is to ensure that source metadata (from a harvesting target) is correctly mapped through to CouchDB. The suggested method is to review 1) the source metadata (e.g., original MARC21 record, original XTF-indexed metadata*) vis-a-vis 2) a random sample of CouchDB results and 3) the metadata crosswalk. Things to check:
- Verify if metadata from the source record was carried over into CouchDB correctly: did any metadata get dropped?
- Verify the metadata mappings: was the mapping handled correctly, going from the source metadata through to CouchDB, as defined in the metadata crosswalk?
- Verify if any needed metadata remediation was completed (as defined in the metadata crosswalk) -- e.g., were rights statuses and statements globally applied?
- Verify DPLA/CDL required data values -- are they present? If not, we may need to go back to the data provider to supply the information -- or potentially supply it for them (through the Collection Registry)
- Verify the data values used within the various metadata elements:
- Do the data values look "correct" (e.g., for Type, data values are drawn from the DCMI Type Vocabulary)?
- Any funky characters or problems with formatting of the data?
- Any data coming through that looks like it may have underlying copyright issues (e.g., full-text transcriptions)?
- Are there any errors or noticeable problems?
NOTE: To view the original XTF-indexed metadata for content harvested from Calisphere:
- Go to Collection Registry, locate the collection that was harvested from XTF, and skip to the "URL harvest" field -- use that URL to generate a result of the XTF-indexed metadata (view source code to see raw XML)
- Append the following to the URL, to set the number of results:
docsPerPage=###
The Solr update process checks for a number of fields and will reject records that are missing these required values.
Objects with a sourceResource.type value of 'image' without a stored image (no 'object' field in the record) are not put into the Solr index. This view identifies these objects in couchdb.
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object
The base view reports the total count of image-type records without harvested images. To see how many per collection, add "?group=true" to the URL:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object?group=true
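To turn the grouped output into a simple per-collection listing on the command line, you can parse the rows with jq (a sketch; assumes the standard CouchDB grouped-view JSON, and the ingest front proxy credentials may be required):
# print "<collection id>  <count>" for every collection with image-type records missing an object
curl -s 'https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object?group=true' | jq -r '.rows[] | "\(.key)\t\(.value)"'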
To find the number for a given collection use the "key" parameter:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object?key="<collection id>"
NOTE: the double quotes are necessary in the URL.
To see the IDs of the records with this issue, turn off the reduce function:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object?key="<collection id>"&reduce=false
Use the include_docs parameter to add the records to the view output:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/image_type_missing_object?key="<collection id>"&reduce=false&include_docs=true
This view reports records missing an isShownAt value:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/missing_isShownAt
As with the above, you can add various parameters to get different information in the result.
A similar view reports records missing a title:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/missing_title
- Generate a count of all objects for a given collection in CouchDB:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/by_provider_name_count?key="26189"
- Generate a results set of metadata records for a given collection in CouchDB, using this URL syntax:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_list/has_field_value/by_provider_name_wdoc?key="10046"&field=originalRecord.subject&limit=100
Each metadata record in the results set will have a unique ID (e.g., 26094--00000001). This can be used for viewing the metadata within the CouchDB UI.
- Parameters:
- field: Optional. Limit the display output to a particular field.
- key: Optional. Limits by collection, using the Collection Registry numeric ID.
- limit: Optional. Sets the number of results
- originalRecord: Optional. Limit the display output to a particular metadata field; specify the CouchDB data element (e.g., title, creator)
- include_docs=true: Optional. Will include the complete metadata record within the results set (JSON output)
- value: Optional. Search for a particular value, within a results set of metadata records from a particular collection. Note: exact matches only!
- group=true: Group the results by key
- reduce=false: do not count up the results, display the individual result rows
- To generate a results set of data values within a particular element (e.g., Rights), for metadata records from all collections:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/qa_reports/_view/sourceResource.rights_value?limit=100&group_level=2
- To check if there are null data values within a particular element (e.g., isShownAt), for metadata records from all collections:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/qa_reports/_view/isShownAt_value?limit=100&group_level=2&start_key=["__MISSING__"]
- To view a result of raw CouchDB JSON output:
https://harvest-stg.cdlib.org/couchdb/ucldc/_design/all_provider_docs/_view/by_provider_name?key="26094"&limit=1&include_docs=true
- Consult the CouchDB guide for additional query details.
- Log into CouchDB
- In the "Jump to" box, enter the unique ID for a given metadata record (e.g., 26094--00000001)
- You can now view the metadata in either its source format or mapped to CouchDB fields
This process will update the Solr stage index with records from CouchDB stage:
- From the Collection Registry, select "Queue sync from CouchDB stage to Solr stage"
- You should then get a feedback message verifying that the collections have been queued
- You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.
If you need more control of the process (e.g., to put the job on a different queue), you can run queue_sync_to_solr.py on the dsc-blackstar role account:
queue_sync_to_solr.py [email protected] high-stage 26943
You can view the raw results in Solr stage; this may be helpful to verify mapping issues or discrepancies in data between CouchDB and Solr stage.
- Log into Solr to conduct queries
- Generate a count of all objects for a given collection in Solr (see the command-line example after this list):
https://harvest-stg.cdlib.org/solr/dc-collection/query?q=collection_url:%22https://registry.cdlib.org/api/v1/collection/26559/%22
- Generate counts for all collections:
https://harvest-stg.cdlib.org/solr/dc-collection/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.query=true&facet.field=collection_url&facet.limit=-1&facet.sort=count
- Consult the Solr guide for additional query details.
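To pull just the hit count for a collection from the command line, the numFound field of the standard Solr response can be extracted with jq. A sketch (the Solr stage proxy credentials may be required):
# print only the number of records Solr stage holds for collection 26559
curl -s 'https://harvest-stg.cdlib.org/solr/dc-collection/query?q=collection_url:%22https://registry.cdlib.org/api/v1/collection/26559/%22&rows=0&wt=json' | jq '.response.numFound'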
You can preview the Solr stage index in the Calisphere UI at http://calisphere-data.cdlib.org/.
To immediately view results, you can QA the Solr stage index on your local workstation, following these steps ("Windows install"). In the run.bat configuration file, point UCLDC_SOLR_URL to https://harvest-stg.cdlib.org/solr_api.
Follow the steps outlined above for starting and managing worker instances -- but once logged into blackstar, use sudo su - hrv-prd
to create workers in the production environment.
Once the CouchDB and Solr stage data looks good and the collection looks ready to publish to Calisphere, start by syncing CouchDB stage to the CouchDB production:
- In the Registry, edit the collection and check the box "Ready for publication" and save the collection.
- Then select "Queue Sync to production CouchDB for collection" from the actions on the Collection page.
If you need more control of the process (e.g., to put the job on a different queue), you can run queue_sync_couchdb_collection.py on the dsc-blackstar role account:
./bin/queue_sync_couchdb_collection.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26681/
This process will update the Solr production index ("candidate Solr index") with records from CouchDB production:
- From the Collection Registry, select "Queue sync from CouchDB production to Solr production"
- You should then get a feedback message verifying that the collections have been queued
- You can track the progress through the RQ Dashboard; once the jobs are done, a results report will be posted to the #dsc_harvesting_report channel in Slack.
If you need more control of the process (e.g., to put the job on a different queue), you can run queue_sync_to_solr.py on the dsc-blackstar role account:
queue_sync_to_solr.py [email protected] high-stage 26943
You can preview the candidate Solr index in the Calisphere UI at http://calisphere-test.cdlib.org/.
To immediately view results, you can QA the candidate Solr index on your local workstation, following these steps ("Windows install"). In the run.bat configuration file, point UCLDC_SOLR_URL to https://harvest-prd.cdlib.org/solr_api.
Generate and review a QA report for the candidate Solr index, following these steps. The main QA report in particular summarizes differences in item counts in the candidate Solr index compared with the current production index.
This section describes how to update an Elastic Beanstalk configuration to point to a new candidate Solr index stored on S3. This will update the specified Calisphere front-end web application so that it points to the data from Solr:
- Log onto blackstar & sudo su - hrv-prd and then follow the instructions here: update_beanstalk
- After any new index is moved into publication, run the following commands, so that ARK URLs correctly resolve for any new incoming harvested objects with embedded ARKs: https://gist.github.com/tingletech/475ff92147b6f93f6c3f60cebdf5e507
- Last, update our Google Doc that lists out new collections that were published. (The entries can be cut-and-pasted from the QA reporting spreadsheet): https://docs.google.com/spreadsheets/d/1FI2h6JXrqUdONDjRBETeQjO_vkusIuG5OR5GWUmKp1c/edit#gid=0 . Sherri uses this Google Doc for CDLINFO postings, highlighting newly-published collections.
TODO: add how to run the QA spreadsheet generating code
Removing items or collections involves deleting records from the CouchDB stage and production environments, as well as the Solr stage and production environments, and then updating the Elastic Beanstalk.
To remove individual items:
- Log into CouchDB stage; search for and delete the specific item record. Repeat the process on CouchDB production -or-
- Create a list of the CouchDB identifiers for the items and add them to a file (one per line). Then run delete_couchdb_id_list.py with the file as input (see the sketch after this list):
delete_couchdb_id_list.py <file with list of ids>
- From the Collection Registry, select "Queue sync from CouchDB stage to Solr stage" and "Queue sync from CouchDB production to Solr production"
- Update Elastic Beanstalk with the updated Solr index
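For the id-list approach above, the input file is simply one CouchDB identifier per line. A sketch (the filename and identifiers here are hypothetical):
# takedown_ids.txt holds one CouchDB identifier per line
cat > takedown_ids.txt <<EOF
26094--00000001
26094--00000002
EOF
delete_couchdb_id_list.py takedown_ids.txt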
To remove an entire collection:
- From the Collection Registry, select "Queue deletion of documents from CouchDB stage", "Queue deletion of documents from Solr stage", "Queue deletion of documents from CouchDB production", and "Queue deletion of documents from Solr production"
- Update the Collection Registry entry, setting "Ready to publish" to "None" -- and change the harvesting endpoint to "None"
- Update Elastic Beanstalk with the updated Solr index
If you need more control of the process (e.g., to put the jobs on a different queue), you can use the following commands on the dsc-blackstar role account:
./bin/delete_couchdb_collection.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/26275
./bin/queue_delete_solr_collection.py [email protected] high-stage 26275
We've had a couple of cases where a collection was deleted from the pre-production index for re-harvesting, but the re-harvest was not successful and we want to publish a new index. This script takes the documents from one Solr index and pushes them to another. It can be run from the hrv-stg or hrv-prd account. In either case, the source documents come from solr.calisphere.org, which drives Calisphere. Depending on which role account you are in, it will update either the "stage" Solr or the pre-production Solr.
- Log onto the appropriate role account (hrv-stg or hrv-prd). That will set the context for the originating solr index, from which you want to push data.
- Run sync_solr_documents.py <collection id> to push the data to the target Solr index.
The snsatnow wrapper script may be used to run any long-running process. It will background and detach the process so you can log out. When the process finishes or fails, a message will be sent to the dsc_harvesting_report Slack channel.
To use the script, just prepend it to your command invocation:
snsatnow <cmd> --<options> <arg1> <arg2>....
NOTE: if your command has arguments that are surrounded by quotes (") you'll need to escape those by putting a backslash (\) in front of them.
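For example, to start two spot worker instances in the background and get a Slack message when the playbook finishes (a sketch; note the escaped quotes around the --extra-vars value, per the rule above):
snsatnow ansible-playbook ~/code/ansible/start_ami.yml --extra-vars=\"count=2\"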
When new harvester or ingest code is pushed, you need to create a new generation of worker machines to pick up the new code:
- First, terminate the existing machines:
ansible-playbook -i ~/code/ec2.py ~/code/ingest_deploy/ansible/terminate_workers.yml <--limit=10.60.?.?>
- Then go through the worker create process again, creating and provisioning machines as needed.
The solr index is run in a docker container. To make changes to the schema or other configurations, you need to recreate the docker image for the container.
NOTE: THIS NEEDS UPDATING
To do so in the ingest environment, run ansible-playbook -i hosts solr_docker_rebuild.yml. This will remove the docker container & image, rebuild the image, remove the index files, and run a new container based on the latest Solr config in https://github.com/ucldc/solr_api/.
You will then have to run /usr/local/solr-update.sh --since=0 to reindex the whole CouchDB database.
See the new tool for automating this here: https://github.com/mredar/ucldc_api_data_quality/blob/master/reporting/README.md
Tracing back to the document source in CouchDB is critical to diagnose problems with data and images.
Get the Solr id for the item. This is the part of the URL after the /item/ without the final slash. For https://calisphere.org/item/32e2220c1e918cf17f0597d181fa7e3e/, the Solr ID is 32e2220c1e918cf17f0597d181fa7e3e.
Now go to the Solr index of interest and query for the id: https://harvest-stg.cdlib.org/solr/dc-collection/select?q=32e2220c1e918cf17f0597d181fa7e3e&wt=json&indent=true
Find the harvest_id_s value, in this case "26094--LAPL00050887". Then plug this into CouchDB for the ucldc database:
https://harvest-stg.cdlib.org/couchdb/ucldc/26094--LAPL00050887 (or with the UI - https://harvest-stg.cdlib.org/couchdb/_utils/document.html?ucldc/26094--LAPL00050887)
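The same lookup can be scripted end to end. A sketch, assuming harvest_id_s is a single-valued string field and that curl has whatever credentials the Solr/CouchDB proxies require:
# resolve a Calisphere item id to its CouchDB source document
ID=32e2220c1e918cf17f0597d181fa7e3e
HARVEST_ID=$(curl -s "https://harvest-stg.cdlib.org/solr/dc-collection/select?q=${ID}&wt=json" | jq -r '.response.docs[0].harvest_id_s')
curl -s "https://harvest-stg.cdlib.org/couchdb/ucldc/${HARVEST_ID}" | jq .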
Sometimes you may need to create one or more "High Stage" workers, for example if the normal stage worker queue is very full and you need to run a harvest job without waiting for the queue to empty. The process is performed from the hrv-stg command line as follows.
Creating high stage workers:
- Log onto blackstar and run
sudo su - hrv-stg
- Create one or more worker machines just as you would in the "developer" (see below) process:
snsatnow ansible-playbook ~/code/ansible/create_worker.yml --extra-vars=\"count=1\"
- After workers are created, run get_worker_info.sh and compare the results to the currently provisioned/running "normal" workers in the RQ dashboard to determine the IP addresses of the new workers.
- Provision with the --extra-vars="rq_work_queues=['high-stage']" switch to make the new workers high stage workers. Also use the --limit switch with the IP addresses of the new workers from the step above, so that you only provision the new workers. Do NOT re-provision running workers! Full example command:
snsatnow ansible-playbook -i ~/code/ec2.py ~/code/ansible/provision_worker.yml --limit=10.60.29.* --extra-vars="rq_work_queues=['high-stage']"
Running jobs on high stage workers:
- From the hrv-stg command line, run the following command to queue a high-stage harvest, providing your EMAIL address and the collection # to harvest for XXXXX where appropriate:
./bin/queue_harvest.py [email protected] high-stage https://registry.cdlib.org/api/v1/collection/XXXXX/
- To queue an image harvest or solr sync, replace the first part of the command above with ./bin/queue_image_harvest.py or ./bin/queue_sync_to_solr.py, respectively.
- More commands can be found in the bin folder by running ls ./bin from the command line. Most are self-explanatory from the script titles. Again, just replace the first part of the full command above with ./bin/other-script-here.py as needed.
- When finished harvesting, terminate the high-stage workers as you would any other. EX:
ansible-playbook -i ~/code/ec2.py ~/code/ansible/terminate_workers.yml <--limit=10.60.?.?>
What to do when harvests fail
First, take a look at the RQ Dashboard. There will be a bit of the error message there. Hopefully this identifies the error and you can fix whatever is going wrong.
- Worker forcibly terminated, while job was in-progress:
ShutDownImminentException('shut down imminent (signal: %s)' % signal_name(signum), info) ShutDownImminentException: shut down imminent (signal: SIGALRM)
- (More forthcoming...)
If you need more extensive access to logs, they are all stored on the AWS CloudWatch platform. The /var/local/rqworker & /var/local/akara directories contain the logs from the worker processes & the Akara server on a worker instance. The logs are named with the instance id & ip address, e.g. ingest-stage-i-127546c9-10.60.28.224
From the blackstar machine you can access the logs on CloudWatch using the scripts in the bin directory
First, get the IPs of the worker machines by running get_worker_info.sh
Then for the worker whose logs you want to examine:
get_log_events_for_rqworker.sh <worker ip>
This outputs the rqworker log; for the akara log, use:
get_log_events_for_akara.sh <worker ip>
If you need to go back further in the log history, for now ask Mark.
If this doesn't get you enough information, you can ssh to a worker instance and watch the logs in real time: tail -f /var/local/rqworker/worker.log or tail -f /var/local/akara/logs/error.log.
Use the following script in the ucldc_api_data_quality/reporting directory
(following the steps at https://github.com/mredar/ucldc_api_data_quality/tree/master/reporting) to generate a report for the object. The value is the id for the object, as reflected in Solr or CouchDB (e.g., 6d445613-63d3-4144-a530-718900676db9):
python get_couchdata_for_calisphere_id.py <ID>
Example report result:
===========================================================================
Calisphere/Solr ID: 6d445613-63d3-4144-a530-718900676db9
CouchDB ID: 26883--6d445613-63d3-4144-a530-718900676db9
isShownAt: https://calisphere.org/item/6d445613-63d3-4144-a530-718900676db9
isShownBy: https://nuxeo.cdlib.org/Nuxeo/nxpicsfile/default/6d445613-63d3-4144-a530-718900676db9/Medium:content/
object: ce843950f622d303b83256add5b19d34
preview: https://calisphere.org/clip/500x500/ce843950f622d303b83256add5b19d34
===========================================================================
The URL in isShownBy reflects the endpoint to a file, which is used by the harvesting code ("Queue image harvest to CouchDB stage" action) to derive a small preview image (used for the object landing page); that preview image is also used for thumbnails in search/browse and related item results. Note that you can also verify isShownBy by looking up the object in CouchDB.
The URL in preview points to the resulting preview image.
Double-check the URL in the preview field. If there's no functional URL in preview (the value indicates "None"), then a file was not successfully harvested. To fix:
- Try re-running the process to harvest preview and thumbnail images
- Check again to see if the URL now shows up in the preview field. If so, sync from CouchDB stage to Solr stage
For Nuxeo-based objects, the following logic is baked into the process for harvesting preview and thumbnail images:
- If object has an image at the parent level, use that. Otherwise, if component(s) have images, use the first one we can find
- If an object has a PDF or video at parent level, use the image stashed on S3
- Otherwise, return "None"
No access files, preview image (for PDF or video objects), or complex object component thumbnails? (Nuxeo only)
The media.json output created through the "deep harvest" process references URL links back to the source files in Nuxeo. If there's no media.json file -- or if the media.json has broken or missing URLs -- then the files could not be successfully harvested (a quick command-line check is shown after this list). To fix:
- Try re-running the deep harvest for a single object to regenerate the media.json and files.
- Check the media.json again, to confirm that it was generated and that its URLs resolve to files. If AOK, sync from CouchDB stage to Solr stage.
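A quick way to inspect the media.json from the command line (a sketch; replace <UID> with the Nuxeo unique identifier, as in the URL syntax described for the deep harvest process above):
# fetch and pretty-print the media.json for a Nuxeo object
curl -s "https://s3.amazonaws.com/static.ucldc.cdlib.org/media_json/<UID>-media.json" | jq .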
Persistent older versions of access files, preview image (for PDF or video objects), or complex object component thumbnails? (Nuxeo only)
If older versions of the files don't clear out after re-running a deep harvest, you can manually queue the image harvest to force it to re-fetch images from Nuxeo. First, you need to clear the "CouchDB ID -> image url" cache and then set the image harvest to run with the flag --get_if_object (so get the image even if the "object" field exists in the CouchDB document)
- Log onto blackstar & sudo su - hrv-stg
- Run python ~/bin/redis_delete_harvested_images_script.py <collection_id>. This will produce a file called delete_image_cache-<collection_id> in the current directory.
- Run redis.sh < delete_image_cache-<collection_id>. This will clear the cache of previously harvested URLs.
- Run python ~/bin/queue_image_harvest.py [email protected] normal-stage https://registry.cdlib.org/api/v1/collection/<collection_id>/ --get_if_object
Ansible, Packer and Vagrant project for building and running the ingest environment on AWS and locally. Currently only the Ansible parts are working; we still need to get a local Vagrant version working.
- Ansible (Version X.X)
- Log onto blackstar and run
sudo su - hrv-stg
- To start some worker machines (bare ec2 spot instances), run:
snsatnow ansible-playbook ~/code/ansible/create_worker.yml --extra-vars=\"count=1\"
- For on-demand instances, run:
snsatnow ansible-playbook ~/code/ansible/create_worker_ondemand.yml --extra-vars=\"count=1\"
- For an extra large (and costly!) on-demand instance (e.g., m4.2xlarge, m4.4xlarge), run:
ansible-playbook ~/code/ansible/create_worker_ondemand.yml --extra-vars="worker_instance_type=m4.2xlarge"
If you create an extra large instance, make sure you terminate it after the harvesting job is completed!
The count=## parameter sets the number of instances to create. For harvesting one small collection you can set this to count=1; to re-harvest all collections, you can set it to count=20. For anything in between, use your judgment.
With the snsatnow wrapper, the results will be messaged to the dsc_harvesting_report Slack channel when the instances are created.
The default instance creation will attempt to get instances from the "spot" market so that it is cheaper to run the workers. Sometimes the spot market price can get very high and the spot instances won't work. You can check the pricing by issuing the following command on blackstar, hrv-stg user:
aws ec2 describe-spot-price-history --instance-types m3.large --availability-zone us-west-2c --product-description "Linux/UNIX (Amazon VPC)" --max-items 2
Our spot bid price is set to $0.133, which is the current (as of 2016-08-03) on-demand price. If the history of spot prices is greater than that, or if you see large fluctuations in the pricing, you can request an on-demand instance instead by running the on-demand playbook (NOTE: the backslash \ is required):
snsatnow ansible-playbook ~/code/ansible/create_worker_ondemand.yml --extra-vars=\"count=3\"
If you restarted a stopped instance, you don't need to do the steps below
Once this is done and the stage worker instances are in a state of "running", you'll need to provision the workers, which installs the required software and configurations and starts Akara and the worker processes that listen on the specified queues:
- Log onto blackstar and run
sudo su - hrv-stg
- To provision the workers, run:
snsatnow ansible-playbook -i ~/code/ec2.py ~/code/ansible/provision_worker.yml
- Wait for the provisioning to finish; this can take a while, 5-10 minutes is not unusual. If the provisioning process stalls, use ctrl-C to end the process, then re-do the ansible command.
- Check the status of the harvesting process through the RQ Dashboard. You should now see the provisioned workers listed, and acting on the jobs in the queue. You will be able to see the workers running jobs (indicated by a "play" triangle icon) and then finishing (indicated by a "pause" icon).
If you already have provisioned worker machines running jobs, use the --limit=<ip range> (e.g., --limit=10.60.22.*) or --limit=<ip>,<ip> (e.g., --limit=10.60.29.109,10.60.18.34) parameter to limit the provisioning to the IPs of the newly-created machines (so you don't reprovision a currently running machine). Otherwise, rerunning the provisioning will put the currently running workers in a bad state, and you will then have to log on to the worker and restart the worker process or terminate the machine. Example of full command:
snsatnow ansible-playbook -i ~/code/ec2.py ~/code/ansible/provision_worker.yml --limit=10.60.29.*
AWS assigns unique subnets to the groups of workers you start, so in general, different generations of machines will be distinguished by the different C class subnet. This makes the --limit parameter quite useful.
By default, stage workers will be provisioned to a "normal-stage" queue. To provision them to a different queue -- e.g., "high-stage", use the following command with the --extra-vars parameter:
ansible-playbook -i ~/code/ec2.py ~/code/ansible/provision_worker.yml --limit=10.60.22.123 --extra-vars="rq_work_queues=['high-stage']"
Once you have a new worker up and running with the new code, you need to create an image from it. From the appropriate environment:
ansible-playbook -i hosts ~/code/ansible/create_worker_ami.yml --extra-vars="instance_id=<running worker instance id>"
You can get the instance_id by running get_worker_info.sh.
This will produce a new image named _worker_YYYYMMDD. Note the image id that is returned by this command.
You now need to update the image id for the environment. Edit the file ~/code/ansible/group_vars/ (either stage or prod). Change the worker_ami value to the new image id, e.g.:
worker_ami: ami-XXXXXX
Copyright © 2015, Regents of the University of California All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the University of California nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.