t570 - Tasks to update existing GwETDs with proquest zipfile name, and to download only new ETDs from S3 #572

Merged 3 commits on Dec 15, 2024


@kerchner (Member) commented Dec 9, 2024

Fixes #570.

To test:

BEFORE switching to this branch

While still on the main branch, download a few ETDs from the proquest-etds bucket and copy them to /opt/scholarspace/scholarspace-ingest.

  • Also upload these same ETDs to the proquest-etds-test2 S3 bucket.
  • Run the gwss:ingest_pq_etds task. Then, do a bulkrax import. See https://github.com/gwu-libraries/scholarspace-hyrax/wiki/Bulkrax-imports for details. Make sure to clear out the /opt/scholarspace/scholarspace-ingest folder as well as the /opt/scholarspace/scholarspace-hyrax/tmp/bulkrax_zip folder afterwards.

AFTER switching to this branch

  • Populate .env with the new credential values at the bottom of example.env. There is a bucket called proquest-etds-test2 (in us-east-1) that you can use. You can either create an additional, new credential pair, or contact @kerchner for existing credentials.
  • Ensure that the proquest-etds-test2 bucket contains some ProQuest ETDs (zip files) that your environment does not already have. You can copy a few more ProQuest ETDs over from proquest-etds to accomplish this.
  • Run the gwss:populate_etd_proquest_zipfile task. Edit one of the GwETDs and observe that proquest_zipfile is (correctly) populated.
  • Add a few more ETDs from the proquest-etds S3 bucket to proquest-etds-test2.
  • Run the gwss:download_new_pq_zips task. Observe that the "new" ETDs have been downloaded to /opt/scholarspace/scholarspace-ingest.
  • Run the gwss:ingest_pq_etds task. Then, do a bulkrax import. See https://github.com/gwu-libraries/scholarspace-hyrax/wiki/Bulkrax-imports for details. Observe (via the work edit form) that the newly loaded ETDs are populated with proquest_zipfile values.
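
For reviewers who want a mental model of the "download only new ETDs" step, here is a minimal sketch of the selection logic, not the code in this PR. It assumes the task compares zip file names in the bucket against the proquest_zipfile values already stored on GwETDs. The method name `new_zip_keys` and the sample file names are hypothetical.

```ruby
require 'set'

# Hypothetical helper (not the PR's implementation): given the object keys
# listed in the S3 bucket and the proquest_zipfile values already recorded on
# existing GwETDs, return only the zip keys that still need to be downloaded.
def new_zip_keys(s3_keys, known_zipfiles)
  # Works that were never populated have a nil proquest_zipfile, so drop nils.
  known = Set.new(known_zipfiles.compact.map { |name| File.basename(name) })

  s3_keys
    .select { |key| File.extname(key).casecmp?('.zip') } # ignore non-zip objects
    .reject { |key| known.include?(File.basename(key)) } # skip zips already ingested
end

# Illustrative data only; these file names are made up for the example.
s3_keys = ['etdadmin_upload_1001.zip', 'etdadmin_upload_1002.zip', 'README.txt']
known   = ['etdadmin_upload_1001.zip', nil]

puts new_zip_keys(s3_keys, known)
```

Under this assumption, the gwss:populate_etd_proquest_zipfile step matters because any existing GwETD without a proquest_zipfile value would look "new" and be downloaded again.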

@dolsysmith
Contributor

Update to documentation: the gwss:download_new_pq_zips task requires the download destination path as an argument: bundle exec rails gwss:download_new_pq_zips['/opt/scholarspace/scholarspace-ingest/etd_zips']
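
For context on the bracket syntax, this is a minimal sketch of how a Rake task can declare a positional argument. It is not the actual gwss task definition: the `demo` namespace, the `dest_dir` argument name, and the `require_dest_dir` helper are all hypothetical.

```ruby
require 'rake'

extend Rake::DSL # exposes `namespace` and `task` outside a Rakefile

# Hypothetical helper: fail early when no destination path was supplied.
def require_dest_dir(dest_dir)
  if dest_dir.to_s.strip.empty?
    raise ArgumentError, 'Pass a destination path, e.g. task_name[/some/dir]'
  end
  dest_dir
end

namespace :demo do
  # The [:dest_dir] list names the argument supplied in square brackets.
  task :download_new_zips, [:dest_dir] do |_task, args|
    dest_dir = require_dest_dir(args[:dest_dir])
    puts "Would download new zips to #{dest_dir}"
  end
end

# Programmatic equivalent of: rake "demo:download_new_zips[/tmp/etd_zips]"
Rake::Task['demo:download_new_zips'].invoke('/tmp/etd_zips')
```

In shells such as zsh, the square brackets usually need to be quoted or escaped so the shell does not treat them as a glob pattern.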

@dolsysmith
Contributor

Confirming that the tasks perform as expected: ETD metadata is updated to reflect the filenames on S3, and only ETDs whose filenames don't match those already in the Hyrax data store are downloaded.

Would it be worth including (as a step in the documentation) that one should delete the files from the tmp folder after a successful Bulkrax import? It's not strictly necessary, but they will otherwise tend to accumulate there.

@kerchner
Member Author

Thanks, @dolsysmith. I updated the Bulkrax ingest documentation on the wiki to include a reminder to clean up the ETD zips and the Bulkrax manifest before doing a new download and ingest.

@kerchner kerchner merged commit db08b80 into master Dec 15, 2024
1 check passed
Linked issue: Create task to discover new ETDs in S3, and copy to folder for creation of bulkrax manifest