Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

T343 schoolie #514

Merged
merged 23 commits into from
Mar 7, 2024
Merged

T343 schoolie #514

merged 23 commits into from
Mar 7, 2024

Conversation

kerchner
Copy link
Member

@kerchner kerchner commented Feb 29, 2024

This PR includes a rake task to migrate existing data, so it is recommended to test not only on:

  1. A clean instance, but also:
  2. An instance with existing ETDs, as well as at least one non-ETD Work. These ETDs should (as all current ETDs do) have resource_type set to Thesis or Dissertation and should have degree populated. Populate at least one ETD with a masters flavored degree, and at least one ETD with a doctoral flavored degree. Ensure that at least one of the ETDs has 2 or more keywords populated.

To Test

On an instance with existing ETDs

See prerequisite data above.

  1. Run the gwss:enumerate_degree_types rake task and observe that the task prints out unique degree values, with a count of each
  2. Run the gwss:reassign_etd_resource_types task and observe that:
  • the rake task's messages contained a "reassigned" message for each ETD, and no "degree name not found" messages.
  • view in the UI that every ETD's resource_type was reassigned to either Master's Thesis or Dissertation

On a new instance

  1. Create an ETD that is has resource_type set to Master's Thesis, and an ETD with resource_type set to Dissertation. Ensure that at one of them has 2 or more keywords.
  2. Create at least one non-ETD work.

On both new and migrated instances

  1. View the Page Source for an ETD. Observe that the <head> tag contains a block similar to this:
<!-- Schoolie for Google Scholar metadata (replaces :gscholar_meta) -->
<meta name="citation_institution" value="George Washington University">
<meta name="citation_title" value="Test ETD">
<meta name="citation_author" value="Kerchner, Daniel">
<meta name="citation_type" value="Master&#39;s Thesis">
<meta name="dc.type" value="Master&#39;s Thesis">
<meta name="citation_keywords" value="keyword1">
<meta name="citation_keywords" value="keyword2">
<meta name="citation_pdf_url" value="http://ec2-35-91-19-226.us-west-2.compute.amazonaws.com/downloads/6969z076v">
  1. Scrape the URL from the citation_pdf_url meta tag above and confirm that the browser downloads (and does not display in-browser) the expected PDF.

  2. View the Page Source for a non-ETD work. Observe that <head> tag contains a block similar to this:

<!-- Schoolie for Google Scholar metadata (replaces :gscholar_meta) -->

with none of the other tags shown in the previous step.

Checking the sitemap

  1. Confirm that the schoolie:sitemap rake task creates the sitemap and that it looks similar to this:
<?xml version="1.0"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>https://scholarspace.library.gwu.edu/etd/9306sz28s</loc>
    <lastmod>2024-02-29T16:59:42Z</lastmod>
  </url>
  <url>
    <loc>https://scholarspace.library.gwu.edu/etd/3069xz28s</loc>
    <lastmod>2024-02-29T15:56:00Z</lastmod>
  </url>
</urlset>

Note that the URL is going to be scholarspace.library.gwu.edu not your local server.

  1. Adjusting for the URL, confirm that the URLs lead to the landing pages of each ETD (not the download URLs)
  2. Confirm that the lastmod time seems to correspond to the time the work was last modified
  3. Confirm that non-ETDs are not included in the sitemap. I know that's weird, but right now we're trying to avoid ticking off Google Scholar because we have "mixed" content - scholarly and non-scholarly. Hopefully this is temporary.
  4. Confirm that the sitemap is regenerated by the scheduled sitemap regeneration cron job. Perhaps delete it first to confirm, or you can rely on the timestamp. Either wait until tomorrow, or adjust the schedule as per T461 sitemap setup #487

Known issues

I cannot seem to get the sitemap generation to trigger using the gwss:sitemap_queue_generate rake task. The only difference is that I've replaced the contents of SitemapRegenerateJob to call the new rake task instead. Should I be concerned?

Thoughts on the implementation

I am not crazy about having added scholarly? into the presenters rather than the work models, but I couldn't find a clean way to be able to ask the work's scholarly? from the HTML. The reason that I think ultimately it belongs on the work, is so that we might add more logic to the function that can answer true/false perhaps depending on the values of certain properties. Since we don't yet have such a requirement, this is something we can refactor later.

Any feedback on the code/implementation is most welcome.

@kerchner kerchner requested a review from alepbloyd February 29, 2024 17:10
app/gw_work.rb Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kerchner Should this file be here? There's a copy in /app/models/gw_work.rb which makes sense, but not sure why we would want this in the /app root folder.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it should not. I've removed it.

@alepbloyd
Copy link
Contributor

I ran the rake tasks on the backup data I have (which is from like september-ish 2023). I was able to run the gwss:enumerate_degree_types task and got a list of results, but there are these two categories that look like they have empty strings or null values.
image

Digging back through the logs, and I did find this Degree name not found line.
image

Probably related to that, when I ran the gwss:reassign_etd_resource_types task - it ran fine for probably about 15 minutes or so, and then I got a NoMethodError: undefined method 'upcase' for nil:NilClass for line 341 in the task.
image

Might need to add in a way to handle these few works with null values in the field, since it looks like we have those in prod as well:
image

@kerchner
Copy link
Member Author

kerchner commented Mar 4, 2024

@alepbloyd Got it, I'll fix to skip those (leave as Thesis or Dissertation) but identify them. This will allow us to target those three ETDs in production to manually correct their degree values as needed, and set resource_type accordingly.

Will keep you posted when I update the rake task.

@kerchner
Copy link
Member Author

kerchner commented Mar 5, 2024

@alepbloyd Please re-test gwss:reassign_etd_resource_types

@alepbloyd
Copy link
Contributor

@kerchner

update on testing this:

I re-restored my scholarspace instance for the backups I have, and then ran the gwss:enumerate_degree_types task - worked fine. After that, I started the gwss:reassign_etd_resource_types task. It took about 4.5 hours to run on the backup data, but didn't get any errors or anything. Looking through the UI, all looks good.

I did run into a situation where two or three minutes after the task finished, I seemed to get kicked out of my EC2 server and couldn't reconnect until I went into the AWS console and rebooted it. Now I can connect again fine, so not really sure what that was about - since it ran fine for hours and didn't have any issue until after the fact.

I still need to look at the sitemap aspects of this, aiming to finish today/tomorrow.

@alepbloyd
Copy link
Contributor

@kerchner

  • Confirmed that I am seeing the tags for ETDs in the page source, looks like it worked with multiple keywords as well, and not seeing them for non-ETDs.
  • For the citation_pdf_url, I did need to change it from http to https, though I would imagine that's just something with my SSL setup on my dev server.
  • Regenerated the sitemap using bundle exec rails schoolie:sitemap RAILS_ENV=production on my server, confirmed that looks correct and the landing page links are working (with URL replaced).
    • We can actually see the ETDs that didn't have their Resource Type updated, because they are the only ones in the sitemap where the doesn't correspond with the time I ran the reassign_resource_types task.
  • Confirmed that non-ETDs are not appearing in the sitemap.
  • I didn't have any issues when I ran the gwss:sitemap_queue_generate task. Seemed to work fine for me, and logs point to it running on Sidekiq, so that seems correct.

I just deleted the sitemap, and will check back on it tomorrow morning to verify if it regenerated as per the cron job.

So beyond that one pending aspect, everything seems to be working!

@kerchner
Copy link
Member Author

kerchner commented Mar 6, 2024

Thanks @alepbloyd ! Keep me posted on the sitemap.

@kerchner
Copy link
Member Author

kerchner commented Mar 6, 2024

I'm going to remove sitemap from Gemfile and also remove config/sitemap.rb. We probably will want to add back more pages to the sitemap in the near future, but not until we've succeeded in having Google Scholar index ETDs.

@alepbloyd alepbloyd self-requested a review March 7, 2024 15:02
Copy link
Contributor

@alepbloyd alepbloyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@kerchner kerchner merged commit 9783136 into master Mar 7, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants