Adds exact journal matches to historical analysis #65

Merged
merged 1 commit into from
Jul 31, 2024

Conversation

JPrevost
Member

Why are these changes being introduced:

  • Understanding our ability to detect search intent over time is core to our ability to know how we are doing

Relevant ticket(s):

How does this address that need:

  • Adds new field to track exact journal matches to Metrics::Algorithms
  • Updates Metrics::Algorithms to run journal exact matches in addition to the existing StandardIdentifer matches. This included a refactor to better support multiple match types.
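A rough illustration of what "supporting multiple match types" can look like, written as plain Ruby rather than the actual `Metrics::Algorithms` code. Every name here (`JOURNAL_TITLES`, `MATCHERS`, `count_matches`) is hypothetical, and the journal list and PMID pattern are assumptions made only for the example:

```ruby
# Hypothetical sketch: none of these names come from the TACOS codebase.
require 'set'

# Stand-in for a table of known journal titles (assumed data).
JOURNAL_TITLES = Set['nature', 'science', 'cell'].freeze

# Each match type maps a search phrase to true or false.
# Adding a new match type means adding one entry here.
MATCHERS = {
  pmid: ->(phrase) { phrase.match?(/\bPMID:\s*\d+\b/i) },
  journal_exact: ->(phrase) { JOURNAL_TITLES.include?(phrase.strip.downcase) }
}.freeze

# Count, per match type, how many phrases that matcher detects.
def count_matches(phrases)
  MATCHERS.transform_values do |matcher|
    phrases.count { |phrase| matcher.call(phrase) }
  end
end

counts = count_matches(['Nature', 'PMID: 12345', 'climate change'])
```

Note that a phrase such as "Nature" counts as a journal match here even though the user may have meant the common word, which is the ambiguity described under side effects.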

Document any side effects to this change:

Journal exact matching is not guaranteed to be an indicator of user search intent because Journal names are also common words in many cases. When we build out our validation workflows, we'll be able to understand what percentage of these types of matches are definitely Journals and what percentage is ambiguous. We can likely update our algorithm to drop some of the more ambiguous detections at that point.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-65 July 24, 2024 15:21 Inactive
@JPrevost JPrevost requested review from matt-bernhardt and jazairi and removed request for matt-bernhardt July 24, 2024 16:58
@jazairi jazairi self-assigned this Jul 30, 2024
Contributor

@jazairi jazairi left a comment

Exciting feature, thanks for this! While I agree that the current iteration of exact title matching may not be predictive of user intent, I think this data will help us better understand what our users are searching for (and how to refine the algorithm).

@@ -38,6 +39,11 @@ class Algorithms < ActiveSupport::TestCase
      assert aggregate.pmid == 1
    end

    test 'journal exact counts are included in monthly aggregation' do
      aggregate = Metrics::Algorithms.new.generate(DateTime.now)
      assert aggregate.journal_exact == 1
Contributor

I worry a bit that this type of testing will lead to similar issues as we've seen in ETD (i.e., adding a fixture breaks a bunch of tests). Not requesting a change, partly because I don't have any good suggestions and partly because I'd love to get this ticket to done.

Member Author

Yeah, you are definitely correct. As more fixtures are added, this test could fail.

I don't have a great idea that doesn't involve just not using fixtures or reimplementing the algorithm that calculates matches in the test itself... neither of which are really acceptable.

Thanks for pointing this out as I do think it's important for us to acknowledge when these "future" problems are being introduced.
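For reference, one shape that is sometimes used to loosen this kind of coupling is to assert on a change in the count rather than on an absolute value. The following is only a plain-Ruby sketch of that idea under stated assumptions: `JOURNALS`, `FIXTURE_PHRASES`, and `journal_exact_count` are hypothetical stand-ins, not the TACOS models or the `Metrics::Algorithms` API.

```ruby
# Hypothetical stand-ins; not the TACOS models or Metrics::Algorithms API.
JOURNALS = ['nature', 'science'].freeze

# Plays the role of shared fixtures that other tests may add to over time.
FIXTURE_PHRASES = ['Nature', 'climate change'].freeze

# Count phrases that exactly match a known journal title (case-insensitive).
def journal_exact_count(phrases)
  phrases.count { |phrase| JOURNALS.include?(phrase.strip.downcase) }
end

# Brittle style: an absolute expectation that shifts whenever fixtures change.
#   journal_exact_count(FIXTURE_PHRASES) == 1
#
# Delta style: measure, add records owned by this test, then measure again.
before = journal_exact_count(FIXTURE_PHRASES)
with_new_records = FIXTURE_PHRASES + ['Science', 'quantum computing']
after = journal_exact_count(with_new_records)
delta = after - before
```

Because `delta` depends only on the records the test itself adds, a new journal fixture would raise `before` and `after` together. In a Rails test the same shape might be expressed with `assert_difference`; whether that fits here depends on how `generate` stores its aggregates, which this thread does not show.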

Why are these changes being introduced:

* Understanding our ability to detect search intent over time is
  core to our ability to know how we are doing

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-54

How does this address that need:

* Adds new field to track exact journal matches to Metrics::Algorithms
* Updates Metrics::Algorithms to run journal exact matches in addition
  to the existing StandardIdentifer matches. This included a refactor
  to better support multiple match types.

Document any side effects to this change:

Journal exact matching is not guaranteed to be an indicator of user
search intent because Journal names are also common words in many cases.
When we build out our validation workflows, we'll be able to understand
what percentage of these types of matches are definitely Journals and
what percentage is ambiguous. We can likely update our algorithm to drop
some of the more ambiguous detections at that point.
@JPrevost JPrevost force-pushed the tco54-historical-snapshot-journals branch from e43ae42 to 1e0fdc1 Compare July 31, 2024 12:44
@JPrevost JPrevost merged commit eec3780 into main Jul 31, 2024
1 check passed
@JPrevost JPrevost deleted the tco54-historical-snapshot-journals branch July 31, 2024 12:46
3 participants