Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add automatic fingerprinting for Term records #138

Merged
merged 9 commits into from
Dec 13, 2024
Merged

Conversation

matt-bernhardt
Copy link
Member

@matt-bernhardt matt-bernhardt commented Nov 13, 2024

This adds a TermFingerprint model, which allows for clustering of terms by a shared fingerprint. Full details are in the commit message.

Developer

Ticket(s)

This started as an experiment that kept seeming promising, so I didn't create a ticket for it. Now that I'm writing the PR text, I think maybe I missed an opportunity to do this - but am not going to paper over that decision now without checking in with the project team.

https://mitlibraries.atlassian.net/browse/TCO-124

Accessibility

  • ANDI or Wave has been run in accordance to our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

YES migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-138 November 13, 2024 21:50 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-138 November 13, 2024 21:59 Inactive
@JPrevost JPrevost self-assigned this Nov 14, 2024
jazairi added a commit that referenced this pull request Nov 22, 2024
Why these changes are being introduced:

We currently use fingerprinting for suggested resources. We are planning
to expand this functionality to terms to support clustering, and we
also need to refactor suggested resource fingerprinting to accommodate
multiple phrases/fingerprints per suggested resource. This felt like
a good opportunity to reevaluate our fingerprinting implementation.

Relevant ticket(s):

* [TCO-74](https://mitlibraries.atlassian.net/browse/TCO-74)
* [Add automatic fingerprinting for Term records (unticketed -- link to PR)](#138)

How this addresses that need:

This ADR proposes that we add a central `Fingerprint` model, linked
to `Terms`, which is leveraged by `SuggestedResource` and any other
detectors that require fingerprinting.

EngX had discussed this as a possible approach in a recent meeting.
Consensus on this ADR would confirm the decision.

Side effects of this change:

See ADR for details.
@jazairi jazairi mentioned this pull request Nov 22, 2024
15 tasks
Copy link
Member

@JPrevost JPrevost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking:

  • I'd like to see some unit tests for the fingerprint algorithm.

Suggestions (non-blocking):

  • rename TermFingerprint to Fingerprint
  • move fingerprinting logic out of Term and into Fingerprint (I'm nearly blocking on this so please consider it strongly and if you don't think it is the correct path let's chat so I can understand why)
  • use assert_nil


assert_equal term_count, Term.count
assert_equal fingerprint_count - 1, TermFingerprint.count
assert_instance_of NilClass, target_term.term_fingerprint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to consider assert_nil here:
https://ruby-doc.org/3.0.5/gems/minitest/Minitest/Assertions.html#method-i-assert_nil

Both work, but I think the more explicit assertion is slightly more efficient to mentally process (for me at least)

# This is similar to the SuggestedResource fingerprint method, with the exception that it also replaces " with "
# during its operation. This switch may also need to be added to the SuggestedResource method, at which point they can
# be abstracted to a helper method.
def calculate_fingerprint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any tests that confirm this logic works as expected.

You can likely snag them from https://github.com/MITLibraries/tacos/blob/main/test/models/detector/suggested_resource_test.rb
as we'll be refactoring and removing those tests soon to use the Term fingerprints.

This feels too important to hide behind "we don't test private methods" to me. If you don't test it directly (which is fine for a private method), we still should test the effects of it by testing the generated fingerprints on a Term (although it's probably slightly easier to just test it directly).

I'm also curious if this might better belong in the Fingerprint model itself. It feels slightly odd to have a model dedicated to fingerprints but have the actual logic in Term.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thankfully, ending up with an instance method for this logic means that we can't hide behind "we don't test private methods" anymore. Either way, I agree that tests are needed here - not sure why I didn't write them initially.

I've tried to make clear in the test structure that there's a test for every line in the method, to make sure we keep focus on its components rather than trying to manage one all-encompassing example. Hopefully future-us will continue to keep up if this logic changes.

# created_at :datetime not null
# updated_at :datetime not null
#
class TermFingerprint < ApplicationRecord
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have decided to only have one set of stored fingerprints, it feels worth considering whether to name this Fingerprint instead of TermFingerprint. This is non-blocking, but if you choose to not make this change I'd like to hear why you prefer it this way.

As noted elsewhere, I'd prefer if we can move the logic to calculate a fingerprint to this model rather than keep it in Term.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have thoughts about naming the class versus naming the field? Using the same name for both feels like it would get awkward, particularly from the perspective of the Term model. term.fingerprint feels like the method to call to get the calculated string, but if the class is also Fingerprint then things get a bit weird.

For now I'm using .fingerprint_value as the Term method, but I'm not sure that's going to be the case when it gets to code review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'm particularly interested in your feedback about is the naming being used here. Having a Fingerprint model, and a fingerprint field, feels awkward, particularly from the Term perspective. I've tried to resolve this by adding the .fingerprint_value construct that delegates down to the Fingerprint model, and allows the Term to just get access to the string value somehow.

That said, I don't think this gets used anywhere yet - so maybe it would makes sense to remove it until the UI or something leverages it.

@matt-bernhardt
Copy link
Member Author

All of those suggestions look good to me, thanks - I'll try to get these wrapped up either today or Monday.

@matt-bernhardt
Copy link
Member Author

My question about moving TermFingerprint to Fingerprint (which makes sense on paper) is whether I should retcon the two migrations in this PR to just use that name, or instead create new migrations to drop this table and create the new one. I suspect that the answer is to retcon, and just re-create the review app for this and advise everyone to rollback / migrate carefully - but that feels like something to confirm with you and Adam.

@JPrevost
Copy link
Member

JPrevost commented Dec 9, 2024

@matt-bernhardt I'd go with retcon. Make sure to rollback your local database before doing that or the retconned tables will never go away for you. And yeah, for the PR build just make a new one to avoid any confusion.

@matt-bernhardt matt-bernhardt marked this pull request as draft December 10, 2024 21:44
@matt-bernhardt
Copy link
Member Author

I'm putting this into draft mode while I work on this refactoring.

@matt-bernhardt matt-bernhardt marked this pull request as ready for review December 11, 2024 21:52
@matt-bernhardt
Copy link
Member Author

Okay @JPrevost - I think the fingerprinting PR is once again ready for a review. I think I've picked up the things you asked for changes on, and I think things are better as a result. Let me know what you think?

add_reference :terms, :fingerprint, foreign_key: true

# Seed the relationship between Terms and Fingerprints
Term.all.each do |t|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This took a very long time and then failed locally for me with a large set of Terms loaded.

I'd like to load the full set of Terms/SearchEvents into stage before we merge to confirm we won't have issues in prod. That is likely blocked by stage db size so I'll pull that ticket to unblock this work.

For now I'll send you the error I saw via Slack and I suggest you load the full set from prod (you have to load in batches) so you can try to replicate the problem I ran into. Loading in batches by month is how I've approached this as we can't export the full set at this time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, I want to consider not doing this large data process in a migration (I know, I know) and instead making it a rake task we run after migration. If I understand this work, fingerprint is not required and thus the migration should be fine without it and we can have a task that regenerates all fingerprints to run manually after this work lands both to generate the full set of fingerprints initially but also in case our fingerprint algorithm ever changes this would allow us to regenerate them.

Copy link
Member

@JPrevost JPrevost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: I'd like us to consider Fingerprint.value instead of Fingerprint.fingerprint. It's not great, but it feels slightly less confusing.

Also non-blocking, but consider whether the default state in fixtures of Terms having no Fingerprints feels correct to you.

If you don't intend to make either of those changes just let me know and I'll approve as the code works as needed. Thanks!

# Table name: fingerprints
#
# id :integer not null, primary key
# fingerprint :string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we used value here? It is not ideal, but it feels less awkward than Term.fingerprint.fingerprint and doesn't hide behind the alias.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems worth trying, yeah - I'll take a run at it, and see if I can make it work.

test/fixtures/terms.yml Show resolved Hide resolved
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-138 December 13, 2024 18:50 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-138 December 13, 2024 19:02 Inactive
** Why are these changes being introduced:

After watching our search traffic for a few months, we have seen some
terms come in that are clearly related:

* 'Scientific American'
* 'scientific american'
* '&quot;Scientific american&quot;'

While there are valid reasons to store the term exactly as the user has
submitted it - there are also good reasons to standardize these values,
to look for clusters and related terms.

** Relevant ticket(s):

n/a

** How does this address that need:

This defines a TermFingerprint model, which is related to the Term model
via a belongs_to relationship (a Term belongs to its TermFingerprint,
and a TermFingerprint can have many Terms). The migration to define this
model includes a migration to populate records for Terms we've already
received.

The Term model gets two lifecycle hooks, one to add new fingerprints for
every term, and another to delete a TermFingerprint if no remaining
Terms reference it.

Beyond the methods needed to handle these creations and deletions in as
seemless a way as possible, we also add a .cluster method to the Term
model, which will return an array of all related Terms (but not the Term
itself - so a Term that has a unique fingerprint will return an empty
array). This method will be used as part of a future inspection UI.

Building the models in this way will also allow for querying based on
shared fingerprints - for example by adding a :has_many_terms scope
on the TermFingerprint model.

** Document any side effects to this change:

Two things:

1. A quirk of this implementation is that it is possible to delete a
TermFingerprint record, which does not delete the related Terms (and
thus also SearchEvents, Detections, and Categorizations). The Term will
then have a null fingerprint. The next time the Term is saved, its
fingerprint will be regenerated. No other operation should be impacted
by this arrangement.

For the life of me I can't anticipate why we might delete a fingerprint
- but the relationship needs to be optional to avoid further pain when
working in the console.

2. The process for calculating this TermFingerprint value is similar -
but not identical - to that for handling the SuggestedResource records.
The difference is that the TermFingerprint method removes "&quot;"
sequences - which might need to be added to SuggestedResource. Both
methods should probably be abstracted out to a shared helper method,
honestly.
I noticed this when running make lint - not sure how it slipped
past my notice when working on the confirmation work.
Rename migration to get date sort correct

Run migrations
** Why are these changes being introduced:

It makes sense that the Fingerprint model is where the logic to
calculate a fingerprint value is maintained, and not in the Term model.

** Relevant ticket(s):

code review

** How does this address that need:

This moves the fingerprint logic from the Term to Fingerprint model.
Because of this, the Term model changes to use its new location (from
inside the lifecycle hook where the Fingerprint record gets created).

We use an instance method for this, because at the time of use there is
not yet a Fingerprint record that could call the method internally.

We also copy-paste the tests for this method from the SuggestedResource
implementation (which should be removed in a future PR).

** Document any side effects to this change:

I can think of two possible side effects:
1. the SuggestedResource implementation of the fingerprint is now very
much duplicative, and should be removed (coming in a future ticket)

2. It is a bit awkward to rely on an instance method for calculating
the fingerprint value. In an ideal case, register_fingerprint would
use something like:

self.fingerprint = Fingerprint.find_or_create_by(phrase)

... and then the Fingerprint model would:
a. receive the phrase argument
b. calculate the fingerprint for this phrase
c. look up to see if such a record exists already, and return it
d. create the record if none exists, and return _that_

A custom initialize method feels like it would work for this, but I
think that's considered an antipattern in Rails?

For now, I think this works.
This comments out the initial calculation of Fingerprint records from
the migrations in this PR. A following commit will add a rake task that
will focus on this work.

A side effect of this work is that we need to change the fixtures used
for our test suite, to match the state the migrations will produce. This
in turn causes some tests to change, because some tests will require
additional setup work.

More complex tests now have code comments to indicate this sort of
sequencing.
Includes a very rudimentary test, and finally removing the (commented out) block from the db migration.
Three tests need to be updated to account for the different starting
position.
Until now, we've been building the Fingerprint model to include a field
also named "fingerprint" for the actual string value of the fingerprint.
This removes that duplication by changing the field name to be "value".

As a result of updating the migration with this new terminology, we also
update the alias / delegation up to the Term model, so terms can still
find the string via the term.fingerprint_value attribute.

The fixtures use the updated field name, and some references to the new
field name (in application code and in tests) are updated.
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-138 December 13, 2024 19:16 Inactive
@matt-bernhardt
Copy link
Member Author

@JPrevost I've added two commits this afternoon, changing the fingerprint field to value, and updating the state of the fixtures. I feel pretty good about both changes - not "this is perfect", but "this is, on balance, better than it was". It's ready for your review.


validates :value, uniqueness: true

alias_attribute :fingerprint_value, :value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non blocking. I'm not sure this alias will be useful. Term.fingerprint.value vs Term.fingerprint_value. If you like the ergonomics of that that definitely keep it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I think I like the ergonomics of having it directly on the Term model. This may turn out to be something we remove later, which would be fine, but I'd rather deal with that down the road at this point.

new_record = {
value: Fingerprint.calculate(phrase)
}
self.fingerprint = Fingerprint.find_or_create_by(new_record)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non blocking. You may intentionally be making the hash, but I wanted to share it is not necessary and you could instead just pass the key and value directly without the intermediary object creation.

self.fingerprint = Fingerprint.find_or_create_by(value: Fingerprint.calculate(phrase))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This two-step process was initially a reaction to a one-step process that used some variation on the word "fingerprint" about half a dozen times, which bordered on the nonsensical to read even 30 seconds later.

While I agree that this isn't necessary, I'm inclined to keep it in this format, at least for now. Future us can refactor if it continues to bug us.

@matt-bernhardt matt-bernhardt merged commit beae973 into main Dec 13, 2024
6 checks passed
@matt-bernhardt matt-bernhardt deleted the term-fingerprints branch December 13, 2024 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants