Add automatic fingerprinting for Term records #138
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# frozen_string_literal: true | ||
|
||
# == Schema Information | ||
# | ||
# Table name: fingerprints | ||
# | ||
# id :integer not null, primary key | ||
# value :string | ||
# created_at :datetime not null | ||
# updated_at :datetime not null | ||
# | ||
class Fingerprint < ApplicationRecord | ||
has_many :terms, dependent: :nullify | ||
|
||
validates :value, uniqueness: true | ||
|
||
alias_attribute :fingerprint_value, :value | ||
|
||
# This is similar to the SuggestedResource fingerprint method, with the exception that it also replaces &quot; with " | ||
# during its operation. This switch may also need to be added to the SuggestedResource method, at which point they can | ||
# be abstracted to a helper method. | ||
def self.calculate(phrase) | ||
modified = phrase | ||
modified = modified.strip | ||
modified = modified.downcase | ||
modified = modified.gsub('&quot;', '"') # This line does not exist in SuggestedResource implementation. | ||
modified = modified.gsub(/\p{P}|\p{S}/, '') | ||
modified = modified.to_ascii | ||
modified = modified.gsub(/\p{P}|\p{S}/, '') | ||
tokens = modified.split | ||
tokens = tokens.uniq | ||
tokens = tokens.sort | ||
tokens.join(' ') | ||
end | ||
end |
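To try the normalization steps outside the application, here is a minimal standalone sketch. It is not the code in this PR. `to_ascii` is not part of core Ruby, so the sketch substitutes the built-in `unicode_normalize(:nfkd)` followed by stripping non-ASCII characters. That substitution is an assumption and may transliterate some characters differently. The method name `sketch_fingerprint` is hypothetical.

```ruby
# Standalone approximation of Fingerprint.calculate (hypothetical helper name).
# Assumption: NFKD normalization plus non-ASCII stripping stands in for to_ascii.
def sketch_fingerprint(phrase)
  modified = phrase.strip.downcase
  modified = modified.gsub('&quot;', '"')       # decode the quote entity before stripping punctuation
  modified = modified.gsub(/\p{P}|\p{S}/, '')   # drop punctuation and symbols
  modified = modified.unicode_normalize(:nfkd)  # split accented letters into base letter + mark
  modified = modified.gsub(/[^[:ascii:]]/, '')  # stand-in for to_ascii: drop what is not ASCII
  modified = modified.gsub(/\p{P}|\p{S}/, '')   # second pass, mirroring the original method
  modified.split.uniq.sort.join(' ')            # unique tokens in sorted order
end

puts sketch_fingerprint('  Super cool search!  ')  # prints "cool search super"
puts sketch_fingerprint('search: COOL, super')     # prints "cool search super"
```

Both inputs collapse to the same string, which is what lets differently worded searches share one Fingerprint record.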
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,21 +7,40 @@ | |
# | ||
# Table name: terms | ||
# | ||
# id :integer not null, primary key | ||
# phrase :string | ||
# created_at :datetime not null | ||
# updated_at :datetime not null | ||
# flag :boolean | ||
# id :integer not null, primary key | ||
# phrase :string | ||
# created_at :datetime not null | ||
# updated_at :datetime not null | ||
# flag :boolean | ||
# fingerprint_id :integer | ||
# | ||
class Term < ApplicationRecord | ||
has_many :search_events, dependent: :destroy | ||
has_many :detections, dependent: :destroy | ||
has_many :categorizations, dependent: :destroy | ||
has_many :confirmations, dependent: :destroy | ||
belongs_to :fingerprint, optional: true | ||
|
||
before_save :register_fingerprint | ||
after_destroy :check_fingerprint_count | ||
|
||
scope :user_confirmed, -> { where.associated(:confirmations).distinct } | ||
scope :user_unconfirmed, -> { where.missing(:confirmations).distinct } | ||
|
||
# The fingerprint_value method returns the calculated fingerprint string from the related Fingerprint record. In the | ||
# rare case where no Fingerprint record exists, this method returns nil. | ||
delegate :fingerprint_value, to: :fingerprint, allow_nil: true | ||
|
||
# The cluster method returns an array of all Term records which share a fingerprint with the current term. The term | ||
# itself is not returned, so if a term has no related records, this method returns an empty array. | ||
# | ||
# @note In the rare case when a Term has no fingerprint, this method returns nil. | ||
# | ||
# @return [Array<Term>, nil] | ||
def cluster | ||
fingerprint&.terms&.filter { |rel| rel != self } | ||
end | ||
|
||
# The record_detections method is the one-stop method to call every Detector's record method that is defined within | ||
# the application. | ||
# | ||
|
@@ -68,6 +87,22 @@ def calculate_categorizations | |
|
||
private | ||
|
||
# The register_fingerprint method is called before a Term record is saved, so that every saved Term has a | ||
# related Fingerprint record. | ||
def register_fingerprint | ||
new_record = { | ||
value: Fingerprint.calculate(phrase) | ||
} | ||
self.fingerprint = Fingerprint.find_or_create_by(new_record) | ||
Non-blocking. You may intentionally be making the hash, but I wanted to share it is not necessary and you could instead just pass the key and value directly without the intermediary object creation.
This two-step process was initially a reaction to a one-step process that used some variation on the word "fingerprint" about half a dozen times, which bordered on the nonsensical to read even 30 seconds later. While I agree that this isn't necessary, I'm inclined to keep it in this format, at least for now. Future us can refactor if it continues to bug us. |
||
end | ||
|
||
# This is called during the after_destroy hook. If removing that term means that its fingerprint is now abandoned, | ||
# then we destroy the fingerprint too. In the rare case when a Term does not have a fingerprint, this method does not | ||
# cause problems because of the safe operators in the conditional. | ||
def check_fingerprint_count | ||
fingerprint.destroy if fingerprint&.terms&.count&.zero? | ||
end | ||
|
||
# This method looks up all current detections for the given term, and assembles their confidence scores in a format | ||
# usable by the calculate_categorizations method. It exists to transform data like: | ||
# [{3=>0.91}, {1=>0.1}] and [{3=>0.95}] | ||
|
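The clustering idea behind `Term#cluster` can be sketched without a database. The example below is only an illustration under simplifying assumptions: phrases are plain strings rather than Term records, `simplified_fingerprint` is a hypothetical cut-down normalizer (no entity decoding or ASCII folding), and `cluster_for` is a hypothetical helper rather than anything in this PR.

```ruby
# Hypothetical, simplified normalizer: lowercase, strip punctuation, sort unique tokens.
def simplified_fingerprint(phrase)
  phrase.downcase.gsub(/\p{P}|\p{S}/, '').split.uniq.sort.join(' ')
end

# Rough analogue of Term#cluster: every other phrase sharing the same fingerprint.
def cluster_for(phrase, all_phrases)
  target = simplified_fingerprint(phrase)
  all_phrases.select { |other| other != phrase && simplified_fingerprint(other) == target }
end

phrases = ['Hello, world!', 'world hello', 'Nature Medicine']

p cluster_for('Hello, world!', phrases)    # prints ["world hello"]
p cluster_for('Nature Medicine', phrases)  # prints []
```

Unlike this sketch, the method in the PR returns nil rather than an empty array when a Term has no Fingerprint at all, so callers there still need to handle nil.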
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
class CreateFingerprints < ActiveRecord::Migration[7.1] | ||
def change | ||
create_table :fingerprints do |t| | ||
t.string :value, index: { unique: true, name: 'unique_fingerprint' } | ||
t.timestamps | ||
end | ||
end | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
class AddFingerprintToTerms < ActiveRecord::Migration[7.1] | ||
def up | ||
add_reference :terms, :fingerprint, foreign_key: true | ||
end | ||
|
||
def down | ||
remove_reference :terms, :fingerprint, foreign_key: true | ||
end | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# frozen_string_literal: true | ||
|
||
namespace :fingerprints do | ||
# generate will create (or re-create) fingerprints for all existing Terms. | ||
desc 'Generate fingerprints for all terms' | ||
task generate: :environment do |_task| | ||
Rails.logger.info("Generating fingerprints for all #{Term.count} terms") | ||
|
||
Term.find_each.with_index do |t, index| | ||
t.save | ||
Rails.logger.info("Processed #{index}") if (index % 1000).zero? | ||
end | ||
end | ||
end |
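The progress check in the task above fires on every multiple of 1000. A small sketch of that cadence, using a plain range in place of `Term.find_each` and a hypothetical helper named `progress_checkpoints`, shows which indexes would be logged. Because `with_index` counts from zero, the first log line appears after the very first record.

```ruby
# Hypothetical helper: collect the indexes at which a progress line would be logged.
def progress_checkpoints(total, every = 1000)
  checkpoints = []
  (1..total).each.with_index do |_item, index|
    checkpoints << index if (index % every).zero?  # same cadence as the integer-division check
  end
  checkpoints
end

p progress_checkpoints(2500)  # prints [0, 1000, 2000]
```

If counting from one is preferred, `with_index(1)` shifts the checkpoints so the first log line appears at the thousandth record.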
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
# == Schema Information | ||
# | ||
# Table name: fingerprints | ||
# | ||
# id :integer not null, primary key | ||
# value :string | ||
# created_at :datetime not null | ||
# updated_at :datetime not null | ||
# | ||
cool: | ||
value: cool search super | ||
|
||
hi: | ||
value: hello world | ||
|
||
pmid_38908367: | ||
value: '2024 38908367 activation aging al and cell dna et hallmarks hs methylation multiple pmid shim targets tert' | ||
|
||
lcsh: | ||
value: 'geology massachusetts' | ||
|
||
issn_1075_8623: | ||
value: '10758623' | ||
|
||
doi: | ||
value: '101016jphysio201012004' | ||
|
||
isbn_9781319145446: | ||
value: '11th 2016 9781319145446 al biology d e ed et freeman h hillis isbn life m of sadava science the w' | ||
|
||
journal_nature_medicine: | ||
value: 'medicine nature' | ||
|
||
suggested_resource_jstor: | ||
value: 'jstor' | ||
|
||
multiple_detections: | ||
value: '103389fpubh202000014 32154200 a air and doi environmental frontiers health impacts in of pmid pollution public review' | ||
|
||
citation: | ||
value: '12 2 2005 2007 6 a accessed altun available context current dec education experience httpcieedasueduvolume6number12 hypertext in issues july language learners no of on online reading serial the understanding vol web' |
Non-blocking. I'm not sure this alias will be useful: `Term.fingerprint.value` vs `Term.fingerprint_value`. If you like the ergonomics of that, then definitely keep it.
For now, I think I like the ergonomics of having it directly on the `Term` model. This may turn out to be something we remove later, which would be fine, but I'd rather deal with that down the road at this point.
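The two call styles discussed in this thread can be compared in plain Ruby. The sketch below uses hypothetical `Struct` stand-ins (`SketchFingerprint`, `SketchTerm`) instead of the ActiveRecord models, and safe navigation in place of `delegate ... allow_nil: true`. It only illustrates the nil handling and is not the PR's implementation.

```ruby
# Hypothetical stand-in for the Fingerprint model, with the alias from the PR.
SketchFingerprint = Struct.new(:value) do
  alias_method :fingerprint_value, :value
end

# Hypothetical stand-in for the Term model.
SketchTerm = Struct.new(:phrase, :fingerprint) do
  # Approximates `delegate :fingerprint_value, to: :fingerprint, allow_nil: true`.
  def fingerprint_value
    fingerprint&.fingerprint_value
  end
end

with_print = SketchTerm.new('Hello, world!', SketchFingerprint.new('hello world'))
without_print = SketchTerm.new('orphan', nil)

p with_print.fingerprint.value     # prints "hello world"
p with_print.fingerprint_value     # prints "hello world"
p without_print.fingerprint_value  # prints nil
```

The delegated form returns nil for a term with no fingerprint, whereas the chained `fingerprint.value` form would raise NoMethodError in that case unless the caller adds `&.` itself.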