Add automatic fingerprinting for Term records #138

Merged
merged 9 commits into from
Dec 13, 2024
35 changes: 35 additions & 0 deletions app/models/fingerprint.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# frozen_string_literal: true

# == Schema Information
#
# Table name: fingerprints
#
# id :integer not null, primary key
# value :string
# created_at :datetime not null
# updated_at :datetime not null
#
class Fingerprint < ApplicationRecord
  has_many :terms, dependent: :nullify

  validates :value, uniqueness: true

  alias_attribute :fingerprint_value, :value
Member

Non-blocking. I'm not sure this alias will be useful: Term.fingerprint.value vs Term.fingerprint_value. If you like the ergonomics of that, then definitely keep it.

Member Author

For now, I think I like the ergonomics of having it directly on the Term model. This may turn out to be something we remove later, which would be fine, but I'd rather deal with that down the road at this point.


  # This is similar to the SuggestedResource fingerprint method, with the exception that it also replaces &quot; with "
  # during its operation. This switch may also need to be added to the SuggestedResource method, at which point they can
  # be abstracted to a helper method.
  def self.calculate(phrase)
    modified = phrase
    modified = modified.strip
    modified = modified.downcase
    modified = modified.gsub('&quot;', '"') # This line does not exist in SuggestedResource implementation.
    modified = modified.gsub(/\p{P}|\p{S}/, '')
    modified = modified.to_ascii
    modified = modified.gsub(/\p{P}|\p{S}/, '')
    tokens = modified.split
    tokens = tokens.uniq
    tokens = tokens.sort
    tokens.join(' ')
  end
end
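The normalization steps in `Fingerprint.calculate` can be tried outside Rails with a rough sketch like the one below. The method name `sketch_fingerprint` is hypothetical, and `to_ascii` (which is not core Ruby and presumably comes from a gem) is replaced here by an assumed stand-in: Unicode decomposition followed by dropping non-ASCII characters. The real method may transliterate differently.

```ruby
# Plain-Ruby sketch of the fingerprinting steps; no Rails or gems required.
# ASSUMPTION: unicode_normalize plus ASCII filtering stands in for to_ascii.
def sketch_fingerprint(phrase)
  modified = phrase.strip.downcase
  modified = modified.gsub('&quot;', '"')       # decode the HTML entity before stripping punctuation
  modified = modified.gsub(/\p{P}|\p{S}/, '')   # drop punctuation and symbols
  modified = modified.unicode_normalize(:nfkd)  # separate accents from their base letters
  modified = modified.gsub(/[^\x00-\x7F]/, '')  # stand-in for to_ascii: discard non-ASCII characters
  modified = modified.gsub(/\p{P}|\p{S}/, '')   # second pass, mirroring the original method
  modified.split.uniq.sort.join(' ')            # unique tokens in sorted order
end

puts sketch_fingerprint('Super cool search')     # cool search super
puts sketch_fingerprint('Super. Cool. Search.')  # cool search super
puts sketch_fingerprint('1075-8623')             # 10758623
```

Because tokens are de-duplicated and sorted, phrases that differ only in case, punctuation, or word order collapse to the same value.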
45 changes: 40 additions & 5 deletions app/models/term.rb
@@ -7,21 +7,40 @@
#
# Table name: terms
#
# id :integer not null, primary key
# phrase :string
# created_at :datetime not null
# updated_at :datetime not null
# flag :boolean
# id :integer not null, primary key
# phrase :string
# created_at :datetime not null
# updated_at :datetime not null
# flag :boolean
# fingerprint_id :integer
#
class Term < ApplicationRecord
  has_many :search_events, dependent: :destroy
  has_many :detections, dependent: :destroy
  has_many :categorizations, dependent: :destroy
  has_many :confirmations, dependent: :destroy
  belongs_to :fingerprint, optional: true

  before_save :register_fingerprint
  after_destroy :check_fingerprint_count

  scope :user_confirmed, -> { where.associated(:confirmations).distinct }
  scope :user_unconfirmed, -> { where.missing(:confirmations).distinct }

  # The fingerprint_value method returns the value field from the related Fingerprint record. In the
  # rare condition when no Fingerprint record exists, this method returns nil.
  delegate :fingerprint_value, to: :fingerprint, allow_nil: true

  # The cluster method returns an array of all Term records which share a fingerprint with the current term. The term
  # itself is not returned, so if a term has no related records, this method returns an empty array.
  #
  # @note In the rare case when a Term has no fingerprint, this method returns nil.
  #
  # @return array
  def cluster
    fingerprint&.terms&.filter { |rel| rel != self }
  end
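The three outcomes of `cluster` (related terms, an empty array, or nil) can be illustrated without a database. This is only a sketch: `SketchTerm` and `sketch_cluster` are hypothetical stand-ins, and a plain array replaces the `fingerprint.terms` association.

```ruby
# Hypothetical stand-in for Term; not the application's ActiveRecord model.
SketchTerm = Struct.new(:phrase, :fingerprint_value)

# Returns the other terms sharing a fingerprint, or nil when the term has none.
def sketch_cluster(term, all_terms)
  return nil if term.fingerprint_value.nil?

  all_terms.select do |other|
    # equal? compares object identity, so the term itself is excluded.
    other.fingerprint_value == term.fingerprint_value && !other.equal?(term)
  end
end

cool         = SketchTerm.new('Super cool search', 'cool search super')
cool_cluster = SketchTerm.new('Super. Cool. Search.', 'cool search super')
hi           = SketchTerm.new('hello world', 'hello world')
orphan       = SketchTerm.new('no fingerprint yet', nil)
all_terms    = [cool, cool_cluster, hi, orphan]

p sketch_cluster(cool, all_terms).map(&:phrase) # ["Super. Cool. Search."]
p sketch_cluster(hi, all_terms)                 # []
p sketch_cluster(orphan, all_terms)             # nil
```

Callers that iterate over the result need to handle the nil case separately from the empty array.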

  # The record_detections method is the one-stop method to call every Detector's record method that is defined within
  # the application.
  #
@@ -68,6 +87,22 @@ def calculate_categorizations

  private

  # The register_fingerprint method gets called before a Term record is saved, ensuring that Terms always have a
  # related Fingerprint record.
  def register_fingerprint
    new_record = {
      value: Fingerprint.calculate(phrase)
    }
    self.fingerprint = Fingerprint.find_or_create_by(new_record)
Member

Non-blocking. You may be building the hash intentionally, but I wanted to share that it is not necessary: you could instead pass the key and value directly, without creating the intermediary object.

self.fingerprint = Fingerprint.find_or_create_by(value: Fingerprint.calculate(phrase))

Member Author

This two-step process was initially a reaction to a one-step process that used some variation on the word "fingerprint" about half a dozen times, which bordered on the nonsensical to read even 30 seconds later.

While I agree that this isn't necessary, I'm inclined to keep it in this format, at least for now. Future us can refactor if it continues to bug us.

  end

  # This is called during the after_destroy hook. If removing that term means that its fingerprint is now abandoned,
  # then we destroy the fingerprint too. In the rare case when a Term does not have a fingerprint, this method does not
  # cause problems because of the safe operators in the conditional.
  def check_fingerprint_count
    fingerprint.destroy if fingerprint&.terms&.count&.zero?
  end
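The two callbacks together amount to "attach on save, clean up on destroy". A minimal in-memory sketch of that lifecycle follows. `SketchRegistry` is hypothetical: a hash of arrays stands in for the fingerprints table and its `terms` association, and no ActiveRecord behavior (transactions, uniqueness validation) is modeled.

```ruby
# Hypothetical in-memory registry; the real code relies on ActiveRecord callbacks.
class SketchRegistry
  def initialize
    # Looking up a missing fingerprint creates an empty bucket (find-or-create).
    @terms_by_fingerprint = Hash.new { |hash, key| hash[key] = [] }
  end

  # Mirrors before_save: find the fingerprint bucket or create it, then attach the term.
  def register(phrase, fingerprint_value)
    @terms_by_fingerprint[fingerprint_value] << phrase
  end

  # Mirrors after_destroy: drop the fingerprint once no terms reference it.
  def remove(phrase, fingerprint_value)
    return unless @terms_by_fingerprint.key?(fingerprint_value)

    @terms_by_fingerprint[fingerprint_value].delete(phrase)
    @terms_by_fingerprint.delete(fingerprint_value) if @terms_by_fingerprint[fingerprint_value].empty?
  end

  def fingerprints
    @terms_by_fingerprint.keys
  end
end

registry = SketchRegistry.new
registry.register('Super cool search', 'cool search super')
registry.register('Super. Cool. Search.', 'cool search super')

registry.remove('Super cool search', 'cool search super')
p registry.fingerprints # ["cool search super"]

registry.remove('Super. Cool. Search.', 'cool search super')
p registry.fingerprints # []
```

The fingerprint survives the first removal because one term still references it, and disappears only when the last term is gone.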

  # This method looks up all current detections for the given term, and assembles their confidence scores in a format
  # usable by the calculate_categorizations method. It exists to transform data like:
  # [{3=>0.91}, {1=>0.1}] and [{3=>0.95}]
8 changes: 8 additions & 0 deletions db/migrate/20241210185701_create_fingerprints.rb
@@ -0,0 +1,8 @@
class CreateFingerprints < ActiveRecord::Migration[7.1]
  def change
    create_table :fingerprints do |t|
      t.string :value, index: { unique: true, name: 'unique_fingerprint' }
      t.timestamps
    end
  end
end
9 changes: 9 additions & 0 deletions db/migrate/20241211195504_add_fingerprint_to_terms.rb
@@ -0,0 +1,9 @@
class AddFingerprintToTerms < ActiveRecord::Migration[7.1]
  def up
    add_reference :terms, :fingerprint, foreign_key: true
  end

  def down
    remove_reference :terms, :fingerprint, foreign_key: true
  end
end
12 changes: 11 additions & 1 deletion db/schema.rb

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

17 changes: 13 additions & 4 deletions docs/reference/classes.md
@@ -18,6 +18,7 @@ classDiagram
direction LR

Term --> SearchEvent : has many
Fingerprint --> Term : has many

Term "1" --> "1..*" Detection
Term "1" --> "0..*" Categorization
@@ -41,8 +42,15 @@ classDiagram
class Term
Term: id
Term: +String phrase
Term: combinedScores()
Term: recordDetections()
Term: calculate_categorizations()
Term: calculate_confidence(values)
Term: cluster()
Term: fingerprint()
Term: record_detections()

class Fingerprint
Fingerprint: id
Fingerprint: +String fingerprint

class SearchEvent
SearchEvent: +Integer id
@@ -111,17 +119,17 @@ classDiagram

namespace SearchActivity{
class Term
class Fingerprint
class SearchEvent
}

namespace KnowledgeGraph{
class Detectors
class Detector
class DetectorCategory
class Category
}

namespace Detectors {
class Detector
class DetectorJournal["Detector::Journal"]
class DetectorLcsh["Detector::Lcsh"]
class DetectorStandardIdentifier["Detector::StandardIdentifiers"]
@@ -136,6 +144,7 @@ classDiagram

style SearchEvent fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px;
style Term fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px;
style Fingerprint fill:#000,stroke:#66c2a5,color:#66c2a5,stroke-width:4px;

style Category fill:#000,stroke:#fc8d62,color:#fc8d62
style DetectorCategory fill:#000,stroke:#fc8d62,color:#fc8d62
14 changes: 14 additions & 0 deletions lib/tasks/fingerprints.rake
@@ -0,0 +1,14 @@
# frozen_string_literal: true

namespace :fingerprints do
  # generate will create (or re-create) fingerprints for all existing Terms.
  desc 'Generate fingerprints for all terms'
  task generate: :environment do |_task|
    Rails.logger.info("Generating fingerprints for all #{Term.count} terms")

    Term.find_each.with_index do |t, index|
      t.save
      Rails.logger.info("Processed #{index}") if index == (index / 1000) * 1000
    end
  end
end
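The progress check in the task relies on integer division: `index == (index / 1000) * 1000` holds only when `index` is a multiple of 1000. The sketch below, which uses a plain range in place of `Term.find_each` (an assumption made only for illustration), shows that the check behaves the same as the more common modulo form.

```ruby
# A plain range stands in for the records yielded by Term.find_each.with_index.
logged = []
(0...2500).each_with_index do |_record, index|
  logged << index if (index % 1000).zero? # modulo form of the same check
end
p logged # [0, 1000, 2000]

# The form used in the rake task selects exactly the same indexes.
original_check = (0...2500).select { |index| index == (index / 1000) * 1000 }
p original_check == logged # true
```

Note that the index starts at zero, so the first message is logged before any meaningful progress has been made.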
41 changes: 41 additions & 0 deletions test/fixtures/fingerprints.yml
@@ -0,0 +1,41 @@
# == Schema Information
#
# Table name: fingerprints
#
# id :integer not null, primary key
# value :string
# created_at :datetime not null
# updated_at :datetime not null
#
cool:
  value: cool search super

hi:
  value: hello world

pmid_38908367:
  value: '2024 38908367 activation aging al and cell dna et hallmarks hs methylation multiple pmid shim targets tert'

lcsh:
  value: 'geology massachusetts'

issn_1075_8623:
  value: '10758623'

doi:
  value: '101016jphysio201012004'

isbn_9781319145446:
  value: '11th 2016 9781319145446 al biology d e ed et freeman h hillis isbn life m of sadava science the w'

journal_nature_medicine:
  value: 'medicine nature'

suggested_resource_jstor:
  value: 'jstor'

multiple_detections:
  value: '103389fpubh202000014 32154200 a air and doi environmental frontiers health impacts in of pmid pollution public review'

citation:
  value: '12 2 2005 2007 6 a accessed altun available context current dec education experience httpcieedasueduvolume6number12 hypertext in issues july language learners no of on online reading serial the understanding vol web'
26 changes: 21 additions & 5 deletions test/fixtures/terms.yml
@@ -2,42 +2,58 @@
#
# Table name: terms
#
# id :integer not null, primary key
# phrase :string
# created_at :datetime not null
# updated_at :datetime not null
# flag :boolean
# id :integer not null, primary key
# phrase :string
# created_at :datetime not null
# updated_at :datetime not null
# flag :boolean
# fingerprint_id :integer
JPrevost marked this conversation as resolved.
#

cool:
  phrase: Super cool search
  fingerprint: cool

cool_cluster:
  phrase: Super. Cool. Search.
  fingerprint: cool

hi:
  phrase: hello world
  fingerprint: hi

pmid_38908367:
  phrase: 'TERT activation targets DNA methylation and multiple aging hallmarks. Shim HS, et al. Cell. 2024. PMID: 38908367'
  fingerprint: pmid_38908367

lcsh:
  phrase: 'Geology -- Massachusetts'
  fingerprint: lcsh

issn_1075_8623:
  phrase: 1075-8623
  fingerprint: issn_1075_8623

doi:
  phrase: '10.1016/j.physio.2010.12.004'
  fingerprint: doi

isbn_9781319145446:
  phrase: 'Sadava, D. E., D. M. Hillis, et al. Life The Science of Biology. 11th ed. W. H. Freeman, 2016. ISBN: 9781319145446'
  fingerprint: isbn_9781319145446

journal_nature_medicine:
  phrase: 'nature medicine'
  fingerprint: journal_nature_medicine

suggested_resource_jstor:
  phrase: 'jstor'
  fingerprint: suggested_resource_jstor

multiple_detections:
  phrase: 'Environmental and Health Impacts of Air Pollution: A Review. Frontiers in Public Health. PMID: 32154200. DOI: 10.3389/fpubh.2020.00014'
  fingerprint: multiple_detections

citation:
  phrase: "A. Altun, &quot;Understanding hypertext in the context of reading on the web: Language learners' experience,&quot; Current Issues in Education, vol. 6, no. 12, July, 2005. [Online serial]. Available: http://cie.ed.asu.edu/volume6/number12/. [Accessed Dec. 2, 2007]."
  fingerprint: citation
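The citation fixture shows why the `&quot;` replacement comes before punctuation stripping. If the entity is left in place, removing `&` and `;` leaves the letters `quot` fused onto neighboring words. The sketch below uses a hypothetical lambda, `crude_fingerprint`, that deliberately omits the entity step and the ASCII conversion, and a shortened version of the fixture phrase.

```ruby
# Deliberately simplified: no entity handling and no ASCII conversion.
crude_fingerprint = lambda do |text|
  text.downcase.gsub(/\p{P}|\p{S}/, '').split.uniq.sort.join(' ')
end

raw = 'A. Altun, &quot;Understanding hypertext&quot;'

# Without the replacement, "quot" pollutes two tokens.
puts crude_fingerprint.call(raw)
# a altun hypertextquot quotunderstanding

# Decoding the entity first yields clean tokens.
puts crude_fingerprint.call(raw.gsub('&quot;', '"'))
# a altun hypertext understanding
```

This matches the citation fingerprint fixture, which contains `hypertext` and `understanding` but no `quot` fragments.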