Calibrate search relevance #186

marlo-longley · 2022-11-09T23:24:27Z

This will take back and forth about what users / stakeholders are expecting to retrieve
Blocked by complete and final data ingest

marlo-longley · 2022-11-18T18:54:15Z

https://docs.google.com/spreadsheets/d/1GcQnsKyq_UEY4hboU4edmClLru0um1vpQNpxN6iN8CI/edit#gid=0
These are the top search terms in VT Spotlight

laurensorensen · 2022-12-09T21:00:09Z

> This will take back and forth about what users / stakeholders are expecting to retrieve
> Blocked by complete and final data ingest

@marlo-longley Is there anything I can do to help with this? It seems like since the data ingest is now complete and in prod, this might be able to be worked on?

marlo-longley · 2022-12-09T21:08:02Z

This is blocked by indexing fulltext unfortunately. @laurensorensen

marlo-longley · 2022-12-14T20:50:12Z

@laurensorensen -- work on this can begin now.

thatbudakguy · 2023-01-06T21:17:48Z

@laurensorensen if you have time, i think it would be great to check the top terms from spotlight against the current staging app to see if the result numbers seem relatively normal. if we need to substantially change the relevancy ranking, it would be good to know that sooner rather than later

marlo-longley · 2023-01-10T22:22:08Z

In the course of testing my reindexing (to check if H-2785 had been added), I entered "H-2785" into the fulltext search bar. I got 9,600 results, which seem to be matching on single characters. This is not great search relevance. Can we make a minimum number of characters for a match in Solr?

corylown · 2023-01-10T22:31:33Z

@marlo-longley We should check the mm setting. Typically you'd want both tokens (H and 2785) in a short query like this to exist in the document for a match.

marlo-longley · 2023-01-10T22:51:02Z

@corylown vt-arclight/solr/conf/solrconfig.xml (line 78 in ffb70fe) seems to have logic that isn't working.

corylown · 2023-01-10T23:34:42Z

@marlo-longley it could be helpful to look at the query debug info in solr to see what's going on. Maybe we can find some time to pair on it later this week.

If I'm understanding the docs right, 2<-1 5<-2 6<90% equates to:

if there are 1 or 2 clauses, all are required
if there are 3 to 5 clauses, all but one are required
if there are 6 clauses, all but two are required
if there are more than 6 clauses, 90% are required
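
For reference, here is a minimal sketch of how that mm spec might appear in solrconfig.xml, annotated with the reading above. The handler name, the defType, and the surrounding elements are assumptions for illustration, not a copy of the line linked earlier. Note that each < has to be escaped as &lt; inside an XML element.

```xml
<!-- Hypothetical excerpt: the handler name and defType are assumptions -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!--
      mm = 2<-1 5<-2 6<90%
        1 or 2 optional clauses: all are required
        3 to 5 clauses: all but one are required
        6 clauses: all but two are required (4 of 6)
        7 or more clauses: 90% are required, rounded down
    -->
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
  </lst>
</requestHandler>
```

Under this reading, a query that analyzes to exactly two clauses would need both to match, so an unexpectedly large result count may mean the analyzed query contains more clauses than the raw query string suggests.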

corylown · 2023-01-18T16:39:34Z

@marlo-longley I'm suspicious of this second application of the WordDelimiterGraphFilterFactory in the query analyzer for the identifier_match field. Maybe the matching behavior we're seeing is because the index and query analysis are different. We might also need to adjust the field boost.

corylown · 2023-01-19T15:45:36Z

@marlo-longley I think I found the reason why we're still seeing some unexpected matches when searching for identifiers like h-1303. This particular problem is caused by the ordering of the analysis chain in the text_en field type that is used by the title_tesim field. You can see the matching behavior by limiting your query to just that field, like this:
/solr/nta-arclight-prod/select?indent=true&q.op=OR&q=title_tesim%3Ah-1302

This is the current text_en field definition:

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.KeywordRepeatFilterFactory" />
        <filter class="solr.WordDelimiterGraphFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.ICUFoldingFilterFactory" />
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
        <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.KeywordRepeatFilterFactory" />
        <filter class="solr.WordDelimiterGraphFilterFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
        <filter class="solr.ICUFoldingFilterFactory" />
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldType>

Placing the KeywordRepeatFilterFactory before the WordDelimiterGraphFilterFactory has an undesirable effect on the positions of the duplicated tokens, which prevents the RemoveDuplicatesTokenFilterFactory from removing them in the final analysis step.

Notice the token positions:

And in the final step the duplicates are retained, because their positions prevent the RemoveDuplicatesTokenFilterFactory from recognizing them as duplicates.

If I move the KeywordRepeatFilterFactory to the step just before the KeywordMarkerFilterFactory, the token positions no longer shift and the RemoveDuplicatesTokenFilterFactory is able to remove the duplicate tokens. With this change, mm behaves more as expected and you get more precise results when searching for an id like h-1302.

WordDelimiterGraphFilterFactory placed before the KeywordRepeatFilterFactory

Notice that the duplicated token positions output by KeywordRepeatFilterFactory are now correct:

And in the final step the RemoveDuplicatesTokenFilterFactory is able to identify and correctly remove the duplicate tokens.

In my testing this fixes the undesirable identifier matching behavior we're seeing. It will take more investigation to determine if making this change has undesirable effects on other searches.
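
A sketch of what the reordered text_en definition could look like, with the KeywordRepeatFilterFactory moved to sit directly before the KeywordMarkerFilterFactory. Applying the move to both the index and the query analyzer is an assumption here; every other filter keeps its current relative order.

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- moved: previously the first filter after the tokenizer -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- moved: previously the first filter after the tokenizer (assumed to apply here too) -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the index-time analysis changes, existing documents would need to be reindexed before results reflect a change like this.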

marlo-longley · 2023-01-19T17:03:20Z

@corylown thank you for this super detailed investigation and write-up! This is where some sample searches or conversation with the PO could be helpful I believe -- in order to determine any undesirable effects from making this change, it would be good to verify that a defined group of searches remain as expected. I can talk with Lauren about this to determine priority.

laurensorensen · 2023-02-01T00:44:34Z

@marlo-longley @corylown Is there anything I can do to help with this ticket? Not sure I understand fully but happy to find a time to talk about it

thatbudakguy assigned laurensorensen Jan 6, 2023

This was referenced Jan 18, 2023:

Improve results when searching for H-* identifiers #479 (Closed)

Improve results when searching for unitids projectblacklight/arclight#1389 (Merged)
