Calibrate search relevance #186

Open
marlo-longley opened this issue Nov 9, 2022 · 13 comments

@marlo-longley
Collaborator

This will take back and forth about what users / stakeholders are expecting to retrieve
Blocked by complete and final data ingest

@marlo-longley
Collaborator Author

https://docs.google.com/spreadsheets/d/1GcQnsKyq_UEY4hboU4edmClLru0um1vpQNpxN6iN8CI/edit#gid=0
These are the top search terms in VT Spotlight

@laurensorensen
Collaborator

laurensorensen commented Dec 9, 2022

> This will take back and forth about what users / stakeholders are expecting to retrieve
> Blocked by complete and final data ingest

@marlo-longley Is there anything I can do to help with this? It seems like since the data ingest is now complete and in prod, this might be able to be worked on?

@marlo-longley
Collaborator Author

This is blocked by indexing fulltext unfortunately. @laurensorensen

@marlo-longley
Collaborator Author

@laurensorensen -- work on this can begin now.

@thatbudakguy
Member

@laurensorensen if you have time, i think it would be great to check the top terms from spotlight against the current staging app to see if the result numbers seem relatively normal. if we need to substantially change the relevancy ranking, it would be good to know that sooner rather than later

@marlo-longley
Collaborator Author

In the course of testing my reindexing (to check if H-2785 had been added), I entered "H-2785" into the fulltext search bar. I got 9,600 results, which seem to be matching on single characters. This is not great search relevance. Can we make a minimum number of characters for a match in Solr?

@corylown
Contributor

@marlo-longley We should check the mm setting. Typically you'd want both tokens (H and 2785) in a short query like this to exist in the document for a match.

@marlo-longley
Collaborator Author

marlo-longley commented Jan 10, 2023

@corylown The current mm setting is

    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>

but its logic doesn't seem to be working as expected.

@corylown
Contributor

corylown commented Jan 10, 2023

@marlo-longley it could be helpful to look at the query debug info in solr to see what's going on. Maybe we can find some time to pair on it later this week.

If I'm understanding the docs right, 2<-1 5<-2 6<90% equates to:

  • if there are 1 or 2 clauses, all are required
  • if there are 3 to 5 clauses, all but one are required
  • if there are 6 clauses, all but two are required
  • if there are more than 6 clauses, 90% are required

@corylown
Contributor

@marlo-longley I'm suspicious of this second application of the WordDelimiterGraphFilterFactory in the query analyzer for the identifier_match field. Maybe the matching behavior we're seeing is because the index and query analysis are different. We might also need to adjust the field boost.

[Screenshot: query analyzer for the identifier_match field]

@corylown
Contributor

corylown commented Jan 19, 2023

@marlo-longley I think I found the reason why we're still seeing some unexpected matches when searching for identifiers like h-1302. This particular problem is caused by the ordering of the analysis chain in the text_en field type, which is used by the title_tesim field. You can see the matching behavior by limiting your query to just that field, like this:
/solr/nta-arclight-prod/select?indent=true&q.op=OR&q=title_tesim%3Ah-1302

This is the current text_en field definition:

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.KeywordRepeatFilterFactory" />
        <filter class="solr.WordDelimiterGraphFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.ICUFoldingFilterFactory" />
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
        <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.KeywordRepeatFilterFactory" />
        <filter class="solr.WordDelimiterGraphFilterFactory"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
        <filter class="solr.ICUFoldingFilterFactory" />
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldType>

Placing the KeywordRepeatFilterFactory before the WordDelimiterGraphFilterFactory shifts the positions of the duplicated tokens, which prevents the RemoveDuplicatesTokenFilterFactory from removing them in the final analysis step.

Notice the token positions:
[Screenshot: token positions in the analysis output]

In the final step, the duplicates are retained because their positions prevent the RemoveDuplicatesTokenFilterFactory from recognizing them as duplicates.

[Screenshot: final analysis step with the duplicate tokens retained]

If I move the KeywordRepeatFilterFactory to the step before the KeywordMarkerFilterFactory, we don't see the shift in token positions and the RemoveDuplicatesTokenFilterFactory is able to remove the duplicate tokens. With this change, mm behaves more as expected and you get more precise results when searching for an id like h-1302.

WordDelimiterGraphFilterFactory placed before the KeywordRepeatFilterFactory:
[Screenshot: WordDelimiterGraphFilterFactory output]

Notice that the duplicated token positions output by KeywordRepeatFilterFactory are now correct:
[Screenshot: KeywordRepeatFilterFactory output]

And in the final step the RemoveDuplicatesTokenFilterFactory is able to identify and correctly remove the duplicate tokens.
[Screenshot: RemoveDuplicatesTokenFilterFactory output]

In my testing this fixes the undesirable identifier matching behavior we're seeing. It will take more investigation to determine if making this change has undesirable effects on other searches.
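
For reference, here is a minimal sketch of what the reordered text_en definition might look like, based only on the description above: KeywordRepeatFilterFactory moves from just after the tokenizer to just before KeywordMarkerFilterFactory. Applying the same move to both the index and query analyzers is an assumption, and every other filter keeps its current order. This is not necessarily the change that ends up being committed.

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- word-delimiter splitting now runs before any token is duplicated -->
    <filter class="solr.WordDelimiterGraphFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory" />
    <!-- moved: duplicate each token right before keyword marking and stemming -->
    <filter class="solr.KeywordRepeatFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterGraphFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <filter class="solr.ICUFoldingFilterFactory" />
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- moved (assumed to mirror the index analyzer) -->
    <filter class="solr.KeywordRepeatFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>
```

Because the index analyzer changes, existing documents would need to be reindexed before the new behavior shows up in search results. The Analysis screen in the Solr admin UI is a convenient way to compare token positions before and after the move.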

@marlo-longley
Collaborator Author

@corylown thank you for this super detailed investigation and write-up! This is where some sample searches or conversation with the PO could be helpful I believe -- in order to determine any undesirable effects from making this change, it would be good to verify that a defined group of searches remain as expected. I can talk with Lauren about this to determine priority.

@laurensorensen
Collaborator

@marlo-longley @corylown Is there anything I can do to help with this ticket? Not sure I understand fully but happy to find a time to talk about it
