Implement exact query matching in Solr #1401

Open
5 tasks
jacobthill opened this issue Feb 3, 2022 · 1 comment

Comments

@jacobthill
Contributor

jacobthill commented Feb 3, 2022

As a DLME curator/user, I sometimes run a query and get back many false positives: documents containing similarly spelled words that are not relevant to my query. For example, when I search for “Masada” I expect records related to the place name Masada. The query returns 144 results (as of February 3, 2022). The first four contain the queried term; they are not relevant because they are modern records, but DLME can’t be faulted for that. The rest seem to contain other terms with a similar spelling, or nothing at all similar to my query, and none of them seem to be relevant. Terms in these records that are probably being matched include “Mashhadi” and “Mas’ada”. In some cases words spelled similarly to the query term might be relevant, so we may not want to turn off fuzzy matching. However, exact matching triggered by putting quotation marks around the query term would allow me to decide when I want precision and when I favor recall.

To do:

  • implement exact matching so that a query for "Masada" (with quotes) will only return records if the Solr document contains "Masada" spelled exactly as my query.
  • "Qur'an" OR "Koran" should only return records if the Solr document contains "Qur'an" OR "Koran" (note there is an open ticket for OR search, "Boolean OR not working in DLME search" #1352; we just want to make sure OR with exact match works as well).
  • "Persian Gulf" AND "slavery" should only return records if the Solr document contains both query terms spelled exactly as they were entered.
  • Queries without quotation marks should continue to work as they currently do.
  • None of the exact matches should be case sensitive.

Feel free to combine this with #1352 or break it out into additional tickets as you see fit.

@corylown
Contributor

corylown commented Mar 1, 2022

We should set aside concerns about Boolean operators for this issue, since that’s a separate feature in Solr and has been addressed via #1352.

As written, the search behavior described in this issue is not supported by Solr: namely, that terms/phrases in quotes should be matched against differently analyzed (more exact) fields than terms not in quotes. Quotes in Solr only control phrase matching, i.e. any match must include the quoted terms in the same sequence (although how strictly the term positions must line up is controllable via the qs param). The actual field analysis is the same with or without quotes.
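To illustrate the point about quotes, here is a minimal sketch of the request parameters for a quoted and an unquoted query. It assumes an edismax handler; the field names in `QF` are hypothetical placeholders, not the actual DLME configuration.

```python
from urllib.parse import urlencode

# Hypothetical field list; the real DLME qf is not shown in this issue.
QF = "title_text^10 all_text_en"


def solr_params(user_query: str, phrase_slop: int = 0) -> dict:
    """Build edismax request parameters for a user query."""
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": QF,           # same fields and analysis, quoted or not
        "qs": phrase_slop,  # slop allowed for phrases the user put in quotes
    }


unquoted = solr_params("Persian Gulf")
quoted = solr_params('"Persian Gulf"')

# The only difference between the two requests is the q value itself.
query_string = urlencode(quoted)
```

Both requests search the same `qf` fields. The quotes only turn the two terms into a phrase, and `qs` loosens or tightens how close together the phrase terms must be.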

In the DLME schema both the text and text_ws field types are minimally processed and should help provide good relevance ranking, since they are boosted above less exact field types that include stemmers, like text_en. This is typically a good strategy with Solr: it favors recall over precision, but boosts more exact matches higher in the ranking. The downside becomes apparent when there are no precise matches, in which case less precise matches are likely to appear at or near the top of the relevance-ranked results.
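A toy model of that trade-off, for illustration only: this is not Solr's scoring, `loose` is a crude stand-in for a stemmer, and the boost values and sample documents are made up.

```python
def loose(token: str) -> str:
    """Toy stand-in for a stemmer: lowercase and drop a trailing 's'."""
    token = token.lower()
    return token[:-1] if token.endswith("s") else token


def score(query: str, doc: str, exact_boost: float = 10.0,
          loose_boost: float = 1.0) -> float:
    """Add a large boost for an exact token match, a small one for a loose match."""
    tokens = doc.split()
    total = 0.0
    if query.lower() in (t.lower() for t in tokens):
        total += exact_boost
    if loose(query) in (loose(t) for t in tokens):
        total += loose_boost
    return total


def rank(query: str, docs: list) -> list:
    """Return matching docs, highest score first; non-matching docs are dropped."""
    scored = [(score(query, d), d) for d in docs]
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for s, d in ordered if s > 0]


docs = ["Persian manuscripts collection", "A manuscript from Cairo", "Unrelated map"]
with_exact = rank("manuscript", docs)
without_exact = rank("manuscript", [docs[0], docs[2]])
```

When an exact match exists it ranks first. When none exists, the loosely matched record is still returned and becomes the top result, which is the behavior described in the issue.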

There are other options, but they range from having serious UX drawbacks to being complicated to implement, so they require careful consideration.

  • Turn off stemming for all field types.
    This would require more precise querying from all users and may prove frustrating for people who are used to more forgiving search engines.

  • Provide some additional fielded search options in the drop-down to allow searchers to opt-in to more exact matching.
    To accomplish this we would add search field options such as “Everything (exact)” and “Title (exact)” to the UI. Thorough analysis would be required to ensure that we have fields indexed appropriately to support these options. We would need to define some new field sets in Solr (exact_qf, title_exact_qf, etc.). This would require a significant effort in analysis, execution, and testing to ensure we have all the right fields indexed in the right way and configured correctly to add this functionality -- and to ensure we have not broken existing search behavior. It would, however, provide an opt-in option for more precise matching without harming regular searches.

  • Further analysis of the underlying problem to see if there are other options.
    There may be some benefit to more fully understanding the underlying need expressed via this feature request. Is this for end-users? Is this something that’s more for site administrators and analysis purposes? There may be other solutions if we can more fully understand the problem we’d like to solve.
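For the second option above (opt-in exact search fields), a rough sketch of how a drop-down choice could select a field set. It assumes the field sets `exact_qf` and `title_exact_qf` are defined as request handler parameters and referenced through Solr local params with parameter dereferencing. The option keys and the `build_q` helper are hypothetical.

```python
# Hypothetical mapping from drop-down option to a Solr field-set parameter.
SEARCH_FIELDS = {
    "everything": None,                # keep the handler's default qf
    "everything_exact": "exact_qf",
    "title_exact": "title_exact_qf",
}


def build_q(user_query: str, search_field: str = "everything") -> str:
    """Prefix the query with local params that point at the chosen field set."""
    field_set = SEARCH_FIELDS[search_field]
    if field_set is None:
        # Regular searches are passed through unchanged.
        return user_query
    return f"{{!edismax qf=${field_set}}}{user_query}"


default_q = build_q("Masada")
exact_q = build_q("Masada", "everything_exact")
```

Because the default option leaves the query untouched, existing search behavior would not change. Only searchers who pick an exact option are routed to the more strictly analyzed fields.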
