Decision: Next steps for improving search
Thing | Info |
---|---|
Relevant features | Search |
Date started | 2022-04-04 |
Date finished | 2022-08? |
Decision status | done |
Summary of outcome | We decided to stick with Postgres... but as of 2022-11, we're looking at this again. |
We want to provide excellent resources & regulations search for our users. Our current technology, Postgres full text search, has limitations and is a bit fussy.
How can we better support the sophisticated types of searches that our users need to do?
Can we reduce the amount of work needed to support them?
- Results are granular at the level of a regulation section, which makes sense to our audience.
- Search query keywords are highlighted in bold on the results page, which is helpful for quickly deciding which result to click on.
- Searching for a regulation section citation number (like 433.112) reliably brings up that section as the first result, which is what people expect.
All of this applies to both regulation and resource (supplemental content and Federal Register documents) search. (See also Supplemental content search stories)
For users:
- Quoted phrases / phrase matching: when I search for a quoted phrase (such as "State Medicaid Manual"), the results feature sections that match that specific phrase.
- Plurals (stemming): when I search for terms that have equivalent meaning to me in singular or plural form (such as "person" and "persons"), search results include sections that have terms in both of those forms.
- Verb tenses (stemming): when I search for terms that have equivalent meaning to me in different forms of the verb (such as "assess" and "assessed"), search results include sections that have terms in all of those forms.
- Filter results by part, subpart, section: when I search for a term, I want to be able to restrict my search results to a specific chunk of the content.
- Filter results for resources & FR docs by category and subcategory
For the team:
- Log search terms so that we can analyze them to improve the system
Notes as of 2022-11-17:
We now have most of those things! We don't have "filter by part, subpart, section" yet. Other things we now want:
- More boolean operators: we currently only have "hospital AND clinic", but users have asked for the ability to search for "hospital OR clinic", "hospital -clinic" (NOT), etc.
Nice-to-haves:
- Ability to do autocomplete for common terms
- Ability to "annotate" a resource item with hidden keywords, to mitigate the downsides of only using titles and descriptions in our index for resources
- Our users would love full-text search for resources, but this would be a big piece of work
In research, our users have described the frustration of searching for a term they use frequently that does not match the term for that concept in the regulations.
We want to help our users navigate common synonyms and abbreviations. This does not need to happen within the search engine itself. For example, we could compile a thesaurus by hand and use it to suggest alternate search queries.
Note: as of 2022-11-17, we made this, and it works!
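As an illustration only (not the project's actual implementation), a hand-compiled thesaurus that suggests alternate queries could be sketched as below. The `THESAURUS` entries and the `suggest_alternates` helper are hypothetical names chosen for this example.

```python
import re
from typing import Dict, List

# Hypothetical hand-compiled thesaurus: a term users type, mapped to the
# wording they might try instead. Entries are illustrative only.
THESAURUS: Dict[str, List[str]] = {
    "fmap": ["federal medical assistance percentage"],
    "chip": ["children's health insurance program"],
}


def suggest_alternates(query: str, thesaurus: Dict[str, List[str]] = THESAURUS) -> List[str]:
    """Return alternate queries with one known term swapped for a synonym."""
    # Lowercase and collapse whitespace so lookups are case-insensitive.
    normalized = " ".join(query.lower().split())
    suggestions: List[str] = []
    for term, synonyms in thesaurus.items():
        # Match whole words only, so "chip" does not match "chips".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        if pattern.search(normalized):
            for synonym in synonyms:
                # A function replacement avoids backslash handling in the synonym.
                suggestions.append(pattern.sub(lambda _m, s=synonym: s, normalized))
    return suggestions


print(suggest_alternates("FMAP rates"))
# ['federal medical assistance percentage rates']
```

Because the suggestions are produced outside the search engine, this approach works with any search backend.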
What technology should we use? What do we need to do to support these use cases?
Related stories: "Search Results" Epic in JIRA.
We already have this in place. See Django full text search documentation + PostgreSQL textsearch documentation.
In April 2022, we were trying to figure out if Postgres full text search on AWS RDS would be sufficient for our needs, since we believed it would be the most efficient option.
In August 2022, we said:
We have fully embraced the built-in search functionality that Postgres offers. The Postgres documentation is good at explaining how the text search works. I would highly recommend reading through it (and even running the queries locally) to understand how search and lexeme parsing works. When implementing search we did make one kind of unobvious/arbitrary choice to use ts_rank for unquoted queries and ts_rank_cd for quoted queries. The differences are not super clear, but they are described on the documentation page linked above.
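The quoted/unquoted ranking choice described above can be sketched as follows. This is an assumption about the shape of the logic, not the project's code: `rank_settings` is a hypothetical helper, and the `cover_density` flag is included because Django's `SearchRank` accepts a `cover_density` argument that switches to `ts_rank_cd`.

```python
import re
from typing import Dict, Union

# A quoted phrase is any non-empty text between a pair of double quotes.
QUOTED_PHRASE = re.compile(r'"[^"]+"')


def rank_settings(query: str) -> Dict[str, Union[str, bool]]:
    """Pick a Postgres ranking function based on whether the query has a quoted phrase."""
    quoted = bool(QUOTED_PHRASE.search(query))
    return {
        # ts_rank_cd uses cover density, which takes into account how close
        # together the matching lexemes are; that suits phrase queries.
        "sql_function": "ts_rank_cd" if quoted else "ts_rank",
        "cover_density": quoted,
    }


print(rank_settings('"State Medicaid Manual"')["sql_function"])  # ts_rank_cd
print(rank_settings("hospital clinic")["sql_function"])          # ts_rank
```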
Pros:
- We already have it in place, and it supports most of the "functionality that we need" list above
- Commonly-used solution, not rare/unusual/specialized.
- We can customize the algorithm, such as applying regular expressions to parse out elements of queries, and weighting various aspects in various ways.
- Not expensive for CMS.
Cons:
- The custom dictionary and synonym features could help us, but AWS RDS doesn't enable using them.
- Algorithm is kind of opaque. We have some relevancy challenges, including for queries special to our content: citations (45 CFR 156; 447.31(b); 447 Subpart C; 1903(a); 76 FR 21949).
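One way to handle citation-style queries is to recognize them with regular expressions before they reach the ranking step, so they can be routed or weighted differently. The sketch below is illustrative: the pattern names and the `classify_citation` helper are hypothetical, and treating a form like 1903(a) as a statute-style citation is an assumption.

```python
import re
from typing import Optional

# Ordered list of (label, pattern). Labels are hypothetical names for this sketch.
CITATION_PATTERNS = [
    ("cfr_title_part", re.compile(r"\d+\s+CFR\s+\d+(\.\d+)?", re.IGNORECASE)),
    ("federal_register", re.compile(r"\d+\s+FR\s+\d+", re.IGNORECASE)),
    ("subpart", re.compile(r"\d+\s+subpart\s+[A-Z]+", re.IGNORECASE)),
    ("section", re.compile(r"\d+\.\d+(\([a-z0-9]+\))*", re.IGNORECASE)),
    # Assumption: digits followed by parenthesized levels is a statute-style citation.
    ("statute_section", re.compile(r"\d+(\([a-z0-9]+\))+", re.IGNORECASE)),
]


def classify_citation(query: str) -> Optional[str]:
    """Return the citation type if the whole query is a citation, else None."""
    text = " ".join(query.split())  # trim and collapse whitespace
    for label, pattern in CITATION_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return None


for example in ["45 CFR 156", "447.31(b)", "447 Subpart C", "1903(a)", "76 FR 21949", "hospital"]:
    print(example, "->", classify_citation(example))
```

A query that classifies as `section` could, for example, be matched against the section number directly instead of being ranked as free text.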
Notes:
- Want to figure out how to get additional boolean options. We have AND and phrase search, but our users have also asked for OR, and NOT would be helpful too. The built-in "websearch" option would support this, but it's unclear whether using it would help us overall.
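To make the boolean options concrete, here is a rough pure-Python approximation of how web-style input (OR, a leading `-` for NOT, quoted phrases) maps onto `tsquery` operators. It is only a sketch for discussion: `web_query_to_tsquery` is a hypothetical helper, and unlike Postgres's built-in `websearch_to_tsquery` it does no stemming or stop-word removal.

```python
import re

# Either an optionally negated quoted phrase, or a single whitespace-delimited word.
TOKEN = re.compile(r'(-?)"([^"]+)"|(\S+)')


def clean(term: str) -> str:
    """Lowercase a term and drop characters other than word characters and dots."""
    return re.sub(r"[^\w.]", "", term).lower()


def web_query_to_tsquery(query: str) -> str:
    """Translate web-style search input into a tsquery-style string."""
    result = ""
    use_or = False  # set when the previous token was the keyword OR
    for match in TOKEN.finditer(query):
        dash, phrase, word = match.groups()
        if phrase is not None:
            terms = [clean(t) for t in phrase.split()]
            negated = dash == "-"
        else:
            if word.lower() == "or":
                use_or = True
                continue
            negated = word.startswith("-")
            terms = [clean(word)]
        terms = [t for t in terms if t]
        if not terms:
            continue
        # Words inside a phrase must be adjacent: the <-> operator.
        operand = " <-> ".join(f"'{t}'" for t in terms)
        if negated:
            operand = f"!({operand})" if len(terms) > 1 else f"!{operand}"
        if result:
            result += (" | " if use_or else " & ") + operand
        else:
            result = operand
        use_or = False
    return result


print(web_query_to_tsquery("hospital OR clinic"))   # 'hospital' | 'clinic'
print(web_query_to_tsquery("hospital -clinic"))     # 'hospital' & !'clinic'
print(web_query_to_tsquery('"State Medicaid Manual" hospital'))
# 'state' <-> 'medicaid' <-> 'manual' & 'hospital'
```

In practice the built-in function would do this work inside Postgres; the sketch just shows which operators our users' requests correspond to.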
This is derived from Elasticsearch. Not entirely interoperable with Elasticsearch, but similar. Amazon documentation. "How to use Elasticsearch with Django" article (May 2019).
January 2022 research spike notes. We started trying it out, but it was going to take a fair bit of work to set up. We found a simpler way to do what we were looking for at the time (synonyms).
Pros:
- Managed service
- Fancy
- Interesting ranking options: "If a distinctive keyword appears more frequently in a document, BM-25 assigns a higher relevance score to that document...Learning to Rank is an open-source plugin that lets you use machine learning and behavioral data to tune the relevance of documents."
Cons:
- Complicated to set up
- Expensive
- May be overpowered for our needs
Pros:
- Could maybe work around the constraints of AWS RDS, such as not being able to do custom dictionaries
Cons:
- Annoying to maintain
- More expensive for CMS; when we inquired with our CMS partners about setting this up, they asked if we had investigated search.gov and other less-expensive options
- Not necessary, since we figured out alternative implementation for synonyms
Updated 2022-12-5
Pros:
- We wouldn't have to maintain everything ourselves.
- Our current tech lead is very familiar with this tool.
- Free to CMS.
Cons:
- We talked with the search.gov team in April 2022 and tried it out, but it was difficult to get it fully working.
- We don't know of a way to do search filters for regulations and resources (like: search part 436 for "hospital"), so we may not be able to do filtering.
Notes:
- It typically relies on the site being indexable by Bing, but we don't want the site to be fully indexable until it's fully public. It also has another non-Bing option that we may be able to use. We didn't try that.
- We learned that the non-Bing option is available either by using their API or by sending RSS feeds. With RSS feeds, we can first send one large feed for indexing, then set up a daily Lambda function to generate RSS feeds for any updated or new documents.
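A daily job for the RSS approach might look roughly like the sketch below, which builds a plain RSS 2.0 feed from pages updated since a cutoff. Everything here is an assumption for illustration: the `updated_since` and `build_rss_feed` helpers, the page dictionaries, and the example.com URLs are hypothetical, and we have not checked which feed fields search.gov actually requires.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from typing import Dict, List


def updated_since(pages: List[Dict], cutoff: datetime) -> List[Dict]:
    """Keep only the pages whose 'updated' timestamp is at or after the cutoff."""
    return [page for page in pages if page["updated"] >= cutoff]


def build_rss_feed(site_title: str, site_url: str, pages: List[Dict]) -> str:
    """Build a minimal RSS 2.0 document with one <item> per page."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Updated and new pages"
    for page in pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = page["title"]
        ET.SubElement(item, "link").text = page["url"]
        ET.SubElement(item, "guid").text = page["url"]
        # RSS uses RFC 822 style dates, which format_datetime produces.
        ET.SubElement(item, "pubDate").text = format_datetime(page["updated"])
    return ET.tostring(rss, encoding="unicode")


# Hypothetical data: in a real job this would come from the database.
now = datetime(2022, 12, 5, tzinfo=timezone.utc)
pages = [
    {"title": "Old page", "url": "https://example.com/old/", "updated": now - timedelta(days=10)},
    {"title": "New page", "url": "https://example.com/new/", "updated": now - timedelta(hours=2)},
]
daily_feed = build_rss_feed("Example site", "https://example.com/", updated_since(pages, now - timedelta(days=1)))
print(daily_feed)
```

The initial large feed would use the same builder with every page; the daily run only changes the cutoff.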
Can it match current functionality?
- Can it index our regulation text and provide helpful results?
- Our results are limited to subpart pages, so we can’t really compare them to the live site, because we need section-level results to have relevant results.
- We would need to give it section pages in the site map.
- We would need to give it the section pages in the RSS feed.
- Our HTML page titles only give the part number and subpart letter, not any keywords, which isn’t sufficient context for search results. We’d need to change this if we were sticking with this kind of search.
- In our current search results, we provide the part name and section name, which are both very helpful.
- If you search for a citation, does it return that citation as the top result?
- Not with subpart search - section search would probably work better.
- Can it do keyword search of the titles/names/etc in the supplemental content database (not the content of the documents)?
- We tried to get this working: we set up /sitemap.xml, which includes URLs like /supplemental_content/2216/ for individual pieces of supplemental content, so that these items could be indexed.
- We couldn't get it working.
- Can we retain our thesaurus/synonym-matching feature? (Screenshot above)
- Sure, assuming we integrated search.gov as a backend for our custom interface.
Can it fulfill other things we want?
- Can it index the contents of the documents linked in the supplemental content database?
- Not sure
- Would it allow for filtered searches of supplemental content, such as by category?
- Not sure
- What else does it do that could be helpful for us? (For example: automated tracking of top search queries, Federal Register document search, "Best Bets".)
- We already track queries in Google Analytics.
- It’s interesting that it automatically returns results from the CMS video channel and the Federal Register, but they aren’t very relevant - I’d rather simply incorporate any relevant videos and rules into our database so that we can make sure that all the results we’re providing via search are relevant.
- Our synonym/thesaurus feature is a bit like Best Bets, and we could expand on it without using Best Bets.
We'd need to do these things for a second iteration that would enable better testing:
- Index section pages (would mean that we’d have to revive section pages or create stub redirect pages)
- Our users are interested in section-level pages, and we've explored and tested UI options for incorporating them into our site (along with part and subpart view) - we just haven't prioritized that, because it's less important than other things ... like search relevance. :)
- Get indexing of supplemental content working
- Turn off video and FR results
We researched whether we could use search.gov to index the contents of resources. Notes.
TBD
Please note that all pages on this GitHub wiki are draft working documents, not complete or polished.
Our software team puts non-sensitive technical documentation on this wiki to help us maintain a shared understanding of our work, including what we've done and why. As an open source project, this documentation is public in case anything in here is helpful to other teams, including anyone who may be interested in reusing our code for other projects.
For context, see the HHS Open Source Software plan (2016) and CMS Technical Reference Architecture section about Open Source Software, including Business Rule BR-OSS-13: "CMS-Released OSS Code Must Include Documentation Accessible to the Open Source Community".
For CMS staff and contractors: internal documentation on Enterprise Confluence (requires login).