Decision: Next steps for improving search

Thing	Info
Relevant features	Search
Date started	2022-04-04
Date finished	2022-08?
Decision status	done
Summary of outcome	We decided to stick with Postgres... but as of 2022-11, we're looking at this again.

Background/context

We want to provide excellent resources & regulations search for our users. Our current technology, Postgres full text search, has limitations and is a bit fussy.

Core questions

How can we better support the sophisticated types of searches that our users need to do?

Can we reduce the amount of work needed to support them?

What we know

Helpful things about current search system

Results are granular at the level of a regulation section, which makes sense to our audience.
Search query keywords are highlighted in bold on the results page, which is helpful for quickly deciding which result to click on.
Searching for a regulation section citation number (like 433.112) reliably brings up that section as the first result, which is what people expect.

Functionality that we need

All of this applies to both regulation and resource (supplemental content and Federal Register documents) search. (See also Supplemental content search stories)

For users:

Quoted phrases / phrase matching: when I search for a quoted phrase (such as "State Medicaid Manual"), the results feature sections that match that specific phrase.
Plurals (stemming): when I search for terms that have equivalent meaning to me in singular or plural form (such as "person" and "persons"), search results include sections that have terms in both of those forms.
Verb tenses (stemming): when I search for terms that have equivalent meaning to me in different forms of the verb (such as "assess" and "assessed"), search results include sections that have terms in all of those forms.
Filter results by part, subpart, section: when I search for a term, I want to be able to restrict my search results to a specific chunk of the content.
Filter results for resources & FR docs by category and subcategory

For the team:

Log search terms so that we can analyze them to improve the system

Notes as of 2022-11-17:

We now have most of those things! We don't have "filter by part, subpart, section" yet. Other things we now want:

More boolean operators: we currently only have "hospital AND clinic", but users have asked for the ability to search for "hospital OR clinic", "hospital -clinic" (NOT), etc.

Nice-to-haves:

Ability to do autocomplete for common terms
Ability to "annotate" a resource item with hidden keywords, to mitigate the downsides of only using titles and descriptions in our index for resources
Our users would love full-text search for resources, but this would be a big piece of work

Related need: thesaurus

In research, our users have described the frustration of searching for a term they use frequently that does not match the term for that concept in the regulations.

We want to help our users navigate common synonyms and abbreviations. This does not need to happen within the search engine itself. For example, we could compile a thesaurus by hand and use it to suggest alternate search queries.

Note: as of 2022-11-17, we made this, and it works!
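
As an illustration of the "compile a thesaurus by hand and use it to suggest alternate search queries" idea, here is a minimal Python sketch. It is not our production code: the THESAURUS entries, the suggest_alternate_queries name, and the whole-word matching approach are assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical hand-compiled thesaurus: each key is a term a user might type,
# each value lists alternate terms to suggest. Entries are illustrative only.
THESAURUS: Dict[str, List[str]] = {
    "fmap": ["federal medical assistance percentage"],
    "federal medical assistance percentage": ["fmap"],
    "smm": ["state medicaid manual"],
}


def suggest_alternate_queries(
    query: str, thesaurus: Dict[str, List[str]] = THESAURUS
) -> List[str]:
    """Return alternate queries made by swapping known terms for their synonyms."""
    normalized = " ".join(query.lower().split())
    # Pad with spaces so that only whole words or whole phrases are matched.
    padded = f" {normalized} "
    suggestions: List[str] = []
    # Check longer thesaurus entries first so multi-word terms are suggested first.
    for term in sorted(thesaurus, key=len, reverse=True):
        if f" {term} " in padded:
            for alternate in thesaurus[term]:
                candidate = padded.replace(f" {term} ", f" {alternate} ").strip()
                if candidate != normalized and candidate not in suggestions:
                    suggestions.append(candidate)
    return suggestions
```

Because the suggestions are computed outside the search engine, this kind of helper works the same way regardless of which search technology sits behind it.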

Things we need to decide + options for them

What technology should we use? What do we need to do to support these use cases?

Related stories: "Search Results" Epic in JIRA.

Option: Existing Postgres full text search on AWS RDS

We already have this in place. See Django full text search documentation + PostgreSQL textsearch documentation.

In April 2022, we were trying to figure out if Postgres full text search on AWS RDS would be sufficient for our needs, since we believed it would be the most efficient option.

In August 2022, we said:

We have fully embraced the built-in search functionality that Postgres offers. The Postgres documentation is good at explaining how the text search works. I would highly recommend reading through it (and even running the queries locally) to understand how search and lexeme parsing works. When implementing search we did make one kind of unobvious/arbitrary choice to use ts_rank for unquoted queries and ts_rank_cd for quoted queries. The differences are not super clear, but they are described on the documentation page linked above.
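
To make the ts_rank / ts_rank_cd choice concrete, here is a hedged sketch of how a query builder might switch between them. Per the Postgres documentation, ts_rank scores by how often the matching lexemes occur, while ts_rank_cd uses cover density, which also takes the proximity of matching lexemes into account. The section_index table, the search_vector column, the function name, and the pairing with plainto_tsquery and phraseto_tsquery are assumptions for illustration, not a description of our actual code.

```python
from typing import Tuple


def build_ranked_search_sql(user_query: str) -> Tuple[str, Tuple[str, str]]:
    """Build a parameterized Postgres query, picking the rank function by query type."""
    stripped = user_query.strip()
    is_quoted = (
        len(stripped) >= 2 and stripped.startswith('"') and stripped.endswith('"')
    )
    if is_quoted:
        # Quoted phrase: phraseto_tsquery keeps word order, and ts_rank_cd
        # (cover density) rewards matches whose lexemes sit close together.
        tsquery_fn, rank_fn, text = "phraseto_tsquery", "ts_rank_cd", stripped[1:-1]
    else:
        # Unquoted keywords: plainto_tsquery ANDs the terms together, and
        # ts_rank scores by how often the matching lexemes occur.
        tsquery_fn, rank_fn, text = "plainto_tsquery", "ts_rank", stripped
    sql = (
        f"SELECT id, {rank_fn}(search_vector, {tsquery_fn}('english', %s)) AS rank "
        "FROM section_index "  # hypothetical table with a tsvector column
        f"WHERE search_vector @@ {tsquery_fn}('english', %s) "
        "ORDER BY rank DESC"
    )
    # The query text is passed as a parameter, never interpolated into the SQL.
    return sql, (text, text)
```

The returned SQL and parameters are in the shape a driver such as psycopg expects for cursor.execute(sql, params).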

Update 2022-11-17

Pros:

We already have it in place, and it supports most of the "functionality that we need" list above
Commonly-used solution, not rare/unusual/specialized.
We can customize the algorithm, such as applying regular expressions to parse out elements of queries, and weighting various aspects in various ways.
Not expensive for CMS.

Cons:

The custom dictionary and synonym features could help us, but AWS RDS doesn't enable using them.
Algorithm is kind of opaque. We have some relevancy challenges, including for queries specific to our content, such as citations (45 CFR 156; 447.31(b); 447 Subpart C; 1903(a); 76 FR 21949).
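
One way to approach the citation relevancy challenge is to recognize citation-shaped queries with regular expressions before they reach the ranking step, so they can be handled specially. The sketch below is a hypothetical illustration, not our production code: the pattern names, the classify_citation helper, and the guess that a query like 1903(a) is a statute section are all assumptions.

```python
import re
from typing import Dict, Optional

# Each entry pairs a hypothetical citation kind with a pattern for it.
CITATION_PATTERNS = [
    # "45 CFR 156" or "42 CFR 433.112"
    ("cfr", re.compile(r"(?P<title>\d+)\s+CFR\s+(?P<part>\d+)(?:\.(?P<section>\d+))?", re.I)),
    # "76 FR 21949"
    ("federal_register", re.compile(r"(?P<volume>\d+)\s+FR\s+(?P<page>\d+)", re.I)),
    # "447 Subpart C"
    ("subpart", re.compile(r"(?P<part>\d+)\s+Subpart\s+(?P<subpart>[A-Z]+)", re.I)),
    # "433.112" or "447.31(b)"
    ("section", re.compile(r"(?P<part>\d+)\.(?P<section>\d+)(?P<paragraphs>(?:\(\w+\))*)")),
    # "1903(a)": assumed here to be a statute section with a paragraph
    ("statute", re.compile(r"(?P<section>\d+)(?P<paragraphs>(?:\(\w+\))+)")),
]


def classify_citation(query: str) -> Optional[Dict[str, str]]:
    """Return the citation kind and its parts, or None for an ordinary keyword query."""
    text = query.strip()
    for kind, pattern in CITATION_PATTERNS:
        # fullmatch: the whole query must be a citation, not just contain one.
        match = pattern.fullmatch(text)
        if match:
            parts = {name: value for name, value in match.groupdict().items() if value}
            return {"kind": kind, **parts}
    return None
```

A query that classifies as a citation could then be routed to an exact lookup, with everything else falling through to full text search.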

Notes:

We want to figure out how to get additional boolean options. We have AND and phrase search, but our users have also asked for OR, and NOT would be helpful too. The built-in "websearch" option would support this, but it's unclear whether using it would help us overall.
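
For reference, the "websearch" option corresponds to the Postgres function websearch_to_tsquery (available in PostgreSQL 11 and later). A rough sketch of a query built around it follows. The section_index table, the search_vector column, and the helper name are hypothetical, and we have not evaluated how this would affect relevance.

```python
from typing import Tuple

# websearch_to_tsquery accepts search-engine-style input:
#   hospital clinic      both terms required (AND)
#   "state plan"         phrase match
#   hospital or clinic   either term (OR)
#   hospital -clinic     exclude a term (NOT)
WEBSEARCH_SQL = (
    "SELECT id, ts_rank(search_vector, query) AS rank "
    "FROM section_index, websearch_to_tsquery('english', %s) AS query "
    "WHERE search_vector @@ query "
    "ORDER BY rank DESC "
    "LIMIT %s"
)


def build_websearch_query(user_query: str, limit: int = 20) -> Tuple[str, Tuple[str, int]]:
    """Return parameterized SQL and parameters for a websearch-style query."""
    # Collapse runs of whitespace; the operators themselves are left untouched.
    cleaned = " ".join(user_query.split())
    if not cleaned:
        raise ValueError("empty search query")
    return WEBSEARCH_SQL, (cleaned, limit)
```

In Django, the equivalent is passing search_type="websearch" to SearchQuery (Django 3.1 and later), which would avoid hand-written SQL.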

Option: Amazon OpenSearch Service

Amazon OpenSearch Service is derived from Elasticsearch: similar, but not entirely interoperable with it. Amazon documentation. "How to use Elasticsearch with Django" article (May 2019).

January 2022 research spike notes. We started trying it out, but it was going to take a fair bit of work to set up. We found a simpler way to do what we were looking for at the time (synonyms).

Pros:

Managed service
Fancy
Interesting ranking options: "If a distinctive keyword appears more frequently in a document, BM-25 assigns a higher relevance score to that document...Learning to Rank is an open-source plugin that lets you use machine learning and behavioral data to tune the relevance of documents."

Cons:

Complicated to set up
Expensive
May be overpowered for our needs

Option: Running database off an EC2 instance

Pros:

Could maybe work around the constraints of AWS RDS, such as not being able to do custom dictionaries

Cons:

Annoying to maintain
More expensive for CMS; when we inquired with our CMS partners about setting this up, they asked if we had investigated search.gov and other less-expensive options
Not necessary, since we figured out an alternative implementation for synonyms

Option: Search.gov

Updated 2022-12-05

Pros:

We wouldn't have to maintain everything ourselves.
Our current tech lead is very familiar with this tool.
Free to CMS.

Cons:

We talked with the search.gov team in April 2022 and tried it out, but it was difficult to get it fully working.
We don't know of a way to filter search results for regulations and resources (like: search part 436 for "hospital"), so it may not support the filtering we need.

Notes:

It typically relies on the site being indexable by Bing, but we don't want the site to be fully indexable until it's fully public. It also has a non-Bing option that we may be able to use, but we didn't try it.
We learned that the non-Bing option is available either through their API or by sending RSS feeds. With RSS feeds, we can first send one large feed for initial indexing, then set up a daily Lambda function to generate RSS feeds for any updated or new documents.
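
The RSS approach could be sketched as follows, using only the Python standard library. This is a hypothetical outline rather than something we built: the function names, the item fields, and the example.gov URLs are made up, and search.gov's exact feed requirements would need to be checked against its documentation.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Iterable, List


def items_updated_since(items: Iterable[Dict], cutoff: datetime) -> List[Dict]:
    """Keep only items changed at or after the cutoff (for the daily incremental feed)."""
    return [item for item in items if item["updated"] >= cutoff]


def build_rss_feed(channel_title: str, channel_link: str, items: Iterable[Dict]) -> str:
    """Serialize items as an RSS 2.0 document with one <item> per page."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = f"Pages to index for {channel_title}"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # Reuse the URL as the unique identifier for the item.
        ET.SubElement(node, "guid").text = item["link"]
        ET.SubElement(node, "description").text = item["description"]
        # RSS 2.0 expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(node, "pubDate").text = format_datetime(item["updated"])
    return ET.tostring(rss, encoding="unicode")
```

The first large feed would pass every item to build_rss_feed. The daily job would first narrow the list with items_updated_since, using the time of the previous run as the cutoff.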

Questions we researched in April 2022

Can it match current functionality?

Can it index our regulation text and provide helpful results?
- Our results are limited to subpart pages, so we can’t really compare them to the live site, which returns section-level results. We need section-level results for them to be relevant.
  - We would need to give it section pages in the sitemap.
  - We would need to give it the section pages in the RSS feed.
- Our HTML page titles only give the part number and subpart letter, not any keywords, which isn’t sufficient context for search results. We’d need to change this if we were sticking with this kind of search.
  - In our current search results, we provide the part name and section name, which are both very helpful.
If you search for a citation, does it return that citation as the top result?
- Not with subpart search - section search would probably work better.
Can it do keyword search of the titles/names/etc in the supplemental content database (not the content of the documents)?
- We tried to get this working - we set up /sitemap.xml, which includes URLs like /supplemental_content/2216/ for individual pieces of supplemental content, so that these items could be indexed.
- We couldn't get it working.
Can we retain our thesaurus/synonym-matching feature? (Screenshot above)
- Sure, assuming we integrated search.gov as a backend for our custom interface.

Can it fulfill other things we want?

Can it index the contents of the documents linked in the supplemental content database?
- Not sure
Would it allow for filtered searches of supplemental content, such as by category?
- Not sure
What else does it do that could be helpful for us? (For example: automated tracking of top search queries, Federal Register document search, "Best Bets".)
- We already track queries in Google Analytics.
- It’s interesting that it automatically returns results from the CMS video channel and the Federal Register, but they aren’t very relevant - I’d rather simply incorporate any relevant videos and rules into our database so that we can make sure that all the results we’re providing via search are relevant.
- Our synonym/thesaurus feature is a bit like Best Bets, and we could expand on it without using Best Bets.

We'd need to do these things for a second iteration that would enable better testing:

Index section pages (would mean that we’d have to revive section pages or create stub redirect pages)
- Our users are interested in section-level pages, and we've explored and tested UI options for incorporating them into our site (along with part and subpart view) - we just haven't prioritized that, because it's less important than other things ... like search relevance. :)
Get indexing of supplemental content working
Turn off video and FR results

Additional info as of February 2023

We researched whether we could use search.gov to index the contents of resources. Notes.

Decision

TBD

Please note that all pages on this GitHub wiki are draft working documents, not complete or polished.

Our software team puts non-sensitive technical documentation on this wiki to help us maintain a shared understanding of our work, including what we've done and why. As an open source project, this documentation is public in case anything in here is helpful to other teams, including anyone who may be interested in reusing our code for other projects.

For context, see the HHS Open Source Software plan (2016) and CMS Technical Reference Architecture section about Open Source Software, including Business Rule BR-OSS-13: "CMS-Released OSS Code Must Include Documentation Accessible to the Open Source Community".

For CMS staff and contractors: internal documentation on Enterprise Confluence (requires login).
