
15‐full‐text‐search

djbpitt edited this page Dec 29, 2024 · 8 revisions

Recap

In the previous sections, you created a reading interface for the articles, first by adding links from the title page to your reading views, then by adding ancillary information to the model using fields and your base XML-annotated articles, and then by transforming that model into an HTML view page and XML view page. In this section, you’ll replace the titles list page entirely with a new discovery mechanism, one that, among other things, implements full-text search.

Goals

Search functionality represents a major aspect of many editions that aggregate sources. Textual discovery may be a primary goal of your edition, in which case you will want to think about which elements you plan to expose for discovery, sorting, and filtering relatively early in your process. Our search interface privileges the features described below, and deciding to support one set of search features may complicate or foreclose other kinds of discovery:

screenshot of the Ghost Hoax application search interface, the page includes a full text search box, a list of article links, both to article reading views and XML views, and a publisher facet for filtering

Above is a screenshot of the Ghost Hoax application’s search interface, on which the interface for the hoaXed application is based. Below we will discuss the features and the high-level approach to their implementation. Throughout this tutorial, we will make reference to the long-form, detailed documentation we wrote for the Ghost Hoax application, Facets and Fields in eXist-db.

Search features

The search parameters we implement are 1) full-text indexed search and 2) search by publisher.

Full-text searching enables content-based exploration, which lets users discover hoax stories associated with their own interests. For example, users who wish to see ghost hoaxes with police involvement can search terms like “police” or “law”. Full-text search with Lucene (the full-text search engine embedded in eXist-db) will also enable highlighting for users who click into and read an article, foregrounding the search terms for easier discovery in context. Information about publishers can be used for other purposes because each publisher implies details about proximity to the city, class appeal, frequency, and relative article density within the collection. Users can easily see, for example, that the collection heavily favors The Times, and can select (or exclude) that publication in their results if desired.
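The highlighting mentioned above can be sketched as follows. This is a minimal illustration, not the project's own code: it assumes a Lucene index on the <TEI> root element (as described in the Full-text search section) and uses eXist-db's util:expand() function, which returns a copy of a hit with each matched string wrapped in an <exist:match> element. The search term and the shape of the output are made up for the example.

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare namespace exist = "http://exist.sourceforge.net/NS/exist";

(: Assumption: a Lucene full-text index is configured on tei:TEI :)
let $term := "police"
for $hit in collection('/db/apps/hoaXed/data/hoax_xml')/tei:TEI[ft:query(., $term)]
(: util:expand() marks each matched string with an <exist:match> wrapper :)
let $expanded := util:expand($hit)
return
    (: Illustrative output only: one summary element per matching article :)
    <article uri="{base-uri($hit)}" matches="{count($expanded//exist:match)}"/>
```

A view could then style the <exist:match> elements (for example, with a background color) when it transforms the expanded article to HTML.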

This search interface does not include the date-by-decades and month-year facets we created for the original Hoax application because we wanted to keep the development process for this interface more limited and manageable. Date elements can be expressed in many ways in search, and you have probably used library search interfaces that provide options for dealing with date ranges. In that application we grouped publications by decade to show relative density and allow users to explore within a decade, when stories were likely to be related to one another. In other editions, developed for other purposes, more traditional ranges may be more appropriate to narrow search, rather than expand discovery.

Search overview

The interface for the hoaXed project supports a combination of searching by full text (e.g., “find all articles that contain the word ‘police’”) and searching by property value (e.g., “find all articles published in The Times”). These types of searches can be combined. Searching by full text or by property value is fast regardless of the size of the collection because eXist-db builds quick-access lookup index files as part of the installation process. The developer controls the index files (that is, the types of searches to be supported) with configuration options in the eXist-db collection.xconf index file.

Full-text search

Full-text searching is relatively straightforward: the developer specifies which elements (parts of the document) should be searchable and then creates a query form that eXist-db translates into a query that uses indexed lookups. As explained in the official eXist-db documentation at Configuring Indexes and Full Text Index, including a line like:

<text qname="tei:TEI"/>

as a child of the <lucene> element in collection.xconf tells eXist-db to construct a full-text index for all <TEI> elements, which lets us run a query to return, for example, all articles in the project (all of which have <TEI> root elements) that contain a particular string anywhere in their content. If we prefer to search more narrowly (for example, in the TEI <text> element but not the <teiHeader>) we can configure a full-text index only on <text>. eXist-db supports exact matching, fuzzy matching, substring matching, phrase matching, and other types of full-text retrieval, and we include a small informational overlay in the interface that shows a few of the options. In the image below we have moved the mouse over the “ⓘ” sign next to the text-input window, which has caused the informational overlay to appear:

screenshot of the search interface with the informational overlay displayed next to the text-input window
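The retrieval types listed above correspond to Lucene's query-parser syntax, which ft:query() accepts as its second argument. The sketch below runs four variants against the corpus and reports how many articles each one matches. The query strings are illustrative examples chosen for this sketch, and the counts depend on your data, so none are shown here.

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Each string uses Lucene's default query-parser syntax :)
let $queries := (
    'police',            (: exact term :)
    'polic*',            (: wildcard, matches words that begin with "polic" :)
    'police~',           (: fuzzy, matches terms similar to "police" :)
    '"police station"'   (: phrase, the words must be adjacent and in order :)
)
for $query in $queries
let $hits := collection('/db/apps/hoaXed/data/hoax_xml')/tei:TEI[ft:query(., $query)]
return
    <result query="{$query}" count="{count($hits)}"/>
```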

Faceted search

Faceted search is slightly more complex than full-text search, both conceptually and at the implementation level. Facets group indexed items by information that is defined in advance, at indexing time (when the application is installed or re-indexed). Anything a facet does can also be expressed in other ways: you can construct queries using fields and text search that yield the same output as a facet search. Because facets are defined at index time rather than query time, though, they perform more efficiently. Facets also offer drill-down features that can be helpful when you want to use search to narrow down results.
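To make the comparison concrete, here is a sketch of a query-time equivalent of a publisher facet: it groups articles by publisher with a plain XQuery group by clause, doing at query time the work that a facet does at index time. It assumes the publisher is recorded at descendant::tei:publicationStmt/tei:publisher, the same path this tutorial uses in its facet definition, and that taking the first publisher is acceptable when an article lists more than one.

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

for $article in collection('/db/apps/hoaXed/data/hoax_xml')/tei:TEI
(: Assumption: use the first publisher if an article has more than one :)
let $publisher-name :=
    normalize-space(($article/descendant::tei:publicationStmt/tei:publisher)[1])
group by $publisher := $publisher-name
(: After grouping, $article holds every article for this publisher :)
order by count($article) descending, $publisher
return
    <publisher name="{$publisher}" count="{count($article)}"/>
```

This version has to visit every article each time it runs, which is the cost that an index-time facet avoids.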

To implement a facet, we add a child <facet> element to the <text> element we defined in our full-text search section.

<text qname="tei:TEI">
  <facet 
    dimension="publisher" 
    expression="descendant::tei:publicationStmt/tei:publisher"/>
</text>

Because we added this facet to the configuration-file <text> element that matches the entire TEI root element, we will now be able to identify the documents by the publisher in the publication statement, and refer to them with the dimension "publisher".

To use this new identification for searching we can use the ft:query() and ft:facets() functions to refer to these pre-indexed references. In the code snippet below, we use the publisher facet to return all the publishers and the number of articles each published in this corpus.

xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";
(: An empty query matches every article and prepares the hits for ft:facets() :)
let $hits as element(tei:TEI)+ := 
    collection('/db/apps/hoaXed/data/hoax_xml')/tei:TEI[ft:query(.,())]
(: Map from publisher label to number of articles :)
let $facets := ft:facets($hits, "publisher")
return 
    <facet_test>{
        let $facet-elements := 
            map:for-each($facets, function($label, $count) {
                <facet>
                    <label>{$label}</label>
                    <count>{$count}</count>
                </facet>
            })
        for $facet-element in $facet-elements
        (: Cast to integer so counts sort numerically, not as strings :)
        order by xs:integer($facet-element/count) descending,
            $facet-element/label
        return $facet-element
    }</facet_test>

The results look like:

<facet_test>
    <facet>
        <label>Times, The</label>
        <count>8</count>
    </facet>
    <facet>
        <label>Penny Satirist, The</label>
        <count>4</count>
    </facet>
    <facet>
        <label>Leader, The</label>
        <count>3</count>
    </facet>
    <!-- more <facet> elements -->
</facet_test>

In order to retrieve facets, we need to first use the ft:query() function:

let $hits as element(tei:TEI)+ := 
  collection('/db/apps/hoaXed/data/hoax_xml')/tei:TEI[ft:query(.,())]

This XQuery expression first selects the root <TEI> elements of all articles in our corpus. The ft:query() function in the predicate says to filter the results and keep those articles (the first argument, the dot, refers to the current context) that contain the string specified in the second argument. Since the second argument is an empty sequence, this selects all articles. It might seem as if filtering to select all results has no effect because it does no filtering, but it does have a side effect: it primes eXist-db to use the ft:facets() function on the new $hits variable, so that:

let $facets := ft:facets($hits, "publisher")

organizes the selected files by publisher information. Although in this case we do not provide a search term for the full-text query, and therefore do no filtering, we could supply a string like “police” as the second argument to ft:query(). The facet counts would then cover only the articles that contain that term, grouped by publisher. When building out the full-text search, we will include the user's provided search term here, rather than leaving it empty.
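As a sketch of how a search term and a publisher selection can be combined, the query below first retrieves all articles that contain a term and then repeats the query with a drill-down on the publisher facet, using the options map that ft:query() accepts as a third argument. The term “police” and the label “Times, The” come from the examples on this page. The wrapper elements in the output are invented for this illustration and are not part of the hoaXed model.

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

let $term := "police"
(: Articles from any publisher that contain the term :)
let $hits :=
    collection('/db/apps/hoaXed/data/hoax_xml')/tei:TEI[ft:query(., $term)]
(: Same query, restricted to one value of the "publisher" facet dimension :)
let $options := map { "facets": map { "publisher": "Times, The" } }
let $times-hits :=
    collection('/db/apps/hoaXed/data/hoax_xml')/tei:TEI[ft:query(., $term, $options)]
return
    (: Illustrative wrapper elements only :)
    <search term="{$term}">
        <all-publishers count="{count($hits)}"/>
        <times-only count="{count($times-hits)}"/>
    </search>
```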

Implementation at the model level

In this section, we'll work with search.xql in the modules subdirectory of branch 15-full-text-search. We recommend following along with the tutorial there, as we will define what each part does, but we will not reproduce every line of code here.

In an effort to remain self-documenting, the search model offers a description of six behaviors (there are also view behaviors; see below), or possible outcomes from a search query:

  1. All facet possibilities are always shown, whether selected or not. Facets are multi-select, so zero-valued labels are shown because they can be added to the selection. (This is not the default behavior for eXist-db facets, which means that we implement it ourselves. By default eXist-db does not return anything at all for facet values (e.g., publishers) that don’t appear in the items selected from the corpus.)
  2. Results update after each change.
  3. Facet counts are formatted as two numbers, x/y, where x is the number of items selected by the search term for that publisher and y is the total number of items for that publisher (invariant).
  4. Whether the facet value is selected is indicated by maintaining the checkbox state when you change the full-text value and rerun the search.
  5. Normally returns articles that match the combination of search term and publisher facets.
  6. There are three situations that yield no hits: a) If the search term is not found in any documents (not just for selected facets), return an informative message. b) If the search term is not found in the selected documents, but appears in others, return an informative message. c) If no search term and no results because selected publishers and search term do not intersect, return an informative message.
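Behaviors 1 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in search.xql: it computes one facet map for the whole corpus (the y values) and one for the articles that match the search term (the x values), then walks the complete list of publishers so that publishers with no matches still appear with a count of zero. The variable names and the output element are invented for the example.

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

let $term := "police"
(: y values: every article, grouped by publisher :)
let $all-hits :=
    collection('/db/apps/hoaXed/data/hoax_xml')/tei:TEI[ft:query(., ())]
let $all := ft:facets($all-hits, "publisher")
(: x values: only the articles that contain the search term :)
let $term-hits :=
    collection('/db/apps/hoaXed/data/hoax_xml')/tei:TEI[ft:query(., $term)]
let $matching :=
    if (exists($term-hits)) then ft:facets($term-hits, "publisher") else map {}
(: Iterate over all publishers so that zero-valued labels are kept :)
for $label in map:keys($all)
let $y := $all($label)
(: A publisher missing from $matching yields an empty sequence, so fall back to 0 :)
let $x := ($matching($label), 0)[1]
order by $label
return
    <facet label="{$label}" count="{$x || '/' || $y}"/>
```

Iterating over the keys of the whole-corpus map, rather than the keys of the filtered map, is what keeps publishers with no matching articles in the output.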

This is just the model, so what it returns will then be serialized by the view. In our implementation of MVC, the view is purely presentational while the model is purely informational. For that reason, where the descriptions above say “show,” “return” would be more accurate, because the model output is not shown to the user directly.

Implementation at the view level

The model output has three children: all content, filtered content, and selected facets. “All content” might be a slightly confusing name, because it really refers to all content that matches the search term. If no input is provided for a search term, all articles are returned. Filtered content refers to articles and counts that result from a facet selection, so this represents some portion of all content. Selected facets is used in a script to maintain facet selections, so that users can see what they selected to get their results and can continue to fine-tune after their initial search.

These three elements hold all the content presented on the page. As with other view modules, we use typeswitch to transform this content to HTML. Unlike the other views, however, this one includes HTML elements that are not directly related to model content. For example, we need to render a search bar for the user to type into before we have a search term. The typeswitch works from an empty element <m:search-term/> to render a search bar, an informative tooltip, and a submit button. After the view has been rendered the first time, the user can enter a search term, which is passed back to the model and returned with the search results: <m:search-term>police</m:search-term>. This is the first instance of user-provided input we have used in the application, and it makes full use of the MVC architecture to create dynamic results.

The search term “police” matches 12 articles from seven different publishers. The function that handles publishers creates the list of all publishers, adding class attributes that CSS uses to grey out unavailable publishers. It also maintains any checked boxes if a user has filtered further by publisher. The article functions work much the same as in our former article list, formatting titles, dates, and publisher names and creating links to reading and XML views. Finally, an error function handles invalid search terms, passing the error message into the view.
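The typeswitch handling of <m:search-term> described above might look roughly like the sketch below. It is a simplified illustration rather than the project's view module: the namespace URI bound to the m: prefix is a placeholder, the function name local:dispatch and the form markup are invented for the example, and the tooltip is omitted.

```xquery
xquery version "3.1";
(: Assumption: placeholder URI, not the project's real model namespace :)
declare namespace m = "http://example.org/ns/model";

declare function local:dispatch($node as node()) as item()* {
    typeswitch ($node)
        (: Render the search bar, pre-filled with the term when there is one :)
        case element(m:search-term) return
            <form action="search" method="get">
                <input type="search" name="term" value="{string($node)}"/>
                <button type="submit">Search</button>
            </form>
        (: Any other element: process its children :)
        case element() return
            for $child in $node/node() return local:dispatch($child)
        case text() return $node
        default return ()
};

(
    (: First rendering: no term yet, so the input is empty :)
    local:dispatch(<m:search-term/>),
    (: Later rendering: the model echoes the user's term back :)
    local:dispatch(<m:search-term>police</m:search-term>)
)
```

Because the same case handles both the empty and the populated element, the search bar is rendered on the first visit and keeps the user's term on later ones.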

Development of this view module took place iteratively, building as it does in this tutorial on code written for the article list view. Excluded from this tutorial are decade facets which allow for improved sorting, as well as justification for many of the search design features. Our conclusion after building search from scratch (rather than implementing some kind of prefab search feature or only allowing simple full text search) is that each decision has consequences for the other decisions we made: the research required was extensive, and not necessarily applicable to another project. With that familiar refrain, we also offer resources below which you may find useful.

Further reading

For a more extensive description of the implementation of search in the original Ghost Hoax application, see https://github.com/Pittsburgh-NEH-Institute/pr-app/blob/main/pr-app-tutorials/facets-and-fields.md.

We also recommend the eXist-db documentation on fields as a helpful reference guide: https://exist-db.org/exist/apps/doc/lucene.xml?field=all&id=D3.15.73#query-fields