Skip to content

mattford63/riverford-poc

Repository files navigation

Riverford PoC

  • See the design brief here.

  • See the PoC development history here for a track of development decisions and optimisations.

Approach

Given the time constraints to do this the focus has been on a solid foundation from which to further develop out the search solution.

The essential requirement of speed (<10ms) and simple relevance has been focused on. Sophisticated search techniques and recipe management are out of scope for now.

The very first pass used clucie to get a feel for the problem space (see here). However, after talking with RiverFord a from-scratch approach was required.

Inverted indexes are excellent for building text search engines. "Introduction to Information Retrieval" by Manning, Raghavan and Schutze provides a very good overview of the process and efficiency algorithms. The code implements the methods and algorithms in the book outlined here.

The core abstraction is an index-store which contains indexes.The recipes are of a consistent format and have the following sections

  • title
  • introduction
  • ingredients
  • method

The code is written such that each section of the recipe is an index within the index-store. Allowing searches to be constrained to a particular section. Then a merge operation over the four indexes creates a 5th index all - over which the entire recipe can be searched. Also, each document is stored as it's own index, each with it's own recipe section and a combined :all index. There is also a single global :all index to search everything. All these indexes allow relevancy based search at the expense of large increase in index-store size.

A stemming algorithm (porter) and phonetic library (metaphone) is used in the linguistic tokenizer to allow searches to work over mis-spellings and changed word endings.

The speed of the search comes from using ordered vectors as the data structure for the document ids in the reverse index. A class of merge-algorithms that rely on the fact that vectors are ordered offer more speed up when performing operations. The computational expense of the initial import operation ~50s is the price paid up front for sub <5ms search later.

A DSL consisting of intersect, union and a limited not provide search operations over the index-store (and relevance based versions of the same operators).

Usage

To test run lein test. There's quite a lot of test coverage and the page also highlights the shape of the data structures used.

The PoC is currently managed by the REPL. An example section at the end of core.clj can be used to explore the functionality. Just remember to unzip and change the directory location appropriately.

Issues

  • The not implementation is naive and very restricted at present to where it can be used. The fix is to maintain in each index a full list of the document ids used in that index.
  • The storage format is very in-efficient now that each document is added as in index in the store. The memory consumption of the index-store is around x12 of the size of the on-disk space.

Future work

  • Use a more compact form of index for individual documents.
  • Migrate away from PoC code e.g. namespace fuctions appropriately, split tests etc.
  • It should be relatively quick to add one of the these clojure fuzzy stemmer algorithms to here and here. This would be a good way to make the search seem way more sophisticated.. integrates Porter Stemming (but is not merged as it requires updating all the tests).
  • At present the individual document frequencies are lost when combined. A more sophisticated reverse index data structure would allow this data to be kept - this would open up the door to relevancy scoring based on term frequency. DONE
  • Add support for mis-spellings etc
  • The tokenizer used records word position, by extending the reverse index data structure even further we could capture this also.
  • Think about whether the index-store should be atomic.
  • Think about a mechanism of updating the all index on other index updates. DONE
  • Parallelise the input (possibly transducers) but is it worth it? Update: on my desktop the import is now ~8s

License

Copyright © 2020 Matt Ford

This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.

This Source Code may also be made available under the following Secondary Licenses when the conditions for such availability set forth in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version, with the GNU Classpath Exception which is available at https://www.gnu.org/software/classpath/license.html.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published