0.3

@jbaiter jbaiter released this 25 Jul 16:03

This release brings some sweeping changes across the codebase, all aimed at making the plugin much simpler to use and less complicated to maintain. However, this also means a lot of breaking changes. It's best to go through the documentation (which has been simplified and was largely rewritten) again and see what changes you need to apply to your setup.

  • Specifying path resolving is no longer necessary. You now pass a pointer to one or more files (or regions thereof) directly in the index document. The pointer is stored with the document and used to locate the input file(s) during highlighting. Refer to the documentation for more details. This should also increase indexing performance and decrease memory requirements, since the complete OCR document no longer needs to be kept in memory.
  • hl.weightMatches now works with UTF-8. You no longer need to ASCII-encode your OCR files to be able to use Solr's superior highlighting approach. Because the plugin now reads the input files itself (see the first change), it takes care of mapping UTF-8 byte offsets to character offsets on its own. This also means all code related to storing byte offsets in payloads is gone.
  • Specifying the OCR format is no longer necessary. The plugin now offers a single OcrFormatCharFilter that will auto-detect the OCR format used for a given document and select the correct analysis chain. This means that using multiple OCR formats for the same field is now possible!
  • Performance improvements. Some optimizations were done to the way the plugin seeks through the OCR files. You should see a substantial performance improvement for documents with a low density of multi-byte codepoints, especially English. Also included is a new hl.ocr.maxPassages parameter to control how many passages are looked at for building the response, which can have an enormous impact on performance.
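
To illustrate what mapping UTF-8 byte offsets to character offsets involves, here is a minimal Python sketch. It is not the plugin's implementation (the plugin is written in Java, and its internals are not described in these notes); the function name `byte_to_char_offset` and the sample text are made up for illustration. The sketch counts Unicode code points, whereas Java strings are indexed in UTF-16 code units, so the two differ for characters outside the Basic Multilingual Plane.

```python
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Return how many code points start before `byte_offset` in UTF-8 `data`.

    Hypothetical helper for illustration only, not part of the plugin.
    """
    if not 0 <= byte_offset <= len(data):
        raise ValueError("byte offset out of range")
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx; every other
    # byte starts a new code point, so counting those gives the char offset.
    return sum(1 for b in data[:byte_offset] if (b & 0xC0) != 0x80)


text = "Füße und Straße"
data = text.encode("utf-8")

# "und" starts at byte 7 but at character 5, because "ü" and "ß"
# each occupy two bytes in UTF-8.
byte_pos = data.index("und".encode("utf-8"))
char_pos = byte_to_char_offset(data, byte_pos)
```

For pure ASCII input the byte and character offsets are identical, which fits the note above that documents with few multi-byte codepoints benefit most from the seeking optimizations.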

Major Breaking Changes:

  • HighlightComponent is now called OcrHighlightComponent for more clarity
  • OCR fields to be highlighted now need to be passed with the hl.ocr.fl parameter
  • Auto-detection of highlightable fields is no longer possible with the standard highlighter. Fields to be highlighted need to be passed explicitly with the hl.fl parameter
  • In the order of components, the OCR highlighting component needs to come before the standard highlighter to avoid conflicts.
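
As a rough sketch of what a request could look like after these changes, the following Python snippet assembles a query string that passes the OCR fields via hl.ocr.fl and the regular fields via hl.fl. The field names `ocr_text` and `title`, the query, and the helper `build_highlight_query` are hypothetical; joining several OCR fields with commas is an assumption modelled on how Solr's hl.fl accepts field lists, so check the plugin documentation before relying on it.

```python
from urllib.parse import urlencode, parse_qs


def build_highlight_query(query, ocr_fields, plain_fields, max_passages=None):
    """Build a URL-encoded Solr query string with explicit highlight fields.

    Hypothetical helper for illustration; parameter names come from the
    release notes, values and field names are assumptions.
    """
    params = {
        "q": query,
        "hl": "true",
        # OCR fields go to the OCR highlighting component
        "hl.ocr.fl": ",".join(ocr_fields),
        # regular fields must now be listed explicitly for the standard highlighter
        "hl.fl": ",".join(plain_fields),
    }
    if max_passages is not None:
        # optional cap on the number of passages examined per document
        params["hl.ocr.maxPassages"] = str(max_passages)
    return urlencode(params)


query_string = build_highlight_query(
    "ocr_text:zeitung", ["ocr_text"], ["title"], max_passages=100
)
parsed = parse_qs(query_string)
```

The resulting string can be appended to the URL of whichever request handler has the OCR highlighting component registered ahead of the standard highlighter.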