jkraemer edited this page Sep 13, 2010 · 1 revision

Advanced Usage notes

Specifying per field options

You can customize how specific fields get indexed like this:

acts_as_ferret :fields => { :title         => { :boost => 2 },
                            :description   => { :boost => 1.5,
                                                 :index => :untokenized },
                            :another_field => {}
                          }

This call gives the title field a boost (a weight expressing the field’s importance in the index) of 2 and the description field a boost of 1.5. The contents of the description field are not tokenized at indexing time. The ‘another_field’ field is indexed using acts_as_ferret’s default indexing options, which are:

:store       => :no
:index       => :yes
:term_vector => :with_positions_offsets
:boost       => 1.0

Please see http://ferret.davebalmain.com/api/classes/Ferret/Index/FieldInfo.html for more information about the possible values and what they mean.
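
The fallback to the defaults can be pictured as a simple hash merge. The following is a minimal plain-Ruby sketch under that assumption, not acts_as_ferret’s actual code: DEFAULT_FIELD_OPTIONS and effective_field_options are hypothetical names, and the default values are the ones listed above.

```ruby
# Hypothetical illustration only: models how per-field options could be
# combined with the defaults listed above. Not taken from the aaf source.
DEFAULT_FIELD_OPTIONS = {
  :store       => :no,
  :index       => :yes,
  :term_vector => :with_positions_offsets,
  :boost       => 1.0
}.freeze

# Returns a hash mapping each field name to its effective options:
# the defaults, overridden by whatever the field specifies itself.
def effective_field_options(fields)
  fields.each_with_object({}) do |(name, options), result|
    result[name] = DEFAULT_FIELD_OPTIONS.merge(options)
  end
end

fields = {
  :title         => { :boost => 2 },
  :description   => { :boost => 1.5, :index => :untokenized },
  :another_field => {}
}

effective_field_options(fields).each do |name, options|
  puts "#{name}: #{options.inspect}"
end
```

In this sketch a field given an empty hash, such as :another_field, ends up with exactly the default options, while :title and :description only override the keys they mention.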

Custom Analyzers

No Stop words

To keep Ferret from filtering stop words, use a StandardAnalyzer with an empty stop word list:

# pre-0.4.3 (two argument hashes, the second containing Ferret options)
acts_as_ferret( { :fields => { ... } }, { :analyzer => Ferret::Analysis::StandardAnalyzer.new([]) } )
# 0.4.3 and later: only one argument hash, all Ferret options go into the :ferret hash:
acts_as_ferret( :fields => { ... }, :ferret => { :analyzer => Ferret::Analysis::StandardAnalyzer.new([]) } )

Note that before 0.4.3 the analyzer option goes into the second argument hash, and from 0.4.3 on it goes into the :ferret hash. In both cases this is where all options go that are handed through to Ferret’s Ferret::Index::Index instance.

Stemming

Stemming is language-dependent, so you’ll have to create a custom analyzer. For German content
it could look like this:

class GermanStemmingAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis
  def initialize(stop_words = FULL_GERMAN_STOP_WORDS)
    @stop_words = stop_words
  end
  def token_stream(field, str)
    StemFilter.new(StopFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)), @stop_words), 'de')
  end
end

Then tell aaf to use this analyzer:

# pre-0.4.3 call style; with 0.4.3 and later, pass the analyzer inside the :ferret hash instead
acts_as_ferret({ :fields => { ... } }, :analyzer => GermanStemmingAnalyzer.new)

Disabling automatic indexing

There are some use cases where the automatic indexing done by acts_as_ferret gets in your way, be it
performance-wise (say, you just call save to store a new relationship), or because of content that
has to be saved but is not in a state you want to be found through searches yet.

Since Revision 88 we have the disable_ferret method for these cases (examples taken straight
from the unit tests):

Disable aaf for the next save call

content = Content.new(:title => 'should not get saved', :description => 'do not find me')
content.disable_ferret  # or content.disable_ferret(:once) if you like to be explicit
content.save            # no indexing here
assert Content.find_by_contents('"find me"').empty?
content.save            # now it gets indexed
assert_equal content, Content.find_by_contents('"find me"').first

Disable aaf for all following save calls

content = Content.new(:title => 'should not get saved', :description => 'do not find me')
content.disable_ferret(:always)
2.times do 
  content.save          # no indexing takes place
  assert Content.find_by_contents('"find me"').empty?
end
content.ferret_enable   # re-enable automatic indexing
content.save
assert_equal content, Content.find_by_contents('"find me"').first

Disable aaf for a block

content = Content.new(:title => 'should not get saved', :description => 'do not find me')
content.disable_ferret do
  2.times do
    content.save
    assert Content.find_by_contents('"find me"').empty?
    assert !content.ferret_enabled?        # ferret is disabled inside the block
  end
end
assert content.ferret_enabled?             # ... and enabled after that
assert Content.find_by_contents('"find me"').empty?  # no indexing has taken place

You can even let aaf do an index update after the whole block is finished:

content.disable_ferret(:index_when_finished) do
  2.times do
    content.save
    assert Content.find_by_contents('"find me"').empty?
    assert !content.ferret_enabled?
  end
end
assert content.ferret_enabled?
assert_equal content, Content.find_by_contents('"find me"').first  # it has been indexed!
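
To make the four variants easier to compare, here is a toy model of the behaviour the examples above describe. It is a plain-Ruby sketch, not the acts_as_ferret implementation: ToyRecord and its index_updates counter are hypothetical, and save merely counts how often an index update would have happened.

```ruby
# Toy model of the disable/enable semantics shown in the examples above.
# ToyRecord and index_updates are hypothetical; this is not aaf's code.
class ToyRecord
  attr_reader :index_updates   # how many times the "index" was updated

  def initialize
    @index_updates = 0
    @disabled = nil            # nil (enabled), :once or :always
  end

  def ferret_enabled?
    @disabled.nil?
  end

  def ferret_enable
    @disabled = nil
  end

  def disable_ferret(option = :once)
    if block_given?
      @disabled = :always      # disabled for the whole block
      result = yield
      ferret_enable            # re-enabled once the block is done
      @index_updates += 1 if option == :index_when_finished
      result
    elsif [:once, :always].include?(option)
      @disabled = option
    else
      raise ArgumentError, "invalid option: #{option.inspect}"
    end
  end

  def save
    if ferret_enabled?
      @index_updates += 1      # stands in for the real index update
    elsif @disabled == :once
      ferret_enable            # only this one save is skipped
    end
    true
  end
end

record = ToyRecord.new
record.disable_ferret          # same as disable_ferret(:once)
record.save                    # skipped
record.save                    # counted
puts record.index_updates      # 1
```

The block form keeps indexing off for every save inside the block, and :index_when_finished adds a single update at the end instead of one per save.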

Default Field

Using the :default_field setting, you can specify the default fields that are searched if none are specified in the query. This is crucial when using a single index to search multiple models, because in some cases the presence of stop words in a query will otherwise cause no results to be returned.

From the mailing list, Jens says:

It’s safe for you to specify the same large :default_field list containing
fields from all models in all your acts_as_ferret calls. aaf doesn’t use
this list but only hands it through to Ferret’s query parser which uses
it to expand queries that have no fields specified.

The default_field option determines which fields Ferret will search for
when there is no explicit field specified in a query.

Suppose your index has the fields :id and :text (with id being
untokenized). With an empty default_field value (or ‘*’, which means the
same), and a :or_default value of false (as aaf sets it) you get parsed
queries like this:

‘tree’
--> ‘id:tree text:tree’

‘some tree’ (meaning some AND tree because or_default == false)
--> ‘+(id:some) +(id:tree text:tree)’

With ‘some’ being a stop word, one would expect the second query to
yield the same result as the first one, but since the query is run
against all fields, including :id, which is untokenized and therefore
has no analyzer, we end up querying our id field with a required term
query and get no result at all.

I remember there has been some debate about this topic a year ago or so,
and in theory it would be possible for Ferret to parse queries the other way
around to work around this issue, but afair Dave brought up some good
reasons to leave it as it is.

The solution is to tell Ferret which fields to search when no fields are
specified for a query (or part of a query) with the :default_field
option. Usually aaf does this automatically by collecting all tokenized
fields from the model. Now with a shared index there are n models but
one index, so here we need to have a joint list of all tokenized fields
across all these models for the :default_field parameter.

Since aaf is called in every single model, I didn’t find an easy way to
build this list automatically and decided to leave it up to the user to
specify this list in the acts_as_ferret calls of every model. Not really
DRY indeed. Patches welcome ;-)

Here’s a small script reproducing the issue:
http://pastie.caboo.se/134443

So to summarize:

You need to specify :default_field if you’re using :single_index => true
in combination with :or_default => false (aaf default) and you have
queries that may contain stop words and that are not constrained to a
list of fields specified in the query string.
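
The effect described above can be reproduced with a toy model of the expansion step. This is a plain-Ruby sketch and not Ferret’s query parser: STOP_WORDS, FIELDS and expand_query are hypothetical names, every term is treated as required (as with or_default == false), and stop words are dropped only for tokenized fields because untokenized fields have no analyzer.

```ruby
# Toy model of the query expansion described above. NOT Ferret's query
# parser; STOP_WORDS, FIELDS and expand_query are hypothetical.
STOP_WORDS = %w[some the a].freeze

# Field name => whether the field is tokenized (and therefore analyzed).
FIELDS = { :id => false, :text => true }.freeze

# Expands every term of the query over the given default fields and
# marks each resulting clause as required.
def expand_query(query, default_fields = FIELDS.keys)
  clauses = query.split.map do |term|
    parts = default_fields
      .reject { |field| FIELDS[field] && STOP_WORDS.include?(term) }
      .map { |field| "#{field}:#{term}" }
    parts.empty? ? nil : "+(#{parts.join(' ')})"
  end
  clauses.compact.join(' ')
end

# All fields searched: the stop word survives as a required id clause.
puts expand_query('some tree')           # +(id:some) +(id:tree text:tree)

# Only tokenized fields searched: the stop word disappears entirely.
puts expand_query('some tree', [:text])  # +(text:tree)
```

Restricting the default fields to the tokenized ones is what the :default_field option achieves in the shared-index case. In this sketch it removes the required ‘id:some’ clause that no document can satisfy.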

Hints for older Versions

These apply to aaf <= 0.2.3 and Ferret < 0.10. You really should upgrade.

Specify per field options in the call to acts_as_ferret

You can customize how specific fields get indexed like this:

acts_as_ferret :fields => { 'title'         => { :boost => 2 },
                            'description'   => { :boost => 1.5,
                                                 :index => Ferret::Document::Field::Index::UNTOKENIZED },
                            'another_field' => {}
                          }

This call gives the title field a boost (a weight expressing the field’s importance in the index) of 2 and the description field a boost of 1.5. The contents of the description field are not tokenized at indexing time. The ‘another_field’ field is indexed using acts_as_ferret’s default indexing options, which are:

:store       => Ferret::Document::Field::Store::NO
:index       => Ferret::Document::Field::Index::TOKENIZED
:term_vector => Ferret::Document::Field::TermVector::NO
:binary      => false
:boost       => 1.0

Please see the Ferret and/or Lucene documentation for more information on what these settings mean.