Full-Text Search with Ferret

祁正阳

2023-12-01

What is Ferret

Ferret is a full-text search engine library for Ruby, modeled on Apache Lucene. To install Ferret:

gem install ferret

Only a small part of Ferret's source is Ruby; most of it is C. The Ferret API documentation is available online and includes a tutorial, the Ferret Tutorial.

Acts_As_Ferret

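Ferret is a plain Ruby library. To use it from Rails you want Jens Krämer's Acts As Ferret, which provides a simple interface for quickly building sophisticated search indexes.

Install acts_as_ferret into your Rails application as a plugin: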

ruby script/plugin install svn://projects.jkraemer.net/acts_as_ferret/tags/stable/acts_as_ferret

Basic Usage

Let's start with a simple example.

First, declare in your model which fields should be indexed:

class Member < ActiveRecord::Base
  acts_as_ferret :fields => [:first_name, :last_name]
end

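Here we index the two name fields; a search returns the records whose indexed fields match.

A simple search example

Acts as Ferret adds search methods to your ActiveRecord model. Unlike other tutorials, we start with find_id_by_contents.

When we call

total_results, members = Member.find_id_by_contents("Gregg")

the following happens:

An /index/development/member folder is created in the Rails application; the index files are kept there.
All Member records are read, and their first_name and last_name fields are indexed. From then on, the index is updated automatically whenever a Member is added, updated, or deleted. If you need to regenerate an index, delete the corresponding folder and restart the server; the index is rebuilt the next time the model is searched.
acts_as_ferret calls Ferret's search_each method to query the index.
We get some data back: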

members = [
  {:model => "Member", :id => "4", :score => "1.0"},
  {:model => "Member", :id => "21", :score => "0.93211"},
  {:model => "Member", :id => "27", :score => "0.32212"}
]

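We get the first 10 records (only 3 are shown here), each with its id and search score.

Note that even when 40 records match, we still get only 10.

How do we get more than 10 records?

find_id_by_contents accepts some options:

offset: defaults to 0. The offset of the start of the section of the result set to return. With the default, the first 10 matches are returned; with an offset of 10, the first 10 matches are skipped and the next 10 are returned. Used for paginating results.
limit: defaults to 10. The number of results to return. Also used for pagination.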
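
The effect of offset and limit can be pictured on a plain Ruby array. The sketch below is only an illustration under stated assumptions: ranked_ids and window are hypothetical names standing in for a ranked result set, and no Ferret code is involved.

```ruby
# Illustration only: a ranked list of 40 matching ids stands in for a
# real result set. `window` is a hypothetical helper, not part of
# acts_as_ferret.
def window(hits, offset: 0, limit: 10)
  # Array#[] with (start, length) returns nil when start is past the end,
  # so fall back to an empty array.
  hits[offset, limit] || []
end

ranked_ids = (1..40).to_a

first_page  = window(ranked_ids)                         # offset 0, limit 10
second_page = window(ranked_ids, offset: 10)             # skips the first 10
everything  = window(ranked_ids, limit: ranked_ids.size) # all 40 at once
```

Raising limit is the simple way to get more than 10 results in one call; combining offset and limit is what pagination builds on.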

Using a block with find_id_by_contents

results = []
total_results = Member.find_id_by_contents("Gregg") { |result|
  results.push result
}

You may not want just the ids of the results. In that case, do the conversion inside the block and collect the models instead.

results = []
total_results = Member.find_id_by_contents("Gregg") { |result|
  results.push Member.find(result[:id])
}

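Of course, there is a better way.

Using find_by_contents

@results = Member.find_by_contents("Gregg")

find_by_contents does the following:

Calls find_id_by_contents to get the set of ids
Looks up every returned id to load the actual model
Returns something that looks like an ActiveRecord collection but is really an ActsAsFerret::SearchResults object, which has some extra features shown below

Take note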

members = Member.find_by_contents("Gregg")

# It gives us total hits!
puts "Total hits = #{members.total_hits}"
for member in members
  puts "#{member.first_name} #{member.last_name}"

  # And the search Score!
  puts "Search Score = #{member.ferret_score}"
end

Note total_hits and ferret_score: neither exists in the database.
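
To make the idea of a collection that carries extra search metadata concrete, here is a minimal sketch in plain Ruby. It is not the ActsAsFerret::SearchResults implementation; ToySearchResults and ToyMember are hypothetical names invented for this illustration.

```ruby
require "delegate"

# Hypothetical stand-in for a results collection; NOT the real
# ActsAsFerret::SearchResults class.
class ToySearchResults < SimpleDelegator
  attr_reader :total_hits

  def initialize(records, total_hits)
    super(records)            # each, map, size, first... go to the array
    @total_hits = total_hits  # metadata that lives outside the database
  end
end

# A toy record with a per-result score attached.
ToyMember = Struct.new(:first_name, :last_name, :ferret_score)

page = ToySearchResults.new(
  [ToyMember.new("Gregg", "Pollack", 1.0),
   ToyMember.new("Gregg", "Jones", 0.93)],
  40 # total matches, of which this page holds only two
)

names = page.map { |m| "#{m.first_name} #{m.last_name}" }
```

The page behaves like an array of records, while total_hits reports how many records matched overall, which is the number pagination needs.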

Pagination

Note: the code below uses the example from Roman Mackovcak's blog.

Add the following method to your model:

def self.full_text_search(q, options = {})
  return nil if q.nil? || q == ""
  default_options = {:limit => 10, :page => 1}
  options = default_options.merge options

  # get the offset based on what page we're on
  options[:offset] = options[:limit] * (options.delete(:page).to_i - 1)

  # now do the query with our options
  results = Member.find_by_contents(q, options)
  return [results.total_hits, results]
end
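
The page arithmetic in full_text_search can be checked in isolation. The following is a small sketch; offset_for and page_count are hypothetical helper names, and page_count is an addition for illustration that the original method does not compute.

```ruby
# Hypothetical helper mirroring the offset arithmetic in full_text_search.
# `page` may arrive as a string from request parameters, hence to_i.
def offset_for(page, limit)
  limit * (page.to_i - 1)
end

# Hypothetical helper: how many pages are needed to show all hits.
def page_count(total_hits, limit)
  (total_hits.to_f / limit).ceil
end

offsets = [1, 2, "3"].map { |page| offset_for(page, 10) }
pages   = page_count(41, 10)
```

Page 1 maps to offset 0, so the first page starts at the best match; a partially filled last page still counts as a page.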

Add this to your application.rb:

def pages_for(size, options = {})
  default_options = {:per_page => 10}
  options = default_options.merge options
  pages = Paginator.new self, size, options[:per_page], (params[:page]||1)
  return pages
end

Add this to your controller:

def search
  @query = params[:query]
  @total, @members = Member.full_text_search(@query, :page => (params[:page]||1))
  @pages = pages_for(@total)
end

In the view:

<%= link_to 'Previous page', { :page => @pages.current.previous, :query => @query} if @pages.current.previous %>
<%= pagination_links(@pages, :params => { :query=> @query }) %>
<%= link_to 'Next page', { :page => @pages.current.next, :query => @query} if @pages.current.next %>

The code above handles most of the work, but acts_as_ferret has other excellent features.

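Other Forms of Query Strings

"Gregg Pollack" searches all fields for "Gregg" and "Pollack"
"Gregg OR Pollack" searches for "Gregg" or "Pollack"
"Gregg~" is a fuzzy search; it returns results containing terms similar to "Gregg"
"first_name:Gregg" searches for records whose first_name is "Gregg", ignoring the other indexed fields
"+first_name:Gregg -last_name:Jones" is a boolean query; it finds all records whose first_name is "Gregg" and whose last_name is not "Jones"

For more complex queries, see the Apache Lucene Parser Syntax.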
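
When query strings are assembled from form input, a small builder keeps the syntax consistent. This is a minimal sketch: boolean_query is a hypothetical helper, not part of Ferret or acts_as_ferret, and it does no escaping of special characters.

```ruby
# Hypothetical helper: builds a query string in the +field:term and
# -field:term form described above. No escaping is attempted.
def boolean_query(must: {}, must_not: {})
  required = must.map     { |field, term| "+#{field}:#{term}" }
  excluded = must_not.map { |field, term| "-#{field}:#{term}" }
  (required + excluded).join(" ")
end

query = boolean_query(must:     { first_name: "Gregg" },
                      must_not: { last_name: "Jones" })
```

The resulting string could then be passed to a search method such as find_by_contents in place of a hand-written query.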

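Adding Non-Model or Non-Standard Fields

Let's change the example. We have many books, and each book has an author. What if you want to index the author as well as the title of each book?

That involves two tables, but we cannot search across two separate indexes. Instead, we modify the model:
/model/book.rb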

class Book < ActiveRecord::Base
  acts_as_ferret :fields => [:title, :author_name]

  def author_name
    return "#{self.author.first_name} #{self.author.last_name}"
  end
end

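Now a search over books matches on the author's name as well as the title.

You can index the return value of any model method, which also lets you reformat your fields before they are indexed.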
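
The idea that a "field" is simply whatever a same-named method returns can be shown without Rails. The sketch below is an illustration only and does not reproduce acts_as_ferret's internals; ToyAuthor, ToyBook, and document_for are hypothetical names.

```ruby
# Illustration only; acts_as_ferret's real internals are not shown here.
ToyAuthor = Struct.new(:first_name, :last_name)

ToyBook = Struct.new(:title, :author) do
  # A computed "field": not a column, just a method.
  def author_name
    "#{author.first_name} #{author.last_name}"
  end
end

# Hypothetical helper: collect the value of each listed field by calling
# the method of the same name on the record.
def document_for(record, fields)
  fields.each_with_object({}) do |field, doc|
    doc[field] = record.public_send(field).to_s
  end
end

book = ToyBook.new("Story of Gregg", ToyAuthor.new("Jason", "Seifer"))
doc  = document_for(book, [:title, :author_name])
```

Because the field list only names methods, a value pulled from an associated record is treated the same way as a real column.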

For example, suppose you tag your model with acts_as_taggable and want the tags to be searchable too:

class Book < ActiveRecord::Base
  acts_as_taggable
  acts_as_ferret :fields => [:title, :tags_with_spaces]

  def tags_with_spaces
    return self.tag_names.join(" ")
  end
end

Original: If you were using the acts_as_taggable plugin you might not even need the extra function, and use ":tag_list" in the ferret field list, as shown on Johnny's Thoughts. I'm not nearly as cool though, I'm using the acts_as_taggable gem.

Translator's note: in other words, with the plugin version of acts_as_taggable you can list ":tag_list" directly in the ferret fields, as described on Johnny's Thoughts (see that blog for details). The original author uses the gem version, which is why the extra method is needed here. Keep the difference between the plugin and the gem in mind.

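Sorting

So far, results have come back ordered by search score. What if we want them ordered by a field of our choice, such as the book title?

Original: The first thing you need to do is make sure the field you are trying to sort by is untokenized. Unfortunately, by making a field untokenized I'm not indexing it to be searchable anymore. This makes for a little funky coding.

Author: "if something is untokenized it will not be searchable"

Translator's note: first make sure the field you sort by is untokenized. Because an untokenized field is no longer searchable, we add a separate copy of the title just for sorting, which means a small change to the code above.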

acts_as_ferret :fields => {
  :title => {},
  :tags_with_spaces => {},
  :title_for_sort => {:index => :untokenized}
}

def title_for_sort
  return self.title
end

Remember: to rebuild the index, just delete the corresponding folder and restart the server.
Translator's note: you can also rebuild the index with Model.rebuild_index.

s = Ferret::Search::SortField.new(:title_for_sort, :reverse => false)
@total, @members = Book.full_text_search(@query,
{:page => (params[:page]||1), :sort => s})

This gives us results sorted by book title.

To sort by date, you need to convert the date to an integer; see the Slash Dot Dash post for details.
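
One way such a conversion could look is sketched below. It is an assumption for illustration, not necessarily the approach the referenced post takes: the date is encoded as a YYYYMMDD integer so that numeric order matches chronological order. The method name date_for_sort is hypothetical.

```ruby
require "date"

# Assumption: encode a date as a YYYYMMDD integer so that numeric order
# matches chronological order. The method name is hypothetical.
def date_for_sort(date)
  date.strftime("%Y%m%d").to_i
end

dates  = [Date.new(2023, 12, 1), Date.new(2007, 3, 15), Date.new(2010, 1, 2)]
sorted = dates.sort_by { |d| date_for_sort(d) }
```

In a model, a method like this could be listed as an extra untokenized field in the same way as title_for_sort.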

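Storing Fields

Before moving on to the next (rather important) section, we need to look at how indexed data is stored.

If you look at your index files now, you will find that your data is not there. By default, acts_as_ferret does not store your data in a recoverable form; it only indexes it.

"So what if my data is small and I want to store it in the index?"

Good question. If your data is small and you only care about a few fields, you can speed things up by storing those fields in the index.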

acts_as_ferret :fields => {
  :title => {:store => :yes},
  :author_name => {:store => :yes}
}

When we run a query, we name the stored fields that should be lazily loaded ("lazy load"):

@books = Book.find_by_contents("Jason", :lazy => [:title, :author_name])

This way, rendering the page does not hit the database.

<% @books.each do |book| %>
<li>"<%= book.title %>" by
<%= book.author_name %></li>
<% end %>

Highlighting Search Terms

Here is how to highlight search terms with Ferret.
There is one precondition: the fields you search must be stored (must have your search fields stored), as described above.
So we make a small change to the view code above:

<% @books.each do |book| %>
<li>
"<%= book.highlight("Jason", :field => :title, :num_excerpts => 1, :pre_tag => "", :post_tag => "") %>" by
<%= book.highlight("Jason", :field => :author_name, :num_excerpts => 1, :pre_tag => "", :post_tag => "") %>
</li>
<% end %>

Your search results will look like this:
1. “Story of Gregg” by Jason Seifer
2. “Jason’s Book” by Gregg Pollack
3. “Gregg certainly is the Man” by Jason Seifer

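Highlighting can do more than this. For example, if the field you search is long, such as a blog post, it returns an excerpt with the search terms highlighted. For more, see Highlight in the API.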
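
What the pre_tag and post_tag options do to a matched term can be pictured with a few lines of plain Ruby. This is a toy illustration, not Ferret's highlighter: toy_highlight is a hypothetical function, it does no excerpting, and the strong tags are an assumed choice for the example.

```ruby
# Illustration only: NOT Ferret's highlighter. Wraps each case-insensitive
# occurrence of `term` in the given tags. The <strong> tags are an
# assumption made for this example.
def toy_highlight(text, term, pre_tag: "<strong>", post_tag: "</strong>")
  pattern = /#{Regexp.escape(term)}/i
  text.gsub(pattern) { |match| "#{pre_tag}#{match}#{post_tag}" }
end

title    = toy_highlight("Jason's Book", "jason")
no_match = toy_highlight("Story of Gregg", "Jason")
```

Text without a match comes back unchanged, which is why the same call can be applied to every field of every result.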

Using Boost (Setting Search Priority)

Finally, a look at the boost option, which raises the weight of a field when results are ranked.

acts_as_ferret :fields => {
  :title => {:boost => 2},
  :author => {:boost => 0}
}

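This code says that matches in title should rank ahead of matches in author. It does not mean that every title match appears before every author match: if a record's author matches the search term exactly, that record can still come first.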

Perhaps this feature should be called “Nudge” instead of “Boost”. I thought I could use a large boost to get all the title results to appear above the author results. I was mistaken, one can only “Nudge” the scores, but never separate them, as I was hoping.

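Translator's note: have a look at the linked article. The English passage above means that boost only raises priority partially; do not expect it to separate the title results from the author results.

Production Use

Because many people run this in production, the general consensus is that you need to run a DRb server.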