Full-Text Search Framework: Using haystack in Django

云洋

2023-12-01

Full-text search framework: haystack

Full-text search engines: whoosh, solr, Xapian, Elasticsearch

Chinese word segmentation package: jieba

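Step 1: Install haystack. Activate your virtual environment first; if you are not using one, skip that command and install directly.

workon <your-virtualenv-name>
pip3 install django-haystack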

Step 2: Install the full-text search engine whoosh, which is written in pure Python.

pip3 install whoosh

Step 3: In settings.py, register haystack in INSTALLED_APPS and configure it.

INSTALLED_APPS = [
  ....
  ...
  ...
  'haystack'
]

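# Add this block manually (os must be imported at the top of settings.py)
HAYSTACK_CONNECTIONS = {
    'default': {
        # Use the whoosh engine
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        # Path to the index files (created automatically)
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
    }
}

# Update the index automatically whenever data is added, modified, or deleted.
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'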
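
The settings above rely on BASE_DIR without showing where it comes from. As a minimal sketch, assuming the conventional Django layout in which settings.py sits one package below the project root (the /srv/myproject location is purely hypothetical), the index path would resolve roughly like this:

```python
import os

# Hypothetical location of settings.py; a real settings.py would use __file__ instead.
settings_file = os.path.join(os.sep, "srv", "myproject", "myproject", "settings.py")

# Conventional Django definition: the project root is two levels up from settings.py.
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(settings_file)))

HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
        # whoosh writes its index files into this directory under the project root
        "PATH": os.path.join(BASE_DIR, "whoosh_index"),
    }
}

print(HAYSTACK_CONNECTIONS["default"]["PATH"])
```

If your project defines BASE_DIR differently (for example with pathlib), only the PATH expression needs to change.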

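Step 4: Create a model index for the model you want to search. In the app that owns the model, create a file named search_indexes.py (the file name is fixed) and define the index class in it.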

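# Import haystack's index classes
from haystack import indexes
# Import the model to be indexed
from goods.models import GoodsSKU


# Model index class (builds an index over selected data of the given model)
# Naming convention: model class name + Index
class GoodsSKUIndex(indexes.SearchIndex, indexes.Indexable):
    # Index field: use_template=True means the model fields to index are listed in a separate template file
    text = indexes.CharField(document=True, use_template=True)

    def get_model(self):
        # Return the model class being indexed
        return GoodsSKU

    def index_queryset(self, using=None):
        # The rows to include in the index
        return self.get_model().objects.all()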

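Step 5: Under the templates folder create a search folder (fixed name), inside it an indexes folder (fixed name), and inside that a goods folder (named after the app that owns the indexed model). In the goods folder create the file goodssku_text.txt (lowercase model class name followed by _text.txt), which gives templates/search/indexes/goods/goodssku_text.txt.

{# Model fields used to build the index data #}
{{ object.name }}  {# index on the product name #}
{{ object.desc }}  {# index on the product summary #}
{{ object.goods.detail }}  {# index on the product detail #}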
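
The naming rule for the data template is easy to get wrong, so here is a small sketch of it. haystack's documented convention is search/indexes/<app_label>/<model_name>_<field_name>.txt; the helper name index_template_path is hypothetical and exists only for illustration, since haystack resolves this path internally.

```python
def index_template_path(app_label, model_name, field_name="text"):
    """Build the data-template path haystack looks up for a use_template field.

    Hypothetical helper for illustration only; haystack computes this itself.
    """
    return "search/indexes/{}/{}_{}.txt".format(
        app_label, model_name.lower(), field_name
    )


# The GoodsSKU model in the goods app, indexed through the field named "text".
print(index_template_path("goods", "GoodsSKU"))
```

The path is relative to a template directory, so the file ends up under templates/ in the layout used here.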

Step 6: Run the command that builds the index

python3 manage.py rebuild_index
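
Conceptually, building the index means mapping each term to the records that contain it (an inverted index), so a search does not have to scan every row. The following is a toy sketch of that idea in plain Python; it is not whoosh's on-disk format, and the product rows are hypothetical stand-ins for GoodsSKU records.

```python
from collections import defaultdict

# Hypothetical product rows standing in for GoodsSKU records.
products = {
    1: "fresh strawberry 500g",
    2: "strawberry jam",
    3: "fresh grape",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Return the ids of matching documents in sorted order."""
    return sorted(index.get(term, set()))


index = build_index(products)
print(search(index, "strawberry"))
print(search(index, "fresh"))
```

Splitting on whitespace works for the English sample text but not for Chinese, which is why a segmenter such as jieba matters later in this walkthrough.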

Step 7: Submit the search box through a form.

<form method="get" action="/search">
    <input type="text" class="input_text fl" name="q" placeholder="Search products">
    <input type="submit" class="input_btn fr" name="" value="Search">
</form>

Step 8: Add a route to the project's root urls.py

url(r'^search/', include('haystack.urls')),

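Step 9: haystack passes the search results to search.html under templates/search, so we need to create templates/search/search.html; adapting the existing home page template is enough. The context passed to the template includes:

query: the search keyword
page: the Page object for the current page. Iterating over it yields SearchResult instances; the model instance is in each result's object attribute.
paginator: the Paginator object (haystack paginates the results it finds); HAYSTACK_SEARCH_RESULTS_PER_PAGE controls how many results are shown per page.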

{# search.html: loop over the search results #}
{% for item in page %}
    <li>
        <a href="{% url "goods:detail" item.object.id %}"><img src="{{ item.object.image.url }}"></a>
        <h4><a href="{% url "goods:detail" item.object.id %}">{{ item.object.name }}</a></h4>
        <div class="operate">
            <span class="prize">￥{{ item.object.price }}</span>
            <span class="unit">{{ item.object.price }}/{{ item.object.unite }}</span>
            <a href="#" class="add_goods" title="Add to cart"></a>
        </div>
    </li>
{% endfor %}



{# search.html: render the pagination links #}
{% if page.has_previous %}
    <a href="/search?q={{ query }}&page={{ page.previous_page_number }}">Previous</a>
{% endif %}
{% for pindex in paginator.page_range %}
    {% if pindex == page.number %}
        <a href="/search?q={{ query }}&page={{ pindex }}" class="active">{{ pindex }}</a>
    {% else %}
        <a href="/search?q={{ query }}&page={{ pindex }}">{{ pindex }}</a>
    {% endif %}
{% endfor %}
{% if page.has_next %}
    <a href="/search?q={{ query }}&page={{ page.next_page_number }}">Next></a>
{% endif %}
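
The pagination links above interpolate the query straight into the URL. Keywords that contain spaces, & or non-ASCII characters need percent-encoding; in a Django template the built-in urlencode filter can do that. The same idea in plain Python, using a hypothetical page_link helper, might look like this:

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def page_link(query, page_number, base="/search"):
    """Build a pagination URL with the query safely percent-encoded."""
    return "{}?{}".format(base, urlencode({"q": query, "page": page_number}))


# A keyword with non-ASCII characters and a space.
link = page_link("手机 壳", 2)
print(link)

# Decoding the link recovers the original keyword and page number.
params = parse_qs(urlsplit(link).query)
print(params["q"][0], params["page"][0])
```

Round-tripping through parse_qs is a quick way to check that nothing in the keyword was lost or misread as a separate parameter.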

Step 10: Replace whoosh's built-in analyzer with the jieba Chinese word segmenter

Step 10.1: Install the jieba Chinese word segmentation package

pip3 install jieba

>>> import jieba
>>> text = "伟大领袖毛主席"
>>> resp = jieba.cut(text, cut_all=True)
>>> resp
<generator object Tokenizer.cut at 0x7f3e2aed5db0>
>>> for val in resp:
...     print(val)
... 
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.918 seconds.
Prefix dict has been built succesfully.
伟大
伟大领袖
领袖
毛主席
主席
>>>
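
The session above shows that cut_all=True (full mode) returns overlapping words rather than a single split of the sentence. As a rough illustration of that behaviour only, the toy function below emits every dictionary word it finds, ordered by start position. It is not jieba's implementation, which is considerably more elaborate, and toy_dict is a hypothetical mini dictionary.

```python
def full_mode_cut(text, dictionary):
    """Yield every dictionary word found in text, ordered by start position."""
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in dictionary:
                yield word


# Hypothetical mini dictionary; jieba ships a far larger one.
toy_dict = {"伟大", "伟大领袖", "领袖", "毛主席", "主席"}
print(list(full_mode_cut("伟大领袖毛主席", toy_dict)))
```

Overlapping tokens make the index larger but let a search for either the short or the long word hit the same record.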

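Step 10.2: Go to the haystack/backends folder inside the virtual environment, i.e. .virtualenvs/<your-virtualenv-name>/lib/python3.6/site-packages/haystack/backends, create a new file named ChineseAnalyzer.py there, and add the following content.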

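import jieba
from whoosh.analysis import Tokenizer, Token


class ChineseTokenizer(Tokenizer):
    """whoosh tokenizer that delegates word segmentation to jieba."""

    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=False, removestops=True,
                 start_pos=0, start_char=0, mode='', **kwargs):
        # A single Token object is reused and updated for every word
        t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
        # Full mode: emit every dictionary word found in the text
        seglist = jieba.cut(value, cut_all=True)
        for w in seglist:
            t.original = t.text = w
            t.boost = 1.0
            # value.find(w) returns the first occurrence, so offsets of repeated words are approximate
            if positions:
                t.pos = start_pos + value.find(w)
            if chars:
                t.startchar = start_char + value.find(w)
                t.endchar = start_char + value.find(w) + len(w)
            yield t


def ChineseAnalyzer():
    # Used in place of StemmingAnalyzer() in the backend
    return ChineseTokenizer()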
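
One caveat of the tokenizer above is that value.find(w) always returns the first occurrence, so a word that appears twice gets the same offset both times. Below is a minimal sketch of one way around this, assuming the words arrive ordered by start position (as in the full-mode output shown earlier). The token_offsets helper is hypothetical and not part of whoosh or jieba.

```python
def token_offsets(value, words):
    """Return (word, start, end) triples, searching forward from the previous start."""
    offsets = []
    search_from = 0
    prev_word = None
    for w in words:
        # Step past the previous hit when the same word repeats back to back.
        begin = search_from + 1 if w == prev_word else search_from
        start = value.find(w, begin)
        if start == -1:
            # Ordering assumption violated: fall back to a plain search.
            start = value.find(w)
        if start == -1:
            # Not a substring at all; skip it.
            continue
        offsets.append((w, start, start + len(w)))
        search_from = start
        prev_word = w
    return offsets


value = "北京大学在北京"
words = ["北京", "大学", "在", "北京"]
print([value.find(w) for w in words])                          # naive starts
print([start for _, start, _ in token_offsets(value, words)])  # forward-searching starts
```

Searching from the previous start rather than the previous end keeps overlapping full-mode tokens, such as a short word and a longer word beginning at the same character, on the right offsets.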

Step 10.3: In the same folder, copy whoosh_backend.py to a new file named whoosh_cn_backend.py and modify the copy as follows:

# Import the file we just created
from .ChineseAnalyzer import ChineseAnalyzer


# Replace whoosh's built-in analyzer with ChineseAnalyzer
# Change this line (line 164)
schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(), field_boost=field_class.boost, sortable=True)
# to
schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=ChineseAnalyzer(), field_boost=field_class.boost, sortable=True)

Step 10.4: Update the engine in the full-text search configuration in settings.py

HAYSTACK_CONNECTIONS = {
    'default': {
        # Use the whoosh engine
        # 'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'ENGINE': 'haystack.backends.whoosh_cn_backend.WhooshEngine',
        # Path to the index files (created automatically)
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
    }
}

Step 10.5: Run the command again to rebuild the index

python3 manage.py rebuild_index
