Django Tutorial: Adding Search with Django Haystack, Full-Text Search and Keyword Highlighting

马梓

2023-12-01

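Most django-haystack tutorials available today cover the same ground, and this one also started from those tutorials before being tested in a real project. What it adds is an explanation of the difference between page and page_obj in Haystack templates (page_obj is recommended), and a fix for the case where highlighting a search keyword replaces the rest of the text with "...".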

Requirements

Sort the search results for the user's keyword by date, and highlight the keyword in the results.

Installing dependencies

Environment:

Windows 10 (64-bit)
Python 3
Django 1.11

Packages:

django-haystack
whoosh
jieba

django-haystack

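Haystack is an application dedicated to search. The django-haystack package gives Django modular search: it exposes one consistent API over several search engines, including Solr, Elasticsearch, Whoosh, and Xapian, so you can swap the search backend without changing your own code. This keeps the amount of code down and makes custom search as easy as possible to integrate.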

pip install django-haystack

Whoosh

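Whoosh is a library of classes and functions for indexing and searching text, written in pure Python. It is small, easy to use, and simple to configure. Out of the box, however, Whoosh only tokenizes English-style text, so Chinese word segmentation has to be added separately. This tutorial uses Whoosh as the search engine behind django-haystack.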

pip install whoosh

jieba

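jieba aims to be the best Python component for Chinese word segmentation. It is a capable segmentation library with solid Chinese support, so it is used here as the Chinese tokenizer for Whoosh.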

pip install jieba
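
To see why a separate segmenter is needed, here is a minimal sketch comparing whitespace splitting with a toy dictionary-based segmenter. It uses forward maximum matching, which is a simple textbook technique and not jieba's actual algorithm; the vocabulary and both function names are made up for illustration.

```python
def whitespace_tokens(text):
    """Split on whitespace, which is roughly enough for English text."""
    return text.split()


def forward_max_match(text, vocabulary, max_word_len=4):
    """Greedy segmentation: at each position take the longest known word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink; a single character
        # is always accepted as a fallback.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary; a real segmenter ships a large dictionary.
VOCAB = {"全文", "检索", "关键词", "高亮"}
sentence = "全文检索关键词高亮"

print(whitespace_tokens(sentence))         # a single undivided token
print(forward_max_match(sentence, VOCAB))  # four separate words
```

With whitespace splitting the whole sentence is indexed as one token, so a search for 检索 alone would not match it; after segmentation each word becomes its own searchable term.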

Configuration

Registering Haystack

blog --> blog --> settings.py

Register the app in INSTALLED_APPS. The order of apps matters, so place haystack near the top.

INSTALLED_APPS = [
    ...
    'haystack',
    ...
]

Configuring Haystack and Whoosh

import os
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
    }
}

HAYSTACK_SEARCH_RESULTS_PER_PAGE = 5

HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'

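ENGINE: the search engine django-haystack uses. With Whoosh this is haystack.backends.whoosh_backend.WhooshEngine; it is changed later when jieba segmentation is added.
PATH: where the index files are stored, currently the whoosh_index folder under the project root BASE_DIR (created automatically when the index is built). This is also changed later.
HAYSTACK_SEARCH_RESULTS_PER_PAGE: paginates the search results, here 5 results per page. Covered separately below.
HAYSTACK_SIGNAL_PROCESSOR: updates the index automatically whenever an article changes (that is, whenever the model data is updated). Blog articles are not updated often, so real-time updates are not a problem. This setting replaces haystack.indexes.RealTimeSearchIndex, which was used in search_indexes.py in the 1.x releases and is now deprecated.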

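Configuring Whoosh and jieba

Adding jieba Chinese word segmentation:

Copy whoosh_backend.py from the virtual environment into the app directory, that is, copy ./Lib/site-packages/haystack/backends/whoosh_backend.py into the root of blog --> apps --> config.
Rename whoosh_backend.py to whoosh_cn_backend.py (any name works).

Edit whoosh_cn_backend.py, at roughly lines 160 to 170, as follows:

Import the jieba analyzer: from jieba.analyse import ChineseAnalyzer
Replace analyzer=StemmingAnalyzer() with analyzer=ChineseAnalyzer()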

Original

...
schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=StemmingAnalyzer(), field_boost=field_class.boost, sortable=True)
...

Modified

from jieba.analyse import ChineseAnalyzer

...
schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=ChineseAnalyzer(), field_boost=field_class.boost, sortable=True)
...

In blog --> blog --> settings.py, change the ENGINE of HAYSTACK_CONNECTIONS to config.whoosh_cn_backend.WhooshEngine

import os
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'config.whoosh_cn_backend.WhooshEngine',
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
    }
}

HAYSTACK_SEARCH_RESULTS_PER_PAGE = 5

HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'

Next, decide where the index should live. In blog --> blog --> settings.py, change the PATH of HAYSTACK_CONNECTIONS to

import os
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'config.whoosh_cn_backend.WhooshEngine',
        # THEME must already be defined elsewhere in settings.py
        'PATH': os.path.join(BASE_DIR, 'templates', THEME, 'templates', 'search', 'whoosh_index'),
    }
}

HAYSTACK_SEARCH_RESULTS_PER_PAGE = 5

HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'

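Creating the index

Introducing SearchIndex

A SearchIndex object is how Haystack decides which data goes into the search index and how that data flows in. Indexes are defined per app: for each app you want indexed, create a search_indexes.py file in that app's directory. The file must have exactly this name.

SearchIndex fields follow a standard pattern. To build one, you normally subclass indexes.SearchIndex and indexes.Indexable, declare the fields whose data should be stored, and define a get_model method that returns the model class to index.

In search_indexes.py, create ArticleIndex to pair with the Article model so that Haystack can pick it up automatically. By convention, the class is named after the model you want to search plus Index, hence ArticleIndex.

Introducing search_indexes.py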

from haystack import indexes
from .models import Article

class ArticleIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)

    def get_model(self):
        return Article

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

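Every index must have exactly one field with document=True. This tells Haystack and the search engine to use that field's content as the primary text to search.

The field marked document=True is conventionally named text. This naming is used consistently across SearchIndex classes to avoid confusion in the backend; you can rename it, but that is not recommended.
Setting use_template=True on the text field lets you build the indexed document from a data template. Put simply, the template lists what should be searchable. For example, to search the title and body fields of the Article model, list those fields in the template. The default data template for use_template lives at templates/search/indexes/app/article_text.txt, with the following content: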
```
{{ object.title }}
{{ object.body }}
```
Notes:

app is the name of the app that contains search_indexes.py.

In article_text.txt, article is the model name (Article) in all lowercase, followed by _text.txt.
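
The naming rule can be sketched as a small function. This assumes the app is labeled blog; default_index_template is a hypothetical helper that only mirrors the convention described above and is not Haystack's own code.

```python
def default_index_template(app_label, model_name, field_name="text"):
    """Build the conventional data-template path for an indexed field."""
    # The model name is lowercased; the field name is the document field.
    return "search/indexes/{}/{}_{}.txt".format(
        app_label, model_name.lower(), field_name
    )


# Relative to a templates/ directory:
print(default_index_template("blog", "Article"))
# search/indexes/blog/article_text.txt
```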

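Refining the index

Use a custom data template: rename article_text.txt, move it wherever you like, and point to it with template_name.
Return a custom queryset for indexing, ordered by creation time descending. In my testing this setting had no effect, so a custom views.py is needed instead.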

from haystack import indexes
from .models import Article
# import datetime

class ArticleIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True, template_name="search/indexes/config/article_text.txt")

    def get_model(self):
        return Article

    def index_queryset(self, using=None):
        # return self.get_model().objects.filter(create_date__lte=datetime.datetime.now())
        return self.get_model().objects.all()

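Views and routing

**There are two ways to do this, which differ in whether the template receives the results as page or as page_obj. Configure only one of them.**

Both page and page_obj hold a list of SearchResult objects, not model instances.

Using page

URL configuration for page

Project URLs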

from django.conf.urls import url, include

urlpatterns = [
	...
    url(r'^search/', include('haystack.urls')),
    ...
]

Search form for page

In header.html

<form method="get" id="searchform" action="{% url 'haystack_search' %}">
    <div class="input-group">
        <input type="search" class="blog-header-search" placeholder="search..." name="q">
        <button type="submit" class="blog-header-search-btn"><i class="fa fa-search"></i></button>
    </div>
</form>

The form's method must stay "get", and the search input's name="q" is also fixed.
action is the address the form submits to; the search itself is handled by the search engine.

Search results page for page

The haystack_search view passes the search results to the template templates/search/search.html, so create that template to render them:

{% extends 'base.html' %}

{% load blog_simple_tag %}

{% block content_detail %}

<header class="blog-post-page-title">
   <!--<marquee><font color="blue" size="2">You will make it!</marquee>-->
    <h3>You are on the "Search" page. Keyword: {{ query }}</h3>
</header>

<div class="blog-main-post">
    {% for article in page.object_list  %}
        <div class="index-post-br"></div>
        <article class="blog-post-block">
            {% if article.object.img_link %}
                <header>
                    <div class="blog-post-block-img">
                        <img src="{{ article.object.img_link }}" alt="">
                    </div>
                </header>
            {% endif %}
            <div class="blog-post-block-padding">
                <a href="{% url 'blog:article' article.object.id %}">{% search_highlight article.object.title query %}</a>
                <section>
                    {{ article.object.summary }}
                </section>
                <footer>
                    <span>
                        <i class="fa fa-folder-o"></i>
                        <a href="{% url 'blog:category' article.object.category %}" itemprop="url" rel="index">{{ article.object.category }}</a>
                    </span>

                    <span>
                        <time datetime="{{ article.object.create_date }}"><i class="fa fa-clock-o"></i> Created {{ article.object.create_date }}</time>
                        {% if article.object.update_date > article.object.create_date %}
                        <time datetime="{{ article.object.update_date }}"><i class="fa fa-clock-o"></i> Updated {{ article.object.update_date }}</time>
                        {% endif %}
                    </span>

                </footer>
            </div>
        </article>

    {% empty %}
        <div class="blog-main-post">
            <div class="no-post">No matching content found. Please try another search.</div>
        </div>
    {% endfor %}

    {% if page.has_previous or page.has_next %}
        <nav id="pagination" class="blog-pagination" >
            <article class="blog-post-page-readmore">
            {% if page.has_previous %}
                <a class="blog-post-page-readmore-prev" href="?q={{ query }}&amp;page={{ page.previous_page_number }}"
                   data-toggle="tooltip" data-placement="top"
                   title="Page&nbsp;{{ page.number }}&nbsp;of&nbsp;{{ paginator.num_pages }}">
                    Previous page</a>
            {% endif %}


            {% if page.has_next %}
                <a class="blog-post-page-readmore-next" href="?q={{ query }}&amp;page={{ page.next_page_number }}"
                   data-toggle="tooltip" data-placement="top"
                   title="Page&nbsp;{{ page.number }}&nbsp;of&nbsp;{{ paginator.num_pages }}">
                    Next page</a>
            {% endif %}
            </article>
        </nav>
    {% endif %}
</div>
{% endblock %}


{% block sidebar_toc %}
{% include 'sidebar.html' %}

{% endblock %}

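query is the search keyword.
page.object_list is the list of SearchResult objects returned for the current page.
search_highlight article.object.title query highlights the keyword; covered separately below.
page.has_previous or page.has_next drives the previous/next pagination links.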
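
The pagination links in the template build their address as ?q=...&page=N. A minimal sketch of that URL shape, using only the standard library; search_url is a hypothetical helper, not part of Haystack or Django, and /search/ is assumed as the mount point.

```python
from urllib.parse import urlencode


def search_url(base, query, page=None):
    """Build a search URL with an optional page number."""
    params = {"q": query}
    if page is not None:
        params["page"] = page
    # urlencode escapes spaces and other special characters in the keyword.
    return "{}?{}".format(base, urlencode(params))


print(search_url("/search/", "django haystack"))     # /search/?q=django+haystack
print(search_url("/search/", "django haystack", 2))  # /search/?q=django+haystack&page=2
```

In the template itself, Django's built-in urlencode filter ({{ query|urlencode }}) performs comparable escaping for the q value, which matters once keywords contain spaces or characters such as &.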

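Using page_obj

View configuration for page_obj (views.py)

Once a custom view is configured, the descending-by-time ordering returned from search_indexes.py no longer applies (in my testing it never took effect anyway); results follow the queryset in views.py, here id descending. So a custom view is the way to go whenever you want to customize the queryset.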

from haystack.generic_views import SearchView
from haystack.query import SearchQuerySet


class BlogSearchView(SearchView):
    context_object_name = 'search_list'
    # Order the results by id, descending
    queryset = SearchQuerySet().order_by('-id')

A SearchView also exists in haystack.views, but that class has no as_view attribute. Do not confuse the two, and do not use:
from haystack.views import SearchView

URL configuration for page_obj

App URLs

from django.conf.urls import url, include
from .views import BlogSearchView

urlpatterns = [
	...
    url(r'^search/$', BlogSearchView.as_view(), name='search'),
    ...
]

Search form for page_obj

In header.html; the only difference from the page version is the action.

<form method="get" id="searchform" action="{% url 'blog:search' %}">
    <div class="input-group">
        <input type="search" class="blog-header-search" placeholder="search..." name="q">
        <button type="submit" class="blog-header-search-btn"><i class="fa fa-search"></i></button>
    </div>
</form>

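The form's method must stay "get", and the search input's name="q" is also fixed.
action is the address the form submits to; the search itself is handled by the search engine.
blog is the namespace declared in the project's urls.py, and search is the URL name defined in the app's urls.py.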

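Search results page for page_obj

BlogSearchView passes the search results to the template templates/search/search.html, so create that template to render them. The only difference from the page version is the variable name: replace every page in search.html with page_obj.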

Highlighting keywords

Highlight style

Use either the HTML or the CSS approach; one is enough.

Adding the style in HTML

In base.html, add it inside the head:

    <style>
        span.highlighted {
            color: red;
        }
    </style>

Adding the style in CSS

In blog.css, append it at the end:

.highlighted {
    color: red;
}

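Title highlighting

Title highlighting gets its own section because the default highlight tag can fail to show the full title: when the keyword is highlighted, the rest of the text is shown as "...".

Cause:

myblog->venv->Lib->site-packages->haystack->utils->highlighting.py

start_offset and end_offset mark where the highlighted excerpt starts and ends. If the excerpt starts partway through the text, everything before it is replaced with "...".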

        if start_offset > 0:
            highlighted_chunk = '...%s' % highlighted_chunk

        if end_offset < len(self.text_block):
            highlighted_chunk = '%s...' % highlighted_chunk

        return highlighted_chunk
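
The effect on a short title can be reproduced with a simplified model of this logic. The sketch below is not Haystack's implementation: it only assumes that the excerpt window starts at the first match and spans at most max_length characters (200 is an assumed value here), and it leaves out the span wrapping entirely. naive_highlight_window is a hypothetical name.

```python
def naive_highlight_window(text, keyword, max_length=200):
    """Simplified excerpt logic: start the window at the first match."""
    start = text.lower().find(keyword.lower())
    if start < 0:
        start = 0
    end = min(start + max_length, len(text))
    chunk = text[start:end]
    # Anything cut off before or after the window becomes an ellipsis.
    if start > 0:
        chunk = "..." + chunk
    if end < len(text):
        chunk = chunk + "..."
    return chunk


title = "Django Haystack full-text search"
print(naive_highlight_window(title, "haystack"))
# ...Haystack full-text search
```

Even though the whole title is far shorter than the window, the leading "Django " is dropped because the window begins at the match.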

Fixes:

Option 1:

In the same file, myblog->venv->Lib->site-packages->haystack->utils->highlighting.py, add a check at around line 158: if the text is shorter than max_length, return it in full.

highlighted_chunk += text[matched_so_far:]

if len(self.text_block) < self.max_length:  
    return self.text_block[:start_offset] + highlighted_chunk

if start_offset > 0:
    highlighted_chunk = '...%s' % highlighted_chunk

if end_offset < len(self.text_block):
    highlighted_chunk = '%s...' % highlighted_chunk
return highlighted_chunk

{% highlight article.object.title with query %}

{% load highlight %}

            <div class="blog-post-block-padding">
                <a href="{% url 'blog:article' article.object.id %}">{% highlight article.object.title with query %}</a>
                <section>
                    {{ article.object.summary }}
                </section>

Option 2 (recommended): {% search_highlight article.object.title query %}

{% load blog_simple_tag %}

<div class="blog-post-block-padding">
                <a href="{% url 'blog:article' article.object.id %}">{% search_highlight article.object.title query %}</a>
                <section>
                    {{ article.object.summary }}
                </section>

blog_simple_tag.py

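import re

from django import template
from django.utils.safestring import mark_safe

register = template.Library()


@register.simple_tag
def search_highlight(text, q):
    """Highlight the search term in a title, ignoring case."""
    if len(q) > 1:
        # re.escape makes the keyword match literally, so input such as "c++"
        # cannot produce an invalid pattern.
        text = re.sub(re.escape(q), lambda a: '<span class="highlighted">{}</span>'.format(a.group()),
                      text, flags=re.IGNORECASE)
        text = mark_safe(text)
    return text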
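
The same idea can be tried outside Django. This is a standard-library sketch, not the tag itself: highlight_plain is a hypothetical function that additionally HTML-escapes the text around each match, one possible way to keep a title containing < or & from being emitted as raw markup.

```python
import html
import re


def highlight_plain(text, q, css_class="highlighted"):
    """Wrap case-insensitive matches of q in a span, escaping everything else."""
    if len(q) < 2:
        return html.escape(text)
    pattern = re.compile(re.escape(q), flags=re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the text between matches, then wrap the match itself.
        parts.append(html.escape(text[last:match.start()]))
        parts.append('<span class="{}">{}</span>'.format(
            css_class, html.escape(match.group())))
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(highlight_plain("Django Haystack", "hay"))
# Django <span class="highlighted">Hay</span>stack
print(highlight_plain("C++ <tips>", "c++"))
# <span class="highlighted">C++</span> &lt;tips&gt;
```

The matched text keeps its original casing, and the full title is always returned, so nothing is replaced by "...".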

Other highlighting

Use either the {% highlight article.object.summary with query %} tag or the custom search_highlight tag:

<section>
{% highlight article.object.summary with query %}
{% search_highlight article.object.summary query %}
</section>

If the article summary is taken from the article body, the max_length parameter may be useful to limit the length: {% ... max_length 130 %}

Paginating results

Pagination was already configured through HAYSTACK_SEARCH_RESULTS_PER_PAGE in settings.py, in the Haystack and Whoosh configuration section above.

...
HAYSTACK_SEARCH_RESULTS_PER_PAGE = 5
...

HAYSTACK_SEARCH_RESULTS_PER_PAGE: paginates the search results, here 5 results per page.
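
To make the arithmetic concrete, here is a small sketch of how 5 results per page translates into page counts and slices. page_bounds is a hypothetical helper, not Haystack or Django API; its keys simply mirror the has_previous and has_next attributes used in the template.

```python
import math


def page_bounds(total_results, page_number, per_page=5):
    """Compute the page count and the slice of results shown on one page."""
    num_pages = max(1, math.ceil(total_results / per_page))
    start = (page_number - 1) * per_page
    end = min(start + per_page, total_results)
    return {
        "num_pages": num_pages,
        "slice": (start, end),  # zero-based, end exclusive
        "has_previous": page_number > 1,
        "has_next": page_number < num_pages,
    }


print(page_bounds(12, 3))
```

With 12 results, page 3 is the last of three pages and holds only the final two results, so it has a previous page but no next page.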

Using pagination in the template

search.html

Haystack ships with pagination built in; just use it, after deciding between page and page_obj.

 {% if page.has_previous or page.has_next %}
        <nav id="pagination" class="blog-pagination" >
            <article class="blog-post-page-readmore">
            {% if page.has_previous %}
                <a class="blog-post-page-readmore-prev" href="?q={{ query }}&amp;page={{ page.previous_page_number }}"
                   data-toggle="tooltip" data-placement="top"
                   title="Page&nbsp;{{ page.number }}&nbsp;of&nbsp;{{ paginator.num_pages }}">
                    Previous page</a>
            {% endif %}


            {% if page.has_next %}
                <a class="blog-post-page-readmore-next" href="?q={{ query }}&amp;page={{ page.next_page_number }}"
                   data-toggle="tooltip" data-placement="top"
                   title="Page&nbsp;{{ page.number }}&nbsp;of&nbsp;{{ paginator.num_pages }}">
                    Next page</a>
            {% endif %}
            </article>
        </nav>
    {% endif %}

Building the index

python manage.py rebuild_index
or
python manage.py update_index