Using the SWISH-E Search Engine

颛孙钱青

2023-12-01

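Unlike search engines such as Lucene, SWISH-E is a standalone tool: by writing a configuration file and running commands, you can index and search documents without writing any other program. However, SWISH-E 2.x does not support Unicode character sets; support starts with version 3.0. Being limited to English alone would be too restrictive :)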

The core of using SWISH-E is the configuration file: everything it does is controlled by the directives set there.

1. Configuration

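Each line of the configuration file sets one SWISH-E configuration directive; lines beginning with # are comments.

With a well-written configuration file, SWISH-E handles retrieval tasks well. The most basic configuration directives are: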

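IndexDir: the directory to index

IndexOnly: the file types (suffixes) to index

MetaNames: the tags (meta names) that should be searchable (used when indexing XML and HTML files)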

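WordCharacters defines how documents are split into words: any character that is not in the WordCharacters set acts as a separator, so a string containing such characters is split into several words. For example, with:

    WordCharacters abde

a string such as "abcde" is split into two words: "ab" and "de".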

Related directives are IgnoreFirstChar, IgnoreLastChar, BeginCharacters, and EndCharacters. For example:

    WordCharacters  .abcdefghijklmnopqrstuvwxyz
    BeginCharacters abcdefghijklmnopqrstuvwxyz
    EndCharacters   abcdefghijklmnopqrstuvwxyz
    IgnoreFirstChar .
    IgnoreLastChar  .

Given the string:

    Please visit http://www.example.com/path/to/file.html.

the resulting words are:

    please
    visit
    http
    www.example.com
    path
    to
    file.html

With these settings, once the index is built, a search for www.example.com finds the document, but a search for example alone does not.
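The splitting rules described above can be approximated in a few lines of Python. This is a simplified model for illustration only, not SWISH-E's actual implementation; the function name split_words and its parameters are assumptions made for this sketch.

```python
import string


def split_words(text, word_chars, begin_chars, end_chars,
                ignore_first, ignore_last):
    """Approximate SWISH-E style word splitting (illustrative model only)."""
    words = []
    current = []
    # Any character outside word_chars ends the current word.
    # The trailing space flushes the last word.
    for ch in text.lower() + " ":
        if ch in word_chars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    result = []
    for word in words:
        # Strip ignorable leading and trailing characters.
        word = word.lstrip(ignore_first).rstrip(ignore_last)
        # Keep the word only if it begins and ends with allowed characters.
        if word and word[0] in begin_chars and word[-1] in end_chars:
            result.append(word)
    return result


letters = string.ascii_lowercase
tokens = split_words(
    "Please visit http://www.example.com/path/to/file.html.",
    word_chars="." + letters,
    begin_chars=letters,
    end_chars=letters,
    ignore_first=".",
    ignore_last=".",
)
print(tokens)
```

Because the dot is a word character but is stripped only at the start and end of a word, www.example.com stays a single token, which is why a search for example alone finds nothing.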

Buzzwords [*list of buzzwords*|File: path]

Buzzwords are specific terms that are indexed as they are, regardless of the word character settings.

For example: Buzzwords C++ TCP/IP

IgnoreWords [*list of stop words*|File: path]

Defines words to be ignored during indexing, commonly called stopwords.

UseWords [*list of words*|File: path]

Only the words defined with UseWords are indexed.

FileRules [type] [contains|is|regex] *regular expression*

FileMatch [type] [contains|is|regex] *regular expression*

The FileRules and FileMatch directives exclude and include, respectively, directories and files when indexing.

The match types are:

    FileRules pathname
    FileRules dirname
    FileRules filename
    FileRules directory
    FileRules title
    FileMatch pathname
    FileMatch filename
    FileMatch dirname

For example:

    # Don't index paths that contain private or hidden
    FileRules pathname contains (private|hidden)
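To get a feel for what such a rule excludes, the pattern can be tried against candidate paths with Python's re module. This is only a rough model of the "pathname contains" match type; SWISH-E's own matching may differ in details such as case handling, and the sample paths are made up for illustration.

```python
import re

# Hypothetical model of: FileRules pathname contains (private|hidden)
EXCLUDE_PATTERN = re.compile(r"(private|hidden)")


def is_excluded(pathname):
    """Return True if the pattern matches anywhere in the pathname."""
    return EXCLUDE_PATTERN.search(pathname) is not None


# Sample paths invented for this sketch.
paths = [
    "./docs/index.html",
    "./private/notes.html",
    "./docs/hidden/draft.html",
]
kept = [p for p in paths if not is_excluded(p)]
print(kept)
```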

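A basic example. Suppose the configuration file myconf contains:

    # Directory to index
    IndexDir ./
    # File types to index
    IndexOnly .html
    # Tag (meta name) to index
    MetaNames subjects

This configuration indexes the current directory, restricts indexing to .html files, and makes the subjects tag searchable.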

Then run the command:

[root@moxuansheng demo]# swish-e -c myconf

When indexing finishes, the result is displayed:

Indexing Data Source: "File-System"

Indexing "./"

Removing very common words...

no words removed.

Writing main index...

Sorting words ...

Sorting 29 words alphabetically

Writing header ...

Writing index entries ...

Writing word text: Complete

Writing word hash: Complete

Writing word data: Complete

29 unique words indexed.

4 properties sorted.

4 files indexed. 310 total bytes. 34 total words.

Elapsed time: 00:00:00 CPU time: 00:00:00

Indexing done!
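The same steps can be scripted. The following is a minimal sketch, assuming swish-e is on the PATH and that the configuration file is named myconf as above; the helper names render_config and build_index are assumptions made for this example.

```python
import shutil
import subprocess
from pathlib import Path


def render_config(directives):
    """Render (name, value) pairs as one directive per line."""
    return "".join(f"{name} {value}\n" for name, value in directives)


config_text = render_config([
    ("IndexDir", "./"),
    ("IndexOnly", ".html"),
    ("MetaNames", "subjects"),
])


def build_index(config_path="myconf"):
    """Write the configuration file and run swish-e -c on it."""
    if shutil.which("swish-e") is None:
        raise RuntimeError("swish-e not found on PATH")
    Path(config_path).write_text(config_text)
    subprocess.run(["swish-e", "-c", config_path], check=True)


print(config_text, end="")
```

Calling build_index() writes myconf into the current directory and runs the indexer; it is deliberately not called automatically here.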

2. Searching

swish-e -w word

-m *number* (max results)

-b *number* (beginning result)

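When the result set is large, -m and -b can be combined to page through the results.

The following example continues from the index built above.

SWISH-E searches with the command: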
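As an illustration of paging, the -b and -m values for a given page can be computed as follows. The function name paging_args and the page size of 20 are assumptions for this sketch; -b counts results starting from 1.

```python
def paging_args(page, page_size):
    """Return the -b/-m arguments for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    begin = (page - 1) * page_size + 1  # -b counts results from 1
    return ["-b", str(begin), "-m", str(page_size)]


command = ["swish-e", "-w", "word"] + paging_args(page=2, page_size=20)
print(" ".join(command))
```

For the second page of 20 results this prints swish-e -w word -b 21 -m 20.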

[root@moxuansheng demo]# swish-e -w securities

Search terms can also be combined with the logical operators and and or.

The output is:

# SWISH format: 2.4.5

# Search words: securities

# Removed stopwords:

# Number of hits: 1

# Search time: 0.002 seconds

# Run time: 0.012 seconds

1000 ./new.txt "new.txt" 64

To search content inside HTML files, specify the corresponding tag in the query; here subjects is the tag name.

[root@moxuansheng demo]# swish-e -w subjects=content

The output is:

# SWISH format: 2.4.5

# Search words: subjects=content

# Removed stopwords:

# Number of hits: 1

# Search time: 0.002 seconds

# Run time: 0.012 seconds

1000 ./my.html "my.html" 96
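Output in this shape is easy to post-process. The sketch below parses result lines of the form shown above (rank, path, quoted title, size) and skips the # header lines; it assumes the default output format, and the function name parse_results is made up for this example.

```python
import re

# rank, path, "title", size
RESULT_LINE = re.compile(r'^(\d+) (.+) "(.*)" (\d+)$')


def parse_results(output):
    """Parse swish-e result lines into dicts, skipping '#' header lines."""
    hits = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RESULT_LINE.match(line)
        if match:
            rank, path, title, size = match.groups()
            hits.append({"rank": int(rank), "path": path,
                         "title": title, "size": int(size)})
    return hits


sample = """\
# SWISH format: 2.4.5
# Search words: subjects=content
# Removed stopwords:
# Number of hits: 1
# Search time: 0.002 seconds
# Run time: 0.012 seconds
1000 ./my.html "my.html" 96
"""
hits = parse_results(sample)
print(hits)
```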

3. Crawling and indexing web pages

swish-e can also fetch, parse, and index web pages.

The configuration file is:

    # Target URL
    IndexDir http://www.msn.com
    # How many links the spider should follow before stopping
    # (the crawl depth).
    MaxDepth 2
    # Temporary directory; swishspider needs it, otherwise an error occurs.
    TmpDir /home/kuangtu/swish-e-2.4.5/demo/tmpDir
    # Location of the Perl helper script called swishspider.
    SpiderDirectory /home/kuangtu/swish-e-2.4.5/src

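Build the index with swish-e -c myconf -S http; the -S option selects the http access method.

Note also that swishspider is a Perl script that uses the LWP::UserAgent and HTML::Parser modules. They were not installed on my machine and had to be installed separately: download HTML-Parser-3.59.tar.gz, HTML-Tagset-3.20.tar.gz, and libwww-perl-5.823.tar.gz, and install them.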
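Before running the spider, it can help to check whether the Perl modules load at all. The sketch below uses the common perl -MModule -e 1 idiom; the module list and the helper names module_check_command and missing_modules are assumptions made for this example.

```python
import shutil
import subprocess

# Modules assumed to be needed, based on the packages named above.
REQUIRED_MODULES = ("LWP::UserAgent", "HTML::Parser", "HTML::Tagset")


def module_check_command(module):
    """Build a perl command that exits non-zero if the module cannot load."""
    return ["perl", f"-M{module}", "-e", "1"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the modules that perl fails to load (requires perl on PATH)."""
    if shutil.which("perl") is None:
        raise RuntimeError("perl not found on PATH")
    missing = []
    for module in modules:
        result = subprocess.run(module_check_command(module),
                                capture_output=True)
        if result.returncode != 0:
            missing.append(module)
    return missing


print(module_check_command("LWP::UserAgent"))
```

Calling missing_modules() returns the names that still need to be installed; an empty list means all of them load.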
