
Introduction to Zettair (a full-text search engine: an efficient inverted-index implementation written in C)



1    Introduction to Zettair

1.1    Zettair overview

Search engines are usually based on a special structure, called an inverted index, which is used to answer queries quickly. There are two disadvantages resulting from this approach. First, the inverted index must be constructed before it can be searched; second, the index takes up additional space on the computer's hard disk. However, both costs are negligible if the index is queried a few hundred times a day and can be used to find information that would otherwise be lost in the depths of a pile of documents.

An inverted index is a well researched and understood structure. It is documented and discussed in a few research papers and books, such as MG ("Managing Gigabytes").
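
To make the structure concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only and says nothing about Zettair's on-disk format; the function names (tokenize, build_index, lookup) and the tokenising rule are assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (an assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def lookup(index, term):
    """Answer a single-term query without scanning any document text."""
    return index.get(term.lower(), [])


docs = [
    "Call me Ishmael.",
    "The whale, the white whale!",
    "A ship on the sea.",
]
index = build_index(docs)
print(lookup(index, "whale"))  # [1]
print(lookup(index, "the"))    # [1, 2]
```

The construction pass over every document is the up-front cost mentioned above; each later query is a single dictionary lookup.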

These pages begin with a tutorial overview of using Zettair. They then document in more detail how to use Zettair to build an index and how to query that index. There are also some pointers for those wishing to hack the Zettair source code.

Feel free to drop us (the Search Engine Group) a line at zettair@cs.rmit.edu.au if you have any questions or comments, or visit the Zettair home page for more information:

http://www.seg.rmit.edu.au/zettair/index.html

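1.2    Advantages of the Zettair engine
http://hi.baidu.com/shichunqi/item/9666691d060c434b6926bb1b

Zettair is written in C, and it performs well on every measure: CPU usage, memory usage, index storage size, and query time. It also supports incremental indexing, result summaries, file type selection, stemming, result ranking, ranking strategies, and search types. In an analysis of overall search engine performance on a small data set (TREC-4) and a large data set (WT10g), Zettair was among the most complete open-source engines.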

1.3    Zettair quick-start guide
1.3.1    Prerequisites
This page contains a tutorial-style introduction to Zettair. It will show you how to index a document collection using Zettair, and how to run queries against that index. It only introduces the basic functionality for each use. For more details, see the pages on building and searching using Zettair.
This tutorial assumes that you have downloaded, unpacked, compiled, and installed the Zettair distribution. If you have not done so, please grab Zettair from the Zettair home page, unpack it, and follow the installation instructions in the INSTALL file contained within the distribution. We also assume that the zet executable has been installed in your PATH. If not, use the full path for the executable (for instance, /usr/local/zettair/bin/zet).
1.3.2    Building an index
To help you get up and running quickly with Zettair, we have included the full text for Herman Melville's Moby Dick as a sample document collection for you to play around with. This can be found in the subdirectory txt of the Zettair distribution.

Let's begin by indexing Moby Dick. To do this, change your current directory to txt. (You can index it from anywhere, but this is simplest.) We'll assume that the zet executable is in your PATH; otherwise, substitute the full pathname to the executable wherever you see 'zet' below. So, let's build this index:
$ zet -i moby.txt

The '-i' argument tells zet that we're building a new index. The text of Moby Dick is less than 1.3 MB in length, so this won't take long to run. Zettair is more used to working with document collections of 10 GB or more, but it won't complain. When it's finished running, you should see four new files in the current directory, all prefixed with "index". These are Zettair's index files.

1.3.3    Querying the index
So now we're ready to run some queries. To do this, we run zet again, this time without any options:
$ zet

Zettair will load up the index (very quickly, in this case), and then prompt you for input. Let's test the rumour that Moby Dick has something to say about whales:

> whale
1. Chapter 32, Paragraph 46 (score 0.814503, docid 713)
2. Chapter 32, Paragraph 23 (score 0.687340, docid 690)
3. Chapter 32, Paragraph 25 (score 0.542362, docid 692)
4. Chapter 32, Paragraph 8 (score 0.489850, docid 675)
5. Chapter 32, Paragraph 22 (score 0.488983, docid 689)
6. Chapter 32, Paragraph 26 (score 0.484616, docid 693)
7. Chapter 75, Paragraph 10 (score 0.453542, docid 1552)
8. Chapter 32, Paragraph 21 (score 0.433975, docid 688)
9. Chapter 41, Paragraph 7 (score 0.403410, docid 875)
10. Chapter 81, Paragraph 47 (score 0.402218, docid 1646)
11. Chapter 41, Paragraph 3 (score 0.378583, docid 871)
12. Chapter 56, Paragraph 5 (score 0.367106, docid 1236)
13. Chapter 0, Paragraph 74 (score 0.340201, docid 74)
14. Chapter 45, Paragraph 17 (score 0.333519, docid 969)
15. Chapter 32, Paragraph 35 (score 0.332929, docid 702)
16. Chapter 45, Paragraph 5 (score 0.331750, docid 957)
17. Chapter 87, Paragraph 21 (score 0.330964, docid 1723)
18. Chapter 91, Paragraph 6 (score 0.327630, docid 1796)
19. Chapter 55, Paragraph 7 (score 0.326381, docid 1223)
20. Chapter 68, Paragraph 6 (score 0.324882, docid 1411)
20 results of 791 shown (took 0.000702 seconds)

This tells us that the word "whale" occurs in 791 documents in the collection (which is to say, paragraphs in Moby Dick). Zettair thinks the most pertinent paragraph is paragraph 46 of chapter 32. We can ask Zettair to print out this document using the 'cache' directive and specifying the document's docid:

> [cache:713]
<DOC>
<DOCNO>Chapter 32, Paragraph 46</DOCNO>
Beyond the DUODECIMO, this system does not proceed, inasmuch as the Porpoise is the smallest of the whales. Above, you have all the Leviathans of note. But there are a rabble of uncertain, fugitive, half-fabulous whales, which, as an American whaleman, I know by reputation, but not personally. I shall enumerate them by their fore-castle appellations; for possibly such a list may be valuable to future investigators, who may complete what I have here but begun. If any of the following whales, shall hereafter be caught and marked, then he can readily be incorporated into this System, according to his Folio, Octavo, or Duodecimo magnitude:--The Bottle-Nose Whale; the Junk Whale; the Pudding-Headed Whale; the Cape Whale; the Leading Whale; the Cannon Whale; the Scragg Whale; the Coppered Whale; the Elephant Whale; the Iceberg Whale; the Quog Whale; the Blue Whale; etc. From Icelandic, Dutch, and old English authorities, there might be quoted other lists of uncertain whales, blessed with all manner of uncouth names. But I omit them as altogether obsolete; and can hardly help suspecting them for mere sounds, full of Leviathanism, but signifying nothing.
</DOC>

Don't worry about the <DOC> and <DOCNO> tags: that's just part of the TREC (Text REtrieval Conference) format we've used to mark up Moby Dick for indexing. You'll notice that the word 'whale' occurs often, which is why Zettair thinks this is probably the paragraph you're looking for.
1.3.4    Multi-word queries
You can, of course, query for more than one word at a time. Say we were looking for a particular kind of whale:

> white whale
1. Chapter 42, Paragraph 4 (score 1.429675, docid 897) [...]
20. Chapter 48, Paragraph 2 (score 0.752030, docid 1002)
20 results of 852 shown (took 0.000801 seconds)
Hmm, 852 paragraphs--but "whale" only occurs in 791! Well, what Zettair is reporting here is all the documents with either "white" or "whale" in them. We can specify that we only want documents in which both words occur:

> white AND whale
1. Chapter 59, Paragraph 4 (score 1.255408, docid 1269) [...]
20. Chapter 54, Paragraph 88 (score 0.806199, docid 1191)
20 results of 130 shown (took 0.000330 seconds)

or, probably more to the point, only documents that the exact phrase "white whale" occurs in:

> "white whale"
1. Chapter 36, Paragraph 41 (score 1.357970, docid 789) [...]
20. Chapter 52, Paragraph 4 (score 0.840175, docid 1088)
20 results of 91 shown (took 0.000307 seconds)
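
The three query forms above narrow the result set step by step (852, then 130, then 91 paragraphs). The following Python sketch shows why, using a toy positional index. It illustrates the matching semantics only and leaves out ranking; the helper names and the four sample documents are made up for this example and are not part of Zettair.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def match_or(index, terms):
    """Documents containing at least one of the terms (the default operator)."""
    result = set()
    for term in terms:
        result |= set(index.get(term, {}))
    return result


def match_and(index, terms):
    """Documents containing every term."""
    sets = [set(index.get(term, {})) for term in terms]
    return set.intersection(*sets) if sets else set()


def match_phrase(index, terms):
    """Documents in which the terms occur consecutively, in order."""
    hits = set()
    for doc_id in match_and(index, terms):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = ["the white whale", "a whale, white as snow", "white sails", "the ship"]
index = build_positional_index(docs)
query = ["white", "whale"]
print(sorted(match_or(index, query)))      # [0, 1, 2]
print(sorted(match_and(index, query)))     # [0, 1]
print(sorted(match_phrase(index, query)))  # [0]
```

Every phrase match is also an AND match, and every AND match is also an OR match, which is why the counts can only shrink from one form to the next.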

1.3.5    Document summaries
This is great so far (or at least, we hope you think so), but it gets tiresome having to individually request each document to see if it's what we're looking for, especially if the documents are longer than a single paragraph. What we really want is for the list of results to include a summary of each document. And we can ask Zettair to provide just this to us.

To do so, we'll have to restart Zettair. Hit CONTROL-D, or whatever key combination indicates end of input on your system, to end your current session. This time, we'll run the zet executable with the '--summary' option to indicate that we'd like to see document summaries, and what form we want these summaries to be in. We'll also restrict output to just the top 2 results:
$ zet --summary=capitalise -n 2

Zettair can highlight your search terms within the document summaries in a number of different ways, capitalise being one of them. So, let's try out some summaries:

> ship sea storm
1. Chapter 9, Paragraph 18 (score 2.973852, docid 261)
A dreadful STORM comes on, the SHIP is like to break... He sees no black sky and raging SEA, feels not the reeling timbers, and little hears he or heeds he the far rush of the mighty whale, which even now with open mouth is cleaving the SEAS after him.
2. Chapter 121, Paragraph 4 (score 2.431819, docid 2294)
What's the mighty difference between holding a mast's lightning-rod in the STORM, and standing close by a mast that hasn't got any lightning-rod at all in a STORM?
2 results of 650 shown (took 0.002139 seconds)

> "dark blue ocean"
1. Chapter 35, Paragraph 11 (score 4.953089, docid 745)
"Roll on, thou deep and DARK BLUE OCEAN, roll! Ten thousand blubber-hunters sweep over thee in vain."
1 results of 1 shown (took 0.002945 seconds)
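
For readers curious how such highlighting might work, here is a rough Python sketch of the plain, capitalise and tag styles. It is not Zettair's code; the function name highlight is hypothetical. It matches whole words only, so unlike the sample output above it would not highlight SEAS for the query term sea.

```python
import re


def highlight(summary, terms, mode="capitalise"):
    """Highlight whole-word, case-insensitive matches of the query terms."""
    if mode == "plain" or not terms:
        return summary
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    if mode == "capitalise":
        return pattern.sub(lambda m: m.group(0).upper(), summary)
    if mode == "tag":
        return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", summary)
    raise ValueError("unknown mode: " + mode)


text = "A dreadful storm comes on, the ship is like to break"
print(highlight(text, ["ship", "sea", "storm"]))
# A dreadful STORM comes on, the SHIP is like to break
print(highlight("the ship", ["ship"], mode="tag"))
# the <b>ship</b>
```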

And that concludes our tour.

1.4    Building indexes with Zettair
1.4.1    Index construction
Zettair can build inverted indexes by parsing different types of source collections. Please read the format descriptions to understand fully how an index is constructed from the given data. Currently, the following index types are supported: HTML and TREC.


Usage: zet -i file1 ... fileN

Index construction options:

    -i
    Put Zettair into index construction mode (as opposed to searching mode).

    file1 ... fileN
    The given files (file1 ... fileN) are files to index for searching. If no files are given then a list of filenames, separated by whitespace, is read from stdin. This allows you to pipe a list of filenames to index in from a file or shell command. The command:
    find . -name "*.c" -or -name "*.h" | ./zet -i -f source_index
    would find all files with c and h extensions and index them, placing the result into a set of files that start with source_index.

    -f,--filename prefix
    Give the name of the index to use. If no name is given, 'index' will be used as the default. The prefix can include directory path components.

    -c,--config config_file
    Use this configuration file for the parser. The configuration file determines which tags the parser attempts to extract text from. The format is a simple text file where the name of a tag (minus the angled brackets) is followed by a number that indicates whether parsing should be turned on or off after this tag. See config/psettings.xml for an example.

    Causes zettair to use around 500MB of memory during indexing (by default, around 20MB is used).

    Allows zettair to add new postings to an existing index. By default, attempting this causes an error.

    --stem{ none | eds | light | porters }
    Use the given stemming algorithm during index construction. none means no stemming. eds removes 'e', 'ed', and 's'. light is a custom stemmer that is fast, but slightly less effective than Porter's stemming. porters is Porter's stemming, a slow, complex, well-known stemming algorithm.

    --anh-impact
    Generate impact-ordered inverted lists during construction. This is required to use impact-ordered evaluation during querying.

    -t { TREC | HTML }
    Select the type of the index, TREC or HTML (default: autodetect).
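
As a rough idea of what a suffix-stripping stemmer such as eds does, the Python sketch below repeatedly removes a trailing 'ed', 'e' or 's'. Zettair's actual eds rules are not given in this document, so the repetition, the minimum stem length and the name strip_eds are all assumptions made for illustration.

```python
def strip_eds(word, min_stem=3):
    """Repeatedly strip a trailing 'ed', 'e' or 's' from a lowercased word.

    min_stem is an assumed guard that leaves very short words untouched.
    """
    stem = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in ("ed", "e", "s"):
            if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
                stem = stem[: -len(suffix)]
                changed = True
                break
    return stem


for word in ("whales", "whale", "hunted", "hunts", "sea"):
    print(word, "->", strip_eds(word))
```

The point of stemming at index time is that variants such as "whales" and "whale" end up under the same index term, so a query for one can find the other.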

Sample Command Line:
zet -i -f disk45 -c /research/zettair/config/parser_settings.trec -t TREC /research/TREC/disk45/fbis /research/TREC/disk45/fr /research/TREC/disk45/ft /research/TREC/disk45/latimes

This command will use the TREC parser to create an inverted index from the four listed files. You should then find index files whose names start with disk45.

1.4.2    Indexed document formats
1. HTML Format
The HTML parser treats each file as one document in HTML format. Text is extracted from HTML documents according to the parser settings file, documented above.
2. TREC Format
It is often advantageous to combine several (thousand) documents in one file and be able to index and search on one single file rather than a few thousand files. This can be done by writing the information of several files into one file and formatting the one file in such a way that original document boundaries can be detected by the parser. The parser will extract words from the given file in much the same way as in HTML mode. Additionally, the TREC parser looks for the tags <DOC> and </DOC> to signal the beginning and end of a document, and identifies each document via its TREC document number, which is found between <DOCNO> and </DOCNO> tags. The TREC format is named as such because it is the format used by the Text Retrieval Conference (TREC) for experimental data.

The following excerpt from the Bible represents, for instance, 8 documents (of which 4 documents contain only one word).

<DOC> And the sons of Noah, that went forth of the ark, were Shem, and Ham,
and Japheth: and Ham is the father of Canaan. </DOC>
<DOC> genesis </DOC>
<DOC> These are the three sons of Noah: and of them was the whole earth overspread.</DOC>
<DOC> genesis </DOC>
<DOC> And Noah began to be an husbandman, and he planted a vineyard:</DOC>
<DOC> genesis </DOC>
<DOC> And he drank of the wine, and was drunken; and he was uncovered within his tent.</DOC>
<DOC> genesis </DOC>
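
The document boundaries described above can be recovered with very little code. The Python sketch below splits a TREC-style file into (docno, text) pairs. It is a simplified illustration, not Zettair's parser: it assumes upper-case tags exactly as shown, and the name split_trec is hypothetical.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)


def split_trec(text):
    """Return (docno, body) pairs; docno is None when a document has no <DOCNO>."""
    docs = []
    for match in DOC_RE.finditer(text):
        content = match.group(1)
        docno_match = DOCNO_RE.search(content)
        docno = docno_match.group(1).strip() if docno_match else None
        body = DOCNO_RE.sub("", content)
        # Collapse runs of whitespace so multi-line documents become one line.
        docs.append((docno, " ".join(body.split())))
    return docs


sample = """<DOC> <DOCNO>Chapter 32, Paragraph 46</DOCNO> Beyond the DUODECIMO </DOC>
<DOC> genesis </DOC>"""
for docno, body in split_trec(sample):
    print(docno, "|", body)
```

Run over the Bible excerpt above, the same function yields eight pairs, four of which have the single-word body "genesis".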

1.5    Querying indexes with Zettair
This page documents how you can use Zettair to query an inverted index. There are two executables that can be used for querying indexes built by Zettair:

    zet Used for general querying

    zet_trec Used for TREC experiments. The input is a TREC topic file, and the output is in a format that can be used with the trec_eval program.
1.5.1    Similarity metric options
These can be used with either zet or zet_trec to change the similarity metric used by Zettair.

    Use the Okapi BM25 metric.

    Set the k1 parameter for the Okapi BM25 metric to the specified floating point value.

    Set the b parameter for the Okapi BM25 metric to the specified floating point value.

    Set the k3 parameter for the Okapi BM25 metric to the specified floating point value.

    Use the pivoted cosine metric, with the pivot provided as a floating point value.

    Use the cosine metric.

    Use Dave Hawking's adaptation of the Okapi BM25 metric, with the alpha value provided as a floating point number.

    Use Anh and Moffat's impact-ordered evaluation, including separate metric. --anh-impact must have been used when building the index in order to employ impact-ordered query evaluation.

    Use the Dirichlet-smoothed, query-likelihood language modelling metric with mu value given as an unsigned integer.
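
To show where the k1, b and k3 parameters enter, here is the Okapi BM25 formula in its usual textbook form, written as a Python sketch. It is not taken from Zettair's source, and Zettair's implementation may differ in detail; the function name and the default parameter values are assumptions chosen for illustration.

```python
import math


def bm25_score(query_tf, doc_tf, doc_freq, n_docs, doc_len, avg_doc_len,
               k1=1.2, b=0.75, k3=1000.0):
    """Score one document against a query with textbook Okapi BM25.

    query_tf: term -> frequency in the query
    doc_tf:   term -> frequency in the document
    doc_freq: term -> number of documents containing the term
    """
    # b controls how strongly long documents are penalised.
    length_norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, f_qt in query_tf.items():
        f_dt = doc_tf.get(term, 0)
        if f_dt == 0:
            continue
        f_t = doc_freq[term]
        idf = math.log((n_docs - f_t + 0.5) / (f_t + 0.5))
        # k1 controls how quickly repeated occurrences in the document saturate.
        doc_part = ((k1 + 1.0) * f_dt) / (length_norm + f_dt)
        # k3 plays the same role for repeated terms in the query.
        query_part = ((k3 + 1.0) * f_qt) / (k3 + f_qt)
        score += idf * doc_part * query_part
    return score


stats = dict(doc_freq={"whale": 100}, n_docs=1000, avg_doc_len=100)
once = bm25_score({"whale": 1}, {"whale": 1}, doc_len=100, **stats)
often = bm25_score({"whale": 1}, {"whale": 3}, doc_len=100, **stats)
long_doc = bm25_score({"whale": 1}, {"whale": 3}, doc_len=200, **stats)
print(once, often, long_doc)
```

With these inputs the document that mentions the term three times outscores the one that mentions it once, and the same three mentions count for less in a document twice the average length.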
1.5.2    Index querying
Usage: zet [query1 ... queryN]

Index querying options:

    -f prefix
    Give the name of the index to use. If no name is given then 'index' is used by default. The prefix may contain directory path elements.

    -n results
    Sets the maximum number of results returned in response to each query. The default is 20.

    Instructs Zettair to read queries from the given file, instead of from stdin.

    Uses the words contained in the given filename as stop words (not evaluated) during querying. If no filename is given, a default stop list is loaded.

    Instructs Zettair to use approximately 500MB of memory while querying. The default memory usage should be around 20MB.

    -b first_result
    Sets the number of results to skip for each query. This can be useful in obtaining more results for a query without repeating those already obtained. The default is 0.

    --summary={ plain | capitalise | tag | none }
    Choose the type of document summarisation to perform. none means do not provide document summaries with the query results; this is the default. The other alternatives specify how to highlight the search terms in the summary. plain specifies not to highlight the search terms. capitalise highlights the search terms by capitalising them. tag highlights them by surrounding them with <b> tags.

    query1 ... queryN
    For searching, the given queries (query1 ... queryN) are Google-like queries that are used to search the index. Queries consist of keywords and phrases (represented "like this") optionally separated by the operators AND and OR (operators MUST be capitalised). The default operator is OR. Search is case-insensitive, except for recognition of AND and OR. Stopping and stemming are not performed. All results are ranked by relevance. Note that the Google operator '-' and modifiers are not currently supported.
    If no queries are found in the command line, Zettair will start in interactive mode. In this mode queries are read from standard input and executed. Interactive mode exits once it can no longer read from standard input. You can cause it to exit by entering the end-of-file control character, typically control-d.

    Print version information.

    Print a help message.

Sample Command Line:
zet -n 10 -f disk45
Example queries:

    mail configuration
    searches for the word 'mail' and the word 'configuration'. Pages returned can have either word, or both (OR query) in upper, lower or mixed case.
    mail AND configuration
    searches for pages that have the words 'mail' and 'configuration' in them.
    shakespeare "to be or not to be"
    searches for the word 'shakespeare' and the phrase 'to be or not to be'.

Note that if you are entering queries at the command line, you will probably have to escape (using the backslash or other means) double quotes for phrases, e.g.
$ zet "this is a query \"with a phrase\""
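
When queries are generated by a script, the escaping can be left to a library. The Python sketch below builds such a command line with the standard shlex module; the helper name zet_command is hypothetical, and combining -f and -n with a query argument in one call is an assumption based on the options listed above.

```python
import shlex


def zet_command(query, index_prefix="index", max_results=20):
    """Build a shell command line that passes one query to zet as a single argument."""
    args = ["zet", "-f", index_prefix, "-n", str(max_results), query]
    return " ".join(shlex.quote(arg) for arg in args)


cmd = zet_command('this is a query "with a phrase"')
print(cmd)
# zet -f index -n 20 'this is a query "with a phrase"'
```

Here shlex.quote wraps the query in single quotes, so the inner double quotes reach zet unchanged without any backslashes.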

1.5.3    TREC index querying
Usage: zet_trec index

TREC querying options:

    Add TREC topic_file to list of topic files to process.

    Add files listed in file to list of topic files to process.

    Output run_id as id for this evaluation (run_id is a text field in trec_eval output).

    Number of results to output per query.

    Use topic titles in queries (this is the default if none of -t, -a or -d are specified).

    Use topic descriptions in queries.

    Use topic narratives in queries.

    Print queries to stderr as they are constructed from the topic file and resolved.

    Print the total time taken in querying to stderr after all topics have been resolved. The time printed excludes index loading time.

    Insert dummy entries for topics that have no answers in the results set. This has been required for TREC terabyte submissions in the past.

    Don't stop if a query cannot be constructed from a topic. Useful when running large, noisy query logs.

    Uses the words contained in the given filename as stop words (not evaluated) during querying. If no filename is given, a default stop list is loaded.

    Instructs Zettair to use approximately 500MB of memory while querying. The default memory usage should be around 20MB.

    Instead of printing search results in TREC format, the results are evaluated against the given Qrels file, in TREC Qrel format, and trec_eval-like output is produced.

    index
    The name of the index that is queried using the TREC topic files.

    Print help message.

    Print version information.

Sample Command Line:
./zet_trec -f /research/TREC-7/topics.351-400 -n 1000 disk45 > query.log

The file query.log can then be evaluated with trec_eval against pre-prepared relevance judgements.
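
To give a feel for what such an evaluation involves, the Python sketch below computes precision at k from a run file and a relevance file. It assumes that query.log follows the layout trec_eval conventionally reads (topic, the literal Q0, document number, rank, score, run id) and that judgements use the usual four qrels columns (topic, iteration, document number, relevance). The sample lines and document numbers are made up, and this is no substitute for trec_eval itself.

```python
from collections import defaultdict


def parse_run(lines):
    """Map topic -> document numbers in ranked order (six-column run format assumed)."""
    ranked = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip anything that is not a result line
        topic, _q0, docno, rank, _score, _run_id = parts
        ranked[topic].append((int(rank), docno))
    return {topic: [d for _, d in sorted(pairs)] for topic, pairs in ranked.items()}


def parse_qrels(lines):
    """Map topic -> set of document numbers judged relevant (relevance > 0)."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and int(parts[3]) > 0:
            relevant[parts[0]].add(parts[2])
    return dict(relevant)


def precision_at(k, ranking, relevant_docs):
    """Fraction of the top k results that were judged relevant."""
    return sum(1 for docno in ranking[:k] if docno in relevant_docs) / k


run_lines = [
    "351 Q0 DOC-A 1 2.5 demo",
    "351 Q0 DOC-B 2 2.1 demo",
    "351 Q0 DOC-C 3 1.7 demo",
    "351 Q0 DOC-D 4 1.0 demo",
]
qrel_lines = ["351 0 DOC-A 1", "351 0 DOC-B 0", "351 0 DOC-C 1"]

run = parse_run(run_lines)
qrels = parse_qrels(qrel_lines)
print(precision_at(2, run["351"], qrels.get("351", set())))  # 0.5
```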

1.6    Guide to the Zettair source code
Zettair is designed to be as clean, simple, flexible and fast as possible. While it is currently a work in progress, Zettair handles simple searching quite well, and has sufficient architecture to be extended in many different directions.

The core search code is separated from the different front-end access methods. The core searching methods are documented in include/index.h.

Zettair also has a number of compile-time configuration options and length limitations. These can all be found and changed in src/include/def.h, although the default settings should suffice for most people.

Apart from reading the source code, if you want to know more about any part of Zettair feel free to contact us at zettair@cs.rmit.edu.au.

1.7    Appendix
If you have any comments or suggestions about the translated document above, you are welcome to email me at yijiyong100@163.com.
