swish-e Code Analysis: Indexing (Part 4)

陆才俊

2023-12-01

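Starting with this section, we describe the core indexing process.

2.3 Core indexing process

For every file to be indexed, swish-e first initializes a FileProp structure, then reads the file content, parses it into words, and so on.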

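2.3.1 The FileProp structure

Each file is passed through the file_properties function, which builds a FileProp structure holding the file's path, size, document type, and so on.

Note:

If the configuration file does not set a document type, the type defaults to HTML. In our configuration file we set IndexContents TXT .txt, so files ending in .txt are treated as TXT.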
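To make the note concrete, here is a minimal sketch of how a document type might be picked from a file suffix with a fallback to HTML. It is not swish-e code: FilePropSketch, ContentRule, pick_doctype, make_file_prop, and the DOC_* constants are all hypothetical names introduced for illustration, and the real FileProp has more fields than shown here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical document types; the names are for illustration only. */
typedef enum { DOC_HTML, DOC_TXT, DOC_XML } DocType;

/* Hypothetical stand-in for FileProp: path, size, and document type. */
typedef struct {
    const char *path;
    long        size;
    DocType     doctype;
} FilePropSketch;

/* One configured rule: files whose name ends in `suffix` get `type`. */
typedef struct {
    const char *suffix;
    DocType     type;
} ContentRule;

/* Return the type of the first rule whose suffix matches the path,
 * or `fallback` when no rule matches. */
DocType pick_doctype(const char *path, const ContentRule *rules,
                     size_t nrules, DocType fallback)
{
    size_t plen = strlen(path);
    for (size_t i = 0; i < nrules; i++) {
        size_t slen = strlen(rules[i].suffix);
        if (slen <= plen && strcmp(path + plen - slen, rules[i].suffix) == 0)
            return rules[i].type;
    }
    return fallback;
}

/* Fill the sketch structure for one file; unmatched files default to HTML,
 * mirroring the default described in the note above. */
FilePropSketch make_file_prop(const char *path, long size,
                              const ContentRule *rules, size_t nrules)
{
    FilePropSketch fp;
    fp.path = path;
    fp.size = size;
    fp.doctype = pick_doctype(path, rules, nrules, DOC_HTML);
    return fp;
}
```

With a single rule mapping ".txt" to DOC_TXT, a path such as "docs/readme.txt" would be classified as TXT, while any other file falls back to HTML.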

2.3.2 The do_index_file function

/***********************************************************************
  -- Start the real indexing process for a file.
  -- This routine will be called by the different indexing methods
  -- (httpd, filesystem, etc.)
  -- The indexed file may be the
  --   - real file on filesystem
  --   - tmpfile or work file (shadow of the real file)
  -- Checks if file has to be send thru filter (file stream)
  -- 2000-11-19 rasc
***********************************************************************/
void do_index_file(SWISH * sw, FileProp * fprop)
{
    /* Note the use of a function pointer here. */
    int (*countwords)(SWISH *sw, FileProp *fprop, FileRec *fi, char *buffer);

    /* -- Read all data, last 1 is flag that we are expecting text only */
    rd_buffer = read_stream(sw, fprop, 1);

    /* just for fun so we can show total bytes shown */
    sw->indexlist->total_bytes += fprop->fsize;

    /* Set which parser to use */
    switch (fprop->doctype)
    {
    case TXT:
        strcpy(strType,"TXT");
        countwords = countwords_TXT;
        break;
    ----------------------------
    /* Now bump the file counter */
    idx->filenum++;
    indexf->header.totalfiles++;
    fi.filenum = idx->filenum;

    /** PARSE **/
    wordcount = countwords(sw, fprop, &fi, rd_buffer);

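do_index_file code snippet (1)

Read the file content with read_stream;

Check the doctype field of the FileProp variable and, depending on the file type, point the countwords function pointer at the matching parser function;

Increment the file counters (idx->filenum in the MOD_Index variable and totalfiles in the index header);

Index the content by calling countwords; we take countwords_TXT as the example.

do_index_file code analysis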
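The dispatch step in the analysis above can be tried in isolation. The following is a simplified sketch, not the swish-e implementation: the parser signature is reduced to a single buffer argument, and SketchDocType, count_words_txt, count_words_html, select_parser, and index_buffer are hypothetical names. It only shows the pattern of choosing a parser through a function pointer and then calling it.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical document types for this sketch. */
typedef enum { SK_TXT, SK_HTML } SketchDocType;

/* Simplified parser signature: takes the buffer, returns the word count. */
typedef int (*CountWordsFn)(const char *buffer);

/* Plain text: every whitespace-separated token is a word. */
int count_words_txt(const char *buffer)
{
    int count = 0, in_word = 0;
    for (const char *p = buffer; *p; p++) {
        if (isspace((unsigned char)*p))
            in_word = 0;
        else if (!in_word) {
            in_word = 1;
            count++;
        }
    }
    return count;
}

/* Very rough HTML variant: text between '<' and '>' is ignored,
 * and a tag also ends the current word. */
int count_words_html(const char *buffer)
{
    int count = 0, in_word = 0, in_tag = 0;
    for (const char *p = buffer; *p; p++) {
        if (*p == '<') {
            in_tag = 1;
            in_word = 0;
        } else if (*p == '>') {
            in_tag = 0;
        } else if (in_tag || isspace((unsigned char)*p)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            count++;
        }
    }
    return count;
}

/* Choose the parser for a document type, as the switch statement does. */
CountWordsFn select_parser(SketchDocType type)
{
    switch (type) {
    case SK_TXT:  return count_words_txt;
    case SK_HTML: return count_words_html;
    }
    return count_words_html; /* fallback for unknown types */
}

/* Bump the file counter, then parse through the selected function pointer. */
int index_buffer(SketchDocType type, const char *buffer, int *file_counter)
{
    CountWordsFn countwords = select_parser(type);
    (*file_counter)++;
    return countwords(buffer);
}
```

The benefit of this design is that the caller never needs to know which parser ran: supporting a new document type means adding one function and one case label.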

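In the configuration file we set the document type to TXT, so the function pointer points to countwords_TXT (in txt.c), which uses indexstring to parse the document into words.

Note:

Here the type is TXT (the simplest type). IN_FILE (value 1) is passed as the structure flag, meaning the words occur only in the body text of the file.

CommonProperties are handled: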

metaID=1; positionMeta=1; /* No metanames in TXT */

return indexstring(sw, buffer, fi->filenum, IN_FILE, 1, &metaID, &positionMeta);

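Because the format is TXT, metaID is set to 1 (TXT files have no metanames).

countwords_TXT code snippet

indexstring reads words by calling next_word. After some checks (for example, whether the word is a stopword), it passes each word to addword, which adds it to the hash table; addword in turn calls the core function addentry.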

static void addword(char *word, SWISH * sw, int filenum, int structure, int numMetaNames, int *metaID, int *word_position)
{
    int i;

    /* Add the word for each nested metaname. */
    for (i = 0; i < numMetaNames; i++)
        (void) addentry(sw, getentry(sw,word), filenum, structure, metaID[i], *word_position);

    (*word_position)++;
}
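The flow described above (read a word, filter it, hand it on together with its position) can be sketched as follows. This is an illustration under stated assumptions, not the swish-e code: next_word_sketch, is_stopword, index_string_sketch, WordSink, SinkLog, record_word, and the stopword list are all hypothetical, words are assumed to be runs of alphanumeric characters, and the sketch assumes that a skipped stopword still consumes a position.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD_LEN 63

/* Hypothetical stopword list, for illustration only. */
static const char *const STOPWORDS[] = { "a", "an", "the", "of" };

int is_stopword(const char *word)
{
    for (size_t i = 0; i < sizeof STOPWORDS / sizeof STOPWORDS[0]; i++)
        if (strcmp(word, STOPWORDS[i]) == 0)
            return 1;
    return 0;
}

/* Copy the next lowercased alphanumeric word into `out` and advance
 * *cursor past it. Returns 1 if a word was read, 0 at end of buffer.
 * Words longer than out_size - 1 characters are truncated. */
int next_word_sketch(const char **cursor, char *out, size_t out_size)
{
    const char *p = *cursor;
    size_t n = 0;

    while (*p && !isalnum((unsigned char)*p))
        p++;                                   /* skip separators */
    while (*p && isalnum((unsigned char)*p)) {
        if (n + 1 < out_size)
            out[n++] = (char)tolower((unsigned char)*p);
        p++;
    }
    out[n] = '\0';
    *cursor = p;
    return n > 0;
}

/* Callback that receives each accepted word (plays the role of addword). */
typedef void (*WordSink)(const char *word, int position, void *ctx);

/* Tokenize the buffer, drop stopwords, and pass the rest to the sink.
 * Returns the number of words passed on. */
int index_string_sketch(const char *buffer, WordSink sink, void *ctx)
{
    char word[MAX_WORD_LEN + 1];
    int position = 1, indexed = 0;

    while (next_word_sketch(&buffer, word, sizeof word)) {
        if (!is_stopword(word)) {
            sink(word, position, ctx);
            indexed++;
        }
        position++;   /* assumption: stopwords also consume a position */
    }
    return indexed;
}

/* Example sink that remembers the last word it saw. */
typedef struct {
    int  count;
    int  last_position;
    char last_word[MAX_WORD_LEN + 1];
} SinkLog;

void record_word(const char *word, int position, void *ctx)
{
    SinkLog *log = (SinkLog *)ctx;
    log->count++;
    log->last_position = position;
    strcpy(log->last_word, word);   /* word fits: both are MAX_WORD_LEN + 1 */
}
```

For the input "The Art of War", this sketch passes on "art" at position 2 and "war" at position 4, and skips the two stopwords.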

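2.3.3 getentry: looking up a word entry

Before a word entry is added to the hash table, we must first check whether the table already contains that word. This is handled by getentry.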

ENTRY *getentry(SWISH * sw, char *word)
{
    IndexFILE *indexf = sw->indexlist;
    struct MOD_Index *idx = sw->Index;
    int hashval;
    ENTRY *e;

    if (!idx->entryArray)
    {
        idx->entryArray = (ENTRYARRAY *) emalloc(sizeof(ENTRYARRAY));
        idx->entryArray->numWords = 0;
        idx->entryArray->elist = NULL;
    }

    /* Compute hash value of word */
    hashval = verybighash(word);

    /* Look for the word in the hash array */
    for (e = idx->hashentries[hashval]; e; e = e->next)
        if (strcmp(e->word, word) == 0)
            break;

    /* flag hash entry used this file, so that the locations can be "compressed" in do_index_file */
    idx->hashentriesdirty[hashval] = 1;

    /* Word found, return it */
    if (e)
        return e;

    /* Word not found, so create a new word */
    e = (ENTRY *) Mem_ZoneAlloc(idx->entryZone, sizeof(ENTRY) + strlen(word));
    strcpy(e->word, word);
    e->next = idx->hashentries[hashval];
    idx->hashentries[hashval] = e;

    /* Init values */
    e->tfrequency = 0;
    e->u1.last_filenum = 0;
    e->currentlocation = NULL;
    e->currentChunkLocationList = NULL;
    e->allLocationList = NULL;

    idx->entryArray->numWords++;
    indexf->header.totalwords++;

    return e;
}

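getentry code snippet

Create the entryArray structure if it does not exist yet (later, all parsed entries in the hash table are placed into this array, which is then sorted);

Compute the hash value of the word;

Look the word up in the idx->hashentries hash table and set the corresponding flag in the hashentriesdirty array;

If the word is found, return its entry directly;

If it is not found, allocate a new entry, link it into the hashentries table, initialize its frequency and location fields, and increment the word counters (idx->entryArray->numWords and indexf->header.totalwords).

getentry code analysis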
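The lookup-or-create pattern analyzed above can be reproduced in a small standalone program. The sketch below is an approximation under assumptions: it uses malloc instead of the zone allocator, a simple multiplicative hash (hash_word) in place of verybighash, whose definition is not shown in this article, and a C99 flexible array member for the word. Entry, WordTable, get_entry_sketch, and free_table are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1009

/* Hash chain node; the word is stored inline after the fixed fields. */
typedef struct Entry {
    struct Entry *next;
    int           tfrequency;
    char          word[];      /* flexible array member */
} Entry;

typedef struct {
    Entry        *buckets[TABLE_SIZE];
    unsigned char dirty[TABLE_SIZE];  /* bucket touched during this file */
    int           total_words;
} WordTable;

/* Stand-in hash function; NOT the verybighash used by swish-e. */
unsigned hash_word(const char *word)
{
    unsigned h = 0;
    while (*word)
        h = h * 31u + (unsigned char)*word++;
    return h % TABLE_SIZE;
}

/* Return the entry for `word`, creating it if it is not present.
 * Returns NULL only if memory allocation fails. */
Entry *get_entry_sketch(WordTable *t, const char *word)
{
    unsigned h = hash_word(word);
    Entry *e;

    t->dirty[h] = 1;           /* mark the bucket as used */

    for (e = t->buckets[h]; e; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e;          /* word found */

    /* Word not found: allocate the node plus room for the string. */
    e = (Entry *)malloc(sizeof *e + strlen(word) + 1);
    if (!e)
        return NULL;
    strcpy(e->word, word);
    e->tfrequency = 0;

    /* Push onto the front of the bucket's chain. */
    e->next = t->buckets[h];
    t->buckets[h] = e;
    t->total_words++;
    return e;
}

/* Release every entry and reset the table. */
void free_table(WordTable *t)
{
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        Entry *e = t->buckets[i];
        while (e) {
            Entry *next = e->next;
            free(e);
            e = next;
        }
        t->buckets[i] = NULL;
        t->dirty[i] = 0;
    }
    t->total_words = 0;
}
```

Looking up the same word twice returns the same pointer and counts the word only once, which is the property the indexer relies on when it later attaches locations to an entry.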
