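I have been working for almost two years now. Today my manager dug out last year's regain search project again and asked me to get it sorted out and running as quickly as possible. When I first touched it I was a kid fresh out of school; I fiddled with it for a month without getting it to work and finally handed it over to my manager. Now that it is back in my hands and I have some time, I translated the comments in its configuration files myself:
There are really two main configuration files: CrawlerConfiguration.xml (used when building the index) and SearchConfiguration.xml (used when searching the index).
Download: http://regain.sourceforge.net/download.php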
CrawlerConfiguration.xml
<?xml version="1.0" encoding="GBK"?>
<!DOCTYPE configuration [
<!ENTITY amp "&#38;#38;">
<!ENTITY lt "&#38;#60;">
<!ENTITY minus "-">
]>
<!--
| Configuration for the regain crawler (for creating a search index)
|Translation: the configuration file for the regain crawler, which is used to create the search index.
| You can find a detailed description of all configuration tags here:
| http://regain.murfman.de/wiki/en/index.php/CrawlerConfiguration.xml
|Translation: a detailed description of every tag in this file is available at the URL above.
| You can find more configuration examples in the CrawlerConfiguration_examples.xml.
|Translation: more examples are in the file CrawlerConfiguration_examples.xml.
+-->
<configuration>
<!--
| Enter your HTTP proxy settings here (Look at the preferences of your browser)
|Translation: enter your HTTP proxy settings here; you can look them up in your browser's preferences.
+-->
<proxy>
<!--
<host>proxy</host>
<port>3128</port>
<user>HansWurst</user>
<password>gkxy23</password>
-->
</proxy>
<!--
| The list of URLs where the spidering will start.
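|Translation: the list of URLs from which spidering starts.
| Enter the start page of your web site resp. a file system folder here.
|Translation: enter the start page of your web site, or a file system folder, here; spidering begins from it.
| NOTE: The examples are in a comment. Thus, if you add your path in one of
| them, then don't forget to uncomment them.
|Translation: note that the examples are commented out, so if you put your own path into one of them, remember to remove the comment markers.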
+-->
<startlist>
<!-- Directory parsing -->
<!--
<start parse="true" index="false">file://c:/Eigene Dateien</start>
Set the directory where the documents to be indexed are located.
file://E:/eclipse 3.2/workspace/SIS/WebRoot/FileDepository ${SEARCHDIR}
-->
<start index="false" parse="true">file://${WORKDIR}FileDepository</start>
<!-- HTML parsing -->
<!--
<start parse="true" index="true">http://www.mydomain.de/some/path/</start>
-->
</startlist>
<!--
| The whitelist containing prefixes an URL must have to be processed
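|Translation: the whitelist holds the prefixes a URL must have in order to be processed.
| Enter the domain of your web site here.
|Translation: enter the domain of your web site here.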
+-->
<whitelist>
<prefix>file://</prefix>
</whitelist>
<!--
| The blacklist containing prefixes an URL must NOT have to be processed
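|Translation: the blacklist holds the prefixes a URL must NOT have in order to be processed.
| Enter sub directories you don't want to be indexed here.
|Translation: enter the sub directories you do not want indexed here.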
+-->
<blacklist>
<!--
<prefix>http://www.mydomain.de/some/dynamic/content/</prefix>
<regex>/backup/[^/]*$</regex>
-->
</blacklist>
<!--
| ==================================================================================
| That's all you have to configure! The rest of this file is advanced configuration.
|Translation: everything you need to configure is above; the rest of this file is advanced configuration.
| ==================================================================================
+-->
<!--
| The preferences for the search index.
|Translation: the settings for the search index.
+-->
<searchIndex>
<!--
The directory where the index should be located ${SEARCHDIR}
Translation: the directory where the index should be placed.
-->
<dir>${SEARCHDIR}searchindex</dir>
<!--
| Specifies the analyzer type to use.
| Translation: specifies which analyzer type to use.
| You may specify the class name of the analyzer or you use one of the
| following aliases:
| * english: For the english language
| (alias for org.apache.lucene.analysis.standard.StandardAnalyzer)
| * german: For the german language
| (alias for org.apache.lucene.analysis.de.GermanAnalyzer)
| Translation: you may give the analyzer's class name or pick one of these aliases:
| english: for English text, alias for org.apache.lucene.analysis.standard.StandardAnalyzer
| german: for German text, alias for org.apache.lucene.analysis.de.GermanAnalyzer
+-->
<analyzerType>english</analyzerType>
<!--
<analyzerType>german</analyzerType>
<analyzerType>chinese</analyzerType>
<analyzerType>paoding</analyzerType>
-->
<!--
| Contains all words that should not be indexed.
| Separate the words by a blank.
|Translation: contains all words that should not be indexed; separate the words with blanks.
+-->
<stopwordList>
einer eine eines einem einen der die das dass daß du er sie es was wer wie
wir und oder ohne mit am im in aus auf ist sein war wird ihr ihre ihres als
für von mit dich dir mich mir mein sein kein durch wegen wird
</stopwordList>
<!-- italian:
<stopwordList>
di a da in con su per tra fra io tu egli ella essa noi voi essi loro che cui
se e né anche inoltre neanche o ovvero oppure ma però eppure anzi invece
bensì tuttavia quindi dunque perciò pertanto cioè infatti ossia non come
mentre perché quando mio mia miei mie tuo tua tuoi tue suo sua suoi sue
nostro nostre nostri nostre vostro vostre vostri vostre il lo la i gli le un
uno una degli delle alcuno alcuna alcune qualcuno qualcuna nessuno nessuna
molto molte molti molte poco parecchio assai
</stopwordList>
-->
<!--
| Contains all words that should not be changed by an analyser when indexed.
| Separate the words by a blank.
|Translation: contains all words the analyzer must leave unchanged when indexing; separate the words with blanks.
+-->
<exclusionList></exclusionList>
<!--
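| The names of the fields of which to prefetch the distinct values.
| Separate the field names by a blank.
|Translation: the names of the fields whose distinct values should be prefetched; separate the field names with blanks.
| Put in the names of the fields you use a search:input_fieldlist tag for.
| The values shown in the list will then be extracted by the crawler and not
| by the search mask, which prevents a slow first loading of a page for huge
| indexes.
|Translation: list the fields for which you use a search:input_fieldlist tag. The crawler, not the search mask, then extracts the values shown in that list, which avoids a slow first page load on huge indexes.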
+-->
<valuePrefetchFields>mimetype</valuePrefetchFields>
<!--
| Specifies whether the whole content should be stored in the index for the
| purpose of a content preview
|Translation: specifies whether the whole content is stored in the index so that a content preview can be shown.
+-->
<storeContentForPreview>true</storeContentForPreview>
</searchIndex>
<!--
| The preparators in the order they should be applied. Preparators that aren't listed
| here will be applied after the listed ones.
|Translation: the preparators in the order in which they are applied; preparators not listed here are applied after the listed ones.
| You can use this list...
| ... to define the priority (= order) of the preparators
| ... to disable preparators
| ... to configure preparators
|Translation: the list sets the priority (= order) of the preparators, disables preparators, or configures them.
+-->
<preparatorList>
<!--
| Enable this preparator if you want to use the text extractor of
| Microsoft Windows. This preparator is able to read tons of file formats.
|Translation: enable this preparator if you want to use the Microsoft Windows text extractor; it can read a great many file formats.
| NOTE: Under Windows 2000 you have to make sure that reg.exe is installed
| (It's part of the "Support Tools").
| For details see: http://support.microsoft.com/kb/301423
|Translation: on Windows 2000 you must make sure that reg.exe is installed (it is part of the "Support Tools").
|For details see http://support.microsoft.com/kb/301423
+-->
<preparator enabled="false">
<class>.IfilterPreparator</class>
</preparator>
<!--
| Enable this preparator if you want to use MS Excel for indexing your Excel
| documents.
|Translation: enable this preparator if you want MS Excel itself to be used for indexing your Excel documents.
+-->
<preparator enabled="false">
<class>.JacobMsExcelPreparator</class>
</preparator>
<!--
| Enable this preparator if you want to use MS Word for indexing your Word
| documents.
|Translation: enable this preparator if you want MS Word itself to be used for indexing your Word documents.
+-->
<preparator enabled="false">
<class>.JacobMsWordPreparator</class>
</preparator>
<!--
| Enable this preparator if you want to use MS Powerpoint for indexing your
| Powerpoint documents.
|Translation: enable this preparator if you want MS Powerpoint itself to be used for indexing your Powerpoint documents.
+-->
<preparator enabled="false">
<class>.JacobMsPowerPointPreparator</class>
</preparator>
<!--
| This tells regain that it should first try the SimpleRtfPreparator for RTF
| files. Only if this one fails the SwingRtfPreparator is used
| (which is much slower).
|Translation: this tells regain to try SimpleRtfPreparator first for RTF files and to fall back to
|SwingRtfPreparator, which is much slower, only if that one fails.
+-->
<preparator>
<class>.SimpleRtfPreparator</class>
</preparator>
<preparator>
<class>.SwingRtfPreparator</class>
</preparator>
<!--
| This preparator may be used if you have an external program that can
| extract text. It's disabled by default.
|Translation: use this preparator if you have an external program that can extract text; it is disabled by default.
+-->
<preparator enabled="false">
<class>.ExternalPreparator</class>
<config>
<section name="command">
<param name="urlPattern">\.ps$</param>
<param name="commandLine">ps2ascii ${filename}</param>
<param name="checkExitCode">false</param>
</section>
</config>
</preparator>
<!--
CatchAll-preparator on basis of EmptyPreparator
Translation: a catch-all preparator built on EmptyPreparator.
-->
<preparator priority="-10">
<class>.EmptyPreparator</class>
<urlPattern>.*</urlPattern>
</preparator>
</preparatorList>
<!--
| The index may be extended with auxiliary fields. These are fields that have
| been generated from the URL of a document.
| Translation: the index can be extended with auxiliary fields, i.e. fields derived from a document's URL.
| Example: If you have a directory with a sub directory for every project,
| then you may create a field with the project's name.
| Translation: for example, if you have a directory with one sub directory per project, you can create a field holding the project's name.
| The following tag will create a field "project" with the value "otto23"
| from the URL "file://c:/projects/otto23/docs/Spez.doc":
|Translation: the tag below creates a field named "project" with the value "otto23"
| from the URL "file://c:/projects/otto23/docs/Spez.doc":
| <auxiliaryField name="project" regexGroup="1">
| <regex>^file://c:/projects/([^/]*)</regex>
| </auxiliaryField>
|
| URLs that don't match will get no "project" field.
|Translation: URLs that do not match get no "project" field.
| Having done this you may search for "Offer project:otto23" and you will get
| only hits from this project directory.
|Translation: once this is set up, a search for "Offer project:otto23" returns hits from that project directory only.
+-->
<auxiliaryFieldList>
<!--
Don't change these two fields. But you may add your own.
Translation: do not change these fields, but you may add your own.
-->
<auxiliaryField name="extension" regexGroup="1" toLowercase="true">
<regex>\.([^\.]*)$</regex>
</auxiliaryField>
<auxiliaryField name="location" regexGroup="1" store="false" tokenize="true">
<regex>^(.*)$</regex>
</auxiliaryField>
<auxiliaryField name="mimetype" regexGroup="1" >
<regex>^()$</regex>
</auxiliaryField>
</auxiliaryFieldList>
<!-- The regular expressions that identify URLs in HTML. -->
<!-- This configuration part is no longer necessary -->
<!--htmlParserPatternList>
<pattern parse="true" index="true" regexGroup="1">="([^"]*(/|htm|html|jsp|php\d?|asp))"</pattern>
<pattern parse="false" index="false" regexGroup="1">="([^"]*\.(js|css|jpg|gif|png))"</pattern>
<pattern parse="false" index="true" regexGroup="1">="([^"]*\.[^\."]{3})"</pattern>
</htmlParserPatternList-->
</configuration>
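To get a feel for what the auxiliaryField entries above do, here is a small Python sketch. It is not regain code (regain is written in Java): the class name `AuxiliaryField` and the use of an unanchored search are my own assumptions, chosen so that the behaviour matches the "otto23" example in the comment.

```python
import re
from typing import Optional


class AuxiliaryField:
    """Hypothetical stand-in for one <auxiliaryField> entry (not regain's API)."""

    def __init__(self, name: str, regex: str,
                 regex_group: int = 1, to_lowercase: bool = False) -> None:
        self.name = name
        self.pattern = re.compile(regex)
        self.regex_group = regex_group
        self.to_lowercase = to_lowercase

    def extract(self, url: str) -> Optional[str]:
        """Return the field value for url, or None when the regex does not match."""
        # Assumption: the regex only has to match somewhere in the URL.
        match = self.pattern.search(url)
        if match is None:
            return None
        value = match.group(self.regex_group)
        return value.lower() if self.to_lowercase else value


# The "project" example from the comment and the "extension" field from the config.
project = AuxiliaryField("project", r"^file://c:/projects/([^/]*)")
extension = AuxiliaryField("extension", r"\.([^\.]*)$", to_lowercase=True)

url = "file://c:/projects/otto23/docs/Spez.doc"
print(project.extract(url))                           # otto23
print(extension.extract(url))                         # doc
print(project.extract("file://c:/other/readme.txt"))  # None
```

A URL outside c:/projects yields None, which corresponds to the document getting no "project" field at all.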
Next is SearchConfiguration.xml:
<?xml version="1.0" encoding="GBK"?>
<!DOCTYPE configuration [
<!ENTITY amp "&#38;#38;">
<!ENTITY lt "&#38;#60;">
]>
<!--
| Configuration for the regain search mask.
|Translation: the configuration file for the regain search mask.
|
| Normally you only have to specify the directory where the search index is
| located. You do this in the <dir> tag of the <index name="main"> (line 74).
|Translation: normally you only need to specify the directory holding the search index, which you do
|in the <dir> tag under <index name="main">.
| You can find a detailed description of all configuration tags here:
|Translation: a detailed description of all configuration tags is available at the following URL:
| http://regain.murfman.de/wiki/en/index.php/SearchConfiguration.xml
+-->
<configuration>
<!-- The search indexes -->
<indexList>
<!--
| All settings defined in this section are applied to all indexes unless
| they redefine the setting.
|Translation: every setting in this section applies to all indexes unless an index redefines it.
+-->
<defaultSettings>
<!--
1 <defaultSettings>: The cascaded default settings
2<index>: The settings for one index.
-->
<!--
| The regular expression that identifies URLs that should be opened in
| a new window.
| Translation: the regular expression identifying URLs that should open in a new window.
+-->
<openInNewWindowRegex>.(pdf|rtf|doc|xls|ppt)$</openInNewWindowRegex>
<!--
| Specifies whether the file-to-http-bridge should be used for file-URLs.
|Translation: specifies whether the file-to-http-bridge is used for file-URLs.
| Mozilla browsers have a security mechanism that blocks loading file-URLs
| from pages loaded via http. To be able to load files from the search
| results, regain offers the file-to-http-bridge that provides all files that
| are listed in the index via http.
|Translation: Mozilla browsers block loading file-URLs from pages that were loaded via http. So that files can be opened from the search results, regain offers the file-to-http-bridge, which serves every file listed in the index via http.
+-->
<useFileToHttpBridge>true</useFileToHttpBridge>
<!--
| The index fields to search by default.
|Translation: the index fields searched by default.
| NOTE: The user may search in other fields also using the
| "field:"-operator. Read the lucene query syntax for details:
| http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
|Translation: note that the user can also search other fields with the "field:" operator; see the Lucene query syntax for details:
|http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
+-->
<searchFieldList>content title headlines location filename</searchFieldList>
<!--
| The SearchAccessController to use.
| Translation: the SearchAccessController to use.
| This is a part of the access control system that ensures that only those
| documents are shown in the search results that the user is allowed to
| read.
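|Translation: this is the part of the access control system that makes sure the search results show only documents the user is allowed to read.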
| If you specify a SearchAccessController, don't forget to specify the
| CrawlerAccessController counterpart in the CrawlerConfiguration.xml!
|Translation: if you specify a SearchAccessController, make sure you also specify its
|CrawlerAccessController counterpart in CrawlerConfiguration.xml.
+-->
<!--
<searchAccessController>
<class jar="myAccess.jar">mypackage.MySearchAccessController</class>
<config>
<param name="bla">blubb</param>
</config>
</searchAccessController>
-->
<!--
|
| Specifies whether the search terms should be highlighted within the
| search results (summary, title)
|Translation: specifies whether the search terms are highlighted in the search results (summary, title).
+-->
<Highlighting>true</Highlighting>
</defaultSettings>
<!-- The search index 'main' -->
<index name="main" default="true" isparent="true">
<!--
The directory where the index is located
Translation: the directory where the index is located.
-->
<dir>${SEARCHDIR}searchindex</dir>
</index>
<!--
| A child index of 'main'
|Translation: a child index of 'main'.
+-->
<!--
<index name="main1" default="true" isparent="false" parent="main">
<dir>searchindex_1</dir>
</index>
-->
<!-- The search index 'example' -->
<index name="example">
<!-- The directory where the index is located -->
<dir>c:\Temp\searchindex_example</dir>
<rewriteRules>
<rule prefix="file://c:/example/www-data" replacement="http://www.mydomain.de"/>
</rewriteRules>
</index>
</indexList>
</configuration>
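The rewriteRules and openInNewWindowRegex settings are easier to picture with a concrete URL. The Python sketch below is my own illustration, not regain's implementation: the function names are made up, and I am assuming that the first rule whose prefix matches wins and that the regex only needs to match somewhere in the URL.

```python
import re
from typing import List, Tuple

# Pattern copied from <openInNewWindowRegex> in the configuration above.
OPEN_IN_NEW_WINDOW = re.compile(r".(pdf|rtf|doc|xls|ppt)$")


def rewrite_url(url: str, rules: List[Tuple[str, str]]) -> str:
    """Swap a matching prefix for its replacement (hypothetical helper).

    Assumption: the first rule whose prefix matches is applied and the rest
    are ignored; URLs that match no rule are returned unchanged.
    """
    for prefix, replacement in rules:
        if url.startswith(prefix):
            return replacement + url[len(prefix):]
    return url


def opens_in_new_window(url: str) -> bool:
    """True when the URL matches the open-in-new-window pattern."""
    return OPEN_IN_NEW_WINDOW.search(url) is not None


# The single rule of the 'example' index.
rules = [("file://c:/example/www-data", "http://www.mydomain.de")]

indexed = "file://c:/example/www-data/docs/offer.pdf"
shown = rewrite_url(indexed, rules)
print(shown)                       # http://www.mydomain.de/docs/offer.pdf
print(opens_in_new_window(shown))  # True
print(opens_in_new_window("http://www.mydomain.de/index.html"))  # False
```

So a file crawled from the local www-data folder would be presented in the search results under its public http address.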