Installing regain
娄嘉石
2023-12-01
I. Adding Paoding-analysis as the Chinese word segmentation module
This is very simple: only one source file has to be modified.
Source file: src\net\sf\regain\RegainToolkit.java
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
...
public static Analyzer createAnalyzer(String analyzerType,
    String[] stopWordList, String[] exclusionList, String[] untokenizedFieldNames)
    throws RegainException
{
    ...
    if (analyzerType.equalsIgnoreCase("english")) {
        analyzerClassName = StandardAnalyzer.class.getName();
    } else if (analyzerType.equalsIgnoreCase("german")) {
        analyzerClassName = GermanAnalyzer.class.getName();
    } else if (analyzerType.equalsIgnoreCase("chinese")) {
        analyzerClassName = ChineseAnalyzer.class.getName(); // Added by ping.
    } else if (analyzerType.equalsIgnoreCase("paoding")) {
        analyzerClassName = PaodingAnalyzer.class.getName(); // Added by ping.
    }
    ...
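The branch chain above is essentially a case-insensitive lookup from an analyzer type name to an analyzer class name. A minimal standalone sketch of that idea follows. AnalyzerTypeResolver and resolve are hypothetical names for illustration, not part of regain, and the sketch deliberately stores class names as strings so that it runs without the Lucene and Paoding jars.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of regain: maps an analyzer type name
// to the fully qualified class name of the analyzer to instantiate.
final class AnalyzerTypeResolver {
    private static final Map<String, String> ANALYZERS = new LinkedHashMap<>();
    static {
        ANALYZERS.put("english", "org.apache.lucene.analysis.standard.StandardAnalyzer");
        ANALYZERS.put("german", "org.apache.lucene.analysis.de.GermanAnalyzer");
        ANALYZERS.put("chinese", "org.apache.lucene.analysis.cn.ChineseAnalyzer");
        ANALYZERS.put("paoding", "net.paoding.analysis.analyzer.PaodingAnalyzer");
    }

    // Case-insensitive lookup, in the spirit of the equalsIgnoreCase chain.
    // Returns null for an unknown type so the caller can decide what to do.
    static String resolve(String analyzerType) {
        if (analyzerType == null) {
            return null;
        }
        return ANALYZERS.get(analyzerType.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(resolve("Paoding"));
    }
}
```

With a table like this, adding another analyzer would be one more put call rather than one more else-if branch; whether that is worth a patch to regain itself is a matter of taste.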
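Only the one source file above has to change, but a complete build and a working installation need a few further changes, mainly:
1. modify the ant build file build.xml;
2. copy paoding-analysis.jar into the lib directory.
The changes to build.xml are as follows:
[Only the modified fragments are excerpted here. The additions were set in bold in the original; they are the paoding include lines that follow each "Add by ping." comment.]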
...
<target name="runtime-desktop" depends="prepare-once, runtime-desktop-fast">
<echo message="Creating the jars ..." />
<fileset id="desktop-common-jars" dir="build/included-lib-classes/common">
<include name="org/apache/lucene/**"/>
<include name="org/apache/log4j/**"/>
<include name="org/apache/regexp/**"/>
<!-- Add by ping. -->
<include name="net/paoding/analysis/**"/>
<include name="paoding-*.properties"/>
<include name="org/apache/commons/**"/>
...
<target name="runtime-server" depends="prepare-once, runtime-server-fast, -web-temps">
<jar jarfile="build/runtime/crawler/${programname.file}-crawler.jar"
compress="false"
index="true">
<manifest>
<attribute name="Main-Class" value="net.sf.regain.crawler.Main"/>
</manifest>
<fileset dir="build/included-lib-classes/common">
<include name="org/apache/lucene/**"/>
<include name="org/apache/log4j/**"/>
<include name="org/apache/regexp/**"/>
<!-- Add by ping. -->
<include name="net/paoding/analysis/**"/>
<include name="paoding-*.properties"/>
<include name="org/apache/commons/**"/>
...
<mkdir dir="build/runtime/search/webapps"/>
<war destfile="build/runtime/search/webapps/${programname.file}.war"
webxml="web/server/web-inf/web.xml">
<classes dir="build/classes">
<exclude name="net/sf/regain/crawler/**"/>
<exclude name="net/sf/regain/ui/desktop/**"/>
<exclude name="net/sf/regain/util/sharedtag/simple/**"/>
<exclude name="net/sf/regain/util/ui/**"/>
</classes>
<lib dir="lib">
<include name="lucene-*.jar"/>
<include name="jakarta-regexp-*.jar"/>
<include name="log4j-*.jar"/>
<!--Add by ping.-->
<include name="paoding-*.jar"/>
<include name="commons-logging*.jar"/>
</lib>
...
<mkdir dir="${deploy-target.dir}/${programname.file}/WEB-INF/lib"/>
<copy todir="${deploy-target.dir}/${programname.file}/WEB-INF/lib">
<fileset dir="lib">
<include name="lucene-*.jar"/>
<include name="jakarta-regexp-*.jar"/>
<include name="log4j-*.jar"/>
<!--Add by ping.-->
<include name="paoding-*.jar"/>
<include name="commons-logging*.jar"/>
</fileset>
</copy>
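II. Changing the length of search result fragments
1. By default, each fragment shown in the search results is 100 characters long,
which I find rather short; it can be raised to 300.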
lucene\contrib\highlighter\src\java
org.apache.lucene.search.highlight
SimpleFragmenter.java
public class SimpleFragmenter implements Fragmenter
{
private static final int DEFAULT_FRAGMENT_SIZE =100*3;
This constant defines the length of a result fragment: the default of 100 characters is changed to 300.
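To see what the larger value buys, here is a rough illustration of how a fragment size translates into fragment count and length. It is not Lucene's SimpleFragmenter (which works on token offsets); FragmentSizeDemo and split are hypothetical names, and the text is simply cut at fixed character positions. As far as I know, SimpleFragmenter also has a constructor that takes the fragment size, which may be an alternative to patching the Lucene source.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: cuts text into fixed-size pieces of fragmentSize characters.
// This is NOT how Lucene's SimpleFragmenter is implemented.
final class FragmentSizeDemo {
    static List<String> split(String text, int fragmentSize) {
        if (fragmentSize <= 0) {
            throw new IllegalArgumentException("fragmentSize must be positive");
        }
        List<String> fragments = new ArrayList<>();
        for (int start = 0; start < text.length(); start += fragmentSize) {
            int end = Math.min(text.length(), start + fragmentSize);
            fragments.add(text.substring(start, end));
        }
        return fragments;
    }

    public static void main(String[] args) {
        // A 650-character dummy document.
        String text = new String(new char[650]).replace('\0', 'x');
        System.out.println("size 100: " + split(text, 100).size() + " fragments");
        System.out.println("size 300: " + split(text, 300).size() + " fragments");
    }
}
```

Fewer but longer fragments mean each displayed snippet carries more surrounding context, which is the point of the change above.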
III. In addition, a few small changes to the search result page
1. package net.sf.regain.search.results;
SingleSearchResults.java
public void highlightHitDocument(int index)
resHighlSummary = highlighter.getBestFragments(tokenStream, text, 3,
" . . . . . . <br><span class=\"resultTag\">[Result]</span> ");
This defines how the highlighted result text is displayed.
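The last two arguments of getBestFragments are the maximum number of fragments and the separator placed between them. The following sketch mimics only that joining step so the effect of the separator string is easy to see. FragmentJoinDemo and join are hypothetical names, and the fragment texts are made up; the real method also does the scoring and highlighting.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: joins at most maxNumFragments fragments with a separator,
// which is the visible effect of the last two getBestFragments arguments.
final class FragmentJoinDemo {
    static String join(List<String> fragments, int maxNumFragments, String separator) {
        StringBuilder out = new StringBuilder();
        int count = Math.min(maxNumFragments, fragments.size());
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                out.append(separator); // separator goes between fragments only
            }
            out.append(fragments.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList("first hit", "second hit", "third hit", "fourth hit");
        String separator = " . . . . . . <br><span class=\"resultTag\">[Result]</span> ";
        System.out.println(join(fragments, 3, separator));
    }
}
```

Because the separator only appears between fragments, the first fragment gets no [Result] tag from this call, so the tag for it has to come from the page itself.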
2.web\web\common
search.jsp
<search:list msgNoResults="<tr><td colspan='2'>{msg:noResultsFound}<br/><br/></td></tr>">
<tr><td colspan="2">
<search:hit_typeicon imgpath="img/ext"/> <search:hit_link/>
<span class="hitDetails">
(<search:msg key="relevance"/>: <search:hit_score/>)<br/>
<span class="resultTag">[Result]</span>
<search:hit_field field="summary"/><br/>
<search:hit_content/>
<search:hit_path after="<br/>" createLinks="true"/>
<search:hit_field field="mimetype"/>
<span class="hitInfo"><search:hit_url beautified="true"/> - <search:hit_size/></span><br/>
<br/></span>
</td></tr>
</search:list>
This defines the search result page and the data fields it displays.
3. Add a display style
src\web\common
regain.css
.resultTag {
color: #0000FF;
font-weight: bold;
}
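4. A small cosmetic touch: the label of the button that fetches the document content is German by default; translate it into English.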
src/net/sf/regain/search/sharedlib/hit/ContentTag.java
protected void printEndTag(PageRequest request, PageResponse response,
    Document hit, int hitIndex)
    throws RegainException {
    String content = null;
    content = hit.get("content");
    if (content != null) {
        String hitNumber = Integer.toString(hitIndex + 1);
        response.print("<input type=\"button\" class=\"button\" onclick=\"return toggleMe('hit_" +
            hitNumber + "')\" value=\"Click here Get " + hitNumber + " content\">");
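When copying this kind of markup from a web page, it is worth checking that the event attribute is the plain ASCII word onclick, since look-alike characters are easy to miss and the browser would silently ignore the attribute. A small sketch of the string being built, with such a check, might look like this. ToggleButtonDemo, buildButton, and isAscii are hypothetical names; the markup itself is taken from the snippet above.

```java
// Illustration only: builds the same button markup as the snippet above
// and verifies that it contains nothing but ASCII characters.
final class ToggleButtonDemo {
    static String buildButton(int hitIndex) {
        String hitNumber = Integer.toString(hitIndex + 1); // hits are shown 1-based
        return "<input type=\"button\" class=\"button\" onclick=\"return toggleMe('hit_"
            + hitNumber + "')\" value=\"Click here Get " + hitNumber + " content\">";
    }

    // True if every character is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String html = buildButton(0);
        System.out.println(html);
        System.out.println("ASCII only: " + isAscii(html));
    }
}
```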
Properties file
Dictionary files
Encoding issues
Adding Paoding Chinese word segmentation to regain, and settings for the server version
Original source: http://monner.iteye.com/blog/254804
———————————————————————-
Supplement:
To use Paoding Chinese word segmentation, first set up the dictionary:
vi /etc/profile
export PAODING_DIC_HOME=/data/paoding/dic
Copy the contents of Paoding's dic directory to /data/paoding/dic.
For Windows, see the manual.
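Before starting the crawler it can help to confirm that the process really sees PAODING_DIC_HOME and that it points at a directory. The following is a minimal sketch of such a check; PaodingDicHomeCheck and isUsableDicHome are hypothetical names, and the check only tests that the directory exists, not that the dictionary files inside it are complete.

```java
import java.io.File;

// Illustration only: checks that a dictionary home path is set and is a directory.
final class PaodingDicHomeCheck {
    static boolean isUsableDicHome(String path) {
        if (path == null || path.trim().isEmpty()) {
            return false; // variable not set, or set to an empty value
        }
        return new File(path).isDirectory();
    }

    public static void main(String[] args) {
        // Reads the same environment variable that is exported in /etc/profile.
        String dicHome = System.getenv("PAODING_DIC_HOME");
        System.out.println("PAODING_DIC_HOME=" + dicHome);
        System.out.println("usable: " + isUsableDicHome(dicHome));
    }
}
```

Note that a variable exported in /etc/profile is only visible to processes started from a login shell opened after the change.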
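Also copy the lucene-memory jar from lucene/contrib/memory into regain/lib, then rebuild.
The server version has one problem that needs fixing. If query strings come out garbled, try changing
src/net/sf/regain/search/SearchToolkit.java
to the following: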
    queryString = query.toString().trim();
    //add by robin
    try {
        queryString = new String(queryString.getBytes("iso-8859-1"), "UTF-8");
    } catch (Exception e) {
    }
    request.setContextAttribute(SEARCH_QUERY_CONTEXT_ATTR_NAME, queryString);
}
return queryString;
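The fix above works because decoding bytes as ISO-8859-1 is lossless: every byte maps to exactly one character, so the original UTF-8 bytes can be recovered and decoded again properly. A standalone sketch of that round trip follows. QueryEncodingDemo and redecode are hypothetical names, and the sample string is written with Unicode escapes so the source file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: repairs a string whose UTF-8 bytes were wrongly
// decoded as ISO-8859-1, as can happen with servlet request parameters.
final class QueryEncodingDemo {
    static String redecode(String misdecoded) {
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587\u5206\u8bcd"; // "Chinese word segmentation" in Chinese
        // Simulate the container decoding the UTF-8 bytes with the wrong charset.
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println("repaired correctly: " + redecode(garbled).equals(original));
    }
}
```

The conversion should only be applied when the query really was mis-decoded. Applied to a string that already holds correct Chinese text, getBytes with ISO-8859-1 replaces the unmappable characters and the query is destroyed, which is presumably why the text says to try it only if garbling occurs.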
——————————-
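Key configuration changes for the server version of regain
In
file:///home/admin/domains/25q.net/
and then in
file:///home/admin/domains/25q.net/
The path must be added in both of these places; otherwise an "index empty" error occurs.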