
Notes on adding the Paoding Chinese word-segmentation module to Regain and modifying the result display

干京
2023-12-01
Regain modification notes
 
 
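I. Adding Paoding-analysis as the Chinese word-segmentation module

This is straightforward: only one source file needs to be changed.

Source file (file names were underlined in the original notes): src\net\sf\regain\RegainToolKit.java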

import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
...

  public static Analyzer createAnalyzer(String analyzerType,
    String[] stopWordList, String[] exclusionList, String[] untokenizedFieldNames)
    throws RegainException
  {
    ...
    if (analyzerType.equalsIgnoreCase("english")) {
      analyzerClassName = StandardAnalyzer.class.getName();
    } else if (analyzerType.equalsIgnoreCase("german")) {
      analyzerClassName = GermanAnalyzer.class.getName();
    } else if (analyzerType.equalsIgnoreCase("chinese")) {
      analyzerClassName = ChineseAnalyzer.class.getName(); // Add by ping.
    } else if (analyzerType.equalsIgnoreCase("paoding")) {
      analyzerClassName = PaodingAnalyzer.class.getName(); // Add by ping.
    }
    ...
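
The if/else chain above grows by one branch for every new analyzer. As a rough sketch only, the same type-to-class mapping can be written as a lookup table. This is not Regain's actual code: the class AnalyzerTypeLookup and its resolve method are hypothetical names, and returning null for an unknown type is an assumption, since the excerpt does not show what Regain does in that case.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Regain: maps an analyzer type to a class name.
final class AnalyzerTypeLookup {

    // Analyzer type (lower case) -> fully qualified analyzer class name.
    private static final Map<String, String> CLASS_NAMES = new LinkedHashMap<>();
    static {
        CLASS_NAMES.put("english", "org.apache.lucene.analysis.standard.StandardAnalyzer");
        CLASS_NAMES.put("german", "org.apache.lucene.analysis.de.GermanAnalyzer");
        CLASS_NAMES.put("chinese", "org.apache.lucene.analysis.cn.ChineseAnalyzer");
        CLASS_NAMES.put("paoding", "net.paoding.analysis.analyzer.PaodingAnalyzer");
    }

    // Case-insensitive lookup, mirroring equalsIgnoreCase in the excerpt.
    // Assumption: unknown types yield null.
    static String resolve(String analyzerType) {
        if (analyzerType == null) {
            return null;
        }
        return CLASS_NAMES.get(analyzerType.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(resolve("paoding"));
    }
}
```

Only class names are stored as strings here, so the sketch compiles without Lucene or Paoding on the classpath.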
 
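The source change touches only the one file above, but a complete build and a successful run need two further changes:
1. Modify build.xml, the Ant build configuration file.
2. Copy paoding-analysis.jar into the lib directory.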
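
After copying the jar, one quick way to confirm that it is really visible at run time is to try loading the analyzer class by name. A minimal sketch, assuming the class name from the import shown earlier; ClasspathCheck and isOnClasspath are hypothetical helper names, not part of Regain or Paoding:

```java
// Hypothetical helper: checks whether a class can be located on the classpath.
final class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "net.paoding.analysis.analyzer.PaodingAnalyzer";
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found"));
    }
}
```

Run it with the same classpath that the crawler or the web application uses; a "NOT found" result points to a missing jar or a missing include in build.xml.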
 
The changes to build.xml are as follows:
[Only the modified excerpts are shown; the added lines were bold in the original]
...
  <target name="runtime-desktop" depends="prepare-once, runtime-desktop-fast">
    <echo message="Creating the jars ..." />
    <fileset id="desktop-common-jars" dir="build/included-lib-classes/common">
      <include name="org/apache/lucene/**"/>
      <include name="org/apache/log4j/**"/>
      <include name="org/apache/regexp/**"/>
       <!-- Add by ping. -->
      <include name="net/paoding/analysis/**"/>
      <include name="paoding-*.properties"/>
      <include name="org/apache/commons/**"/>
 
...
  <target name="runtime-server" depends="prepare-once, runtime-server-fast, -web-temps">
    <jar jarfile="build/runtime/crawler/${programname.file}-crawler.jar"
         compress="false"
         index="true">
      <manifest>
        <attribute name="Main-Class" value="net.sf.regain.crawler.Main"/>
      </manifest>
      <fileset dir="build/included-lib-classes/common">
        <include name="org/apache/lucene/**"/>
        <include name="org/apache/log4j/**"/>
        <include name="org/apache/regexp/**"/>
 
      <!-- Add by ping. -->
      <include name="net/paoding/analysis/**"/>
      <include name="paoding-*.properties"/>
      <include name="org/apache/commons/**"/>
...
 
    <mkdir dir="build/runtime/search/webapps"/>
    <war destfile="build/runtime/search/webapps/${programname.file}.war"
         webxml="web/server/web-inf/web.xml">
      <classes dir="build/classes">
        <exclude name="net/sf/regain/crawler/**"/>
        <exclude name="net/sf/regain/ui/desktop/**"/>
        <exclude name="net/sf/regain/util/sharedtag/simple/**"/>
        <exclude name="net/sf/regain/util/ui/**"/>
      </classes>
      <lib dir="lib">
        <include name="lucene-*.jar"/>
        <include name="jakarta-regexp-*.jar"/>
        <include name="log4j-*.jar"/>
        <!--Add by ping.-->
        <include name="paoding-*.jar"/>       
        <include name="commons-logging*.jar"/>
       
      </lib>
 
...
    <mkdir dir="${deploy-target.dir}/${programname.file}/WEB-INF/lib"/>
    <copy todir="${deploy-target.dir}/${programname.file}/WEB-INF/lib">
      <fileset dir="lib">
        <include name="lucene-*.jar"/>
        <include name="jakarta-regexp-*.jar"/>
        <include name="log4j-*.jar"/>
        <!--Add by ping.-->
        <include name="paoding-*.jar"/>
        <include name="commons-logging*.jar"/>
      </fileset>
    </copy>
 
 
II. Changing the search-result fragment length
 
 
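1. By default, each fragment shown in the search results is 100 characters long.
I find that rather short; the fragment length can be changed to 300.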
 
lucene\contrib\highlighter\src\java
    org.apache.lucene.search.highlight
        SimpleFragmenter.java
 
public class SimpleFragmenter implements Fragmenter
{
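  private static final int DEFAULT_FRAGMENT_SIZE = 100 * 3; // was 100
This constant defines the length of a search-result fragment. The default is 100 characters; the change raises it to 300.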
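
To get a feel for what the fragment size does, here is a simplified stand-in for a size-based fragmenter. It is not Lucene's SimpleFragmenter code: the class FragmentSizeSketch is hypothetical, tokens are assumed to be whitespace-separated words, and a new fragment is assumed to start at the first token whose end offset reaches the next multiple of the fragment size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of size-based fragmenting (not Lucene code).
final class FragmentSizeSketch {

    // Splits text into fragments of roughly fragmentSize characters,
    // breaking only between whitespace-separated tokens.
    static List<String> fragment(String text, int fragmentSize) {
        List<String> fragments = new ArrayList<>();
        int fragStart = 0; // offset where the current fragment begins
        int numFrags = 1;  // number of fragments started so far
        int pos = 0;
        while (pos < text.length()) {
            // Skip whitespace between tokens.
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            if (pos >= text.length()) {
                break;
            }
            int tokenStart = pos;
            while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            int tokenEnd = pos;
            // The token reaches the next size boundary: close the current fragment before it.
            if (tokenEnd >= fragmentSize * numFrags) {
                if (tokenStart > fragStart) {
                    fragments.add(text.substring(fragStart, tokenStart).trim());
                }
                fragStart = tokenStart;
                numFrags++;
            }
        }
        // Whatever is left after the last boundary forms the final fragment.
        String tail = text.substring(fragStart).trim();
        if (!tail.isEmpty()) {
            fragments.add(tail);
        }
        return fragments;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("ab ");
        }
        String text = sb.toString();
        System.out.println("size 100: " + fragment(text, 100).size() + " fragments");
        System.out.println("size 300: " + fragment(text, 300).size() + " fragments");
    }
}
```

A larger size yields fewer, longer fragments, which is the effect wanted here. Lucene's SimpleFragmenter also has a constructor that takes the fragment size, and Highlighter has setTextFragmenter, so if the Lucene version bundled with Regain provides them, the size could be set from the Regain side instead of patching the Lucene source.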
 
 
 
III. Minor changes to the search-result page
 
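1. package net.sf.regain.search.results;
SingleSearchResults.java

  public void highlightHitDocument(int index)
    ...
    resHighlSummary = highlighter.getBestFragments(tokenStream, text, 3,
        " . . .  . . . <br><span class=\"resultTag\">[Result]</span> ");

The last argument is the separator placed between the (at most 3) fragments, so it determines how the result fragments are displayed.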
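
The separator is only placed between fragments, never before the first one. A minimal sketch of that joining behaviour, assuming the fragments are already ordered best first; FragmentJoinSketch and joinBest are hypothetical names, and this stands in for, rather than reproduces, the Lucene Highlighter:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: joins at most maxNumFragments fragments with a separator.
final class FragmentJoinSketch {

    // Assumption: fragments are already sorted with the best fragment first.
    static String joinBest(List<String> fragments, int maxNumFragments, String separator) {
        int n = Math.max(0, Math.min(maxNumFragments, fragments.size()));
        return String.join(separator, fragments.subList(0, n));
    }

    public static void main(String[] args) {
        String separator = " . . .  . . . <br><span class=\"resultTag\">[Result]</span> ";
        List<String> fragments = Arrays.asList("first hit", "second hit", "third hit", "fourth hit");
        System.out.println(joinBest(fragments, 3, separator));
    }
}
```

Because the first fragment gets no separator in front of it, a leading [Result] tag has to come from the page template itself.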
 
2.web\web\common
    search.jsp
 
      <search:list msgNoResults="<tr><td colspan='2'>{msg:noResultsFound}<br/><br/></td></tr>">
        <tr><td colspan="2">
            <search:hit_typeicon imgpath="img/ext"/> <search:hit_link/>
            <span class="hitDetails">
              (<search:msg key="relevance"/>: <search:hit_score/>)<br/>
            <span class="resultTag">[Result]</span>
              <search:hit_field field="summary"/><br/>
              <search:hit_content/>
              <search:hit_path after="<br/>" createLinks="true"/>
              <search:hit_field field="mimetype"/>&nbsp;
              <span class="hitInfo"><search:hit_url beautified="true"/> - <search:hit_size/></span><br/>
            <br/></span>
        </td></tr>
      </search:list>     
    
     This defines the search-result page and the data fields it displays.
 
 
3. Add a display style
src\web\common
    regain.css
 
.resultTag {
 color: #0000FF;
 font-weight: bold;
}
 
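4. A small touch-up: the label of the button that fetches the document content is in German by default, so it is translated into English here.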
src/net/sf/regain/search/sharedlib/hit/ContentTag.java
  protected void printEndTag(PageRequest request, PageResponse response,
    Document hit, int hitIndex)
    throws RegainException {
 
    String content = null;
    content = hit.get("content");
    if (content != null) {
      String hitNumber = Integer.toString(hitIndex + 1);
      response.print("<input type=\"button\" class=\"button\" onclick=\"return toggleMe('hit_" +
        hitNumber + "')\" value=\"Click here Get " + hitNumber + " content\">");
 
After recompiling, the result looks quite good!
 