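Modifying Nutch: replace the original Chinese word segmentation with imdict-chinese-analyzer, a pure Java port of the Chinese Academy of Sciences' C segmenter.
I downloaded imdict-chinese-analyzer-java5.zip.
The Nutch version is 1.0.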
In ParseException, change the superclass from Exception to IOException.
Before:
public class ParseException extends Exception
After:
public class ParseException extends IOException
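The post does not say why the superclass changes. A likely reason (my assumption) is that the code calling the parser declares only `throws IOException`, so a ParseException that extends plain Exception no longer compiles there. Here is a minimal sketch of that effect. `QueryParserSketch`, `parse` and `parseQuery` are hypothetical names made up for illustration, not Nutch's actual classes or methods.

```java
import java.io.IOException;

// Hypothetical stand-in for the generated exception class (not Nutch's real file).
class ParseException extends IOException {
    ParseException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for a generated parser plus the code that calls it.
class QueryParserSketch {

    // Plays the role of a parser method that reports syntax errors.
    static String parse(String query) throws ParseException {
        if (query == null || query.trim().isEmpty()) {
            throw new ParseException("empty query");
        }
        return query.trim();
    }

    // The caller declares only IOException. This compiles because
    // ParseException is a subclass of IOException; with "extends Exception"
    // the compiler would reject the call to parse() as an unhandled
    // checked exception.
    static String parseQuery(String query) throws IOException {
        return parse(query);
    }

    public static void main(String[] args) {
        try {
            System.out.println(parseQuery("  nutch  "));
            parseQuery(""); // triggers the ParseException path
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

If you switch the sketch back to `extends Exception`, `parseQuery` stops compiling until its `throws` clause is widened. Changing the superclass avoids touching every caller.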
/** Returns a new token stream for text from the named field. */
public TokenStream tokenStream(String fieldName, Reader reader) {
    Analyzer analyzer;
    // Original per-field analyzer selection, disabled:
    /*
    if ("anchor".equals(fieldName))
        analyzer = ANCHOR_ANALYZER;
    else
        analyzer = CONTENT_ANALYZER;
    */
    // Use the imdict analyzer for every field instead.
    analyzer = new org.apache.lucene.analysis.cn.SmartChineseAnalyzer(true);
    return analyzer.tokenStream(fieldName, reader);
}