
An Introduction to Jcseg Word Segmentation

吕宣
2023-12-01

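Today I'd like to introduce the Jcseg word segmenter. First I'll help you get a program running; after that you can explore it further at your own pace. The steps are as follows: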

1. Unzip the archive jcseg-1.9.4-src-jar-dict.zip. Download link: http://download.csdn.net/detail/u010310183/8041677

2. Create a project of your own.

1) First create a config folder, and inside it create a file named jcseg.properties with the following content:

  

# jcseg properties file.
# bug report chenxin <chenxin619315@gmail.com>

# Jcseg function
#maximum match length. (5-7)
jcseg.maxlen=5

#Whether to recognize Chinese names. (1 to enable it and 0 to disable it)
jcseg.icnname=1

#maximum chinese word number of english chinese mixed word. 
jcseg.mixcnlen=2

#maximum length for pair punctuation text.
jcseg.pptmaxlen=15

#maximum length for chinese last name andron.
jcseg.cnmaxlnadron=1

#Whether to clear the stopwords. (set 1 to clear stopwords and 0 to keep them)
jcseg.clearstopword=0

#Whether to convert Chinese numerals to Arabic numbers. (set 1 to enable it and 0 to disable it)
# like '\u4E09\u4E07' to 30000.
jcseg.cnnumtoarabic=1

#Whether to convert Chinese fractions to Arabic fractions.
jcseg.cnfratoarabic=1

#Whether to keep unrecognized words. (set 1 to keep unrecognized words and 0 to clear them)
jcseg.keepunregword=1

#Whether to run secondary segmentation on complex English words.
jcseg.ensencondseg = 1

#min length of the secondary simple token. (better larger than 1)
jcseg.stokenminlen = 2

#threshold for Chinese name recognition.
# better not to change it unless you know what you are doing.
jcseg.nsthreshold=1000000

#The punctuation marks that will be kept inside a token. (Not at the end of the token).
jcseg.keeppunctuations=@%.&+




####about the lexicon
#prefix of lexicon file.
lexicon.prefix=lex

#suffix of lexicon file.
lexicon.suffix=lex

#absolute path of the lexicon file.
#Multiple path support from jcseg 1.9.2, use ';' to split different path.
#example: lexicon.path = /home/chenxin/lex1;/home/chenxin/lex2 (Linux)
#		: lexicon.path = D:/jcseg/lexicon/1;D:/jcseg/lexicon/2 (WinNT)
lexicon.path=D:/jcseg/lexicon/jcseg-1.9.4-src-jar-dict/jcseg-1.9.4/lexicon

#Whether to automatically reload modified lexicon files.
lexicon.autoload=1

#Poll time for auto load. (seconds)
lexicon.polltime=120




####lexicon load
#Whether to load the part of speech of the entry.
jcseg.loadpos=1

#Whether to load the pinyin of the entry.
jcseg.loadpinyin=0

#Whether to load the synonyms of the entry.
jcseg.loadsyn=1
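
A wrong lexicon.path is an easy mistake to make when copying this file, because the value above is specific to one machine. The following is a minimal sketch, not part of Jcseg, that reads jcseg.properties with the standard java.util.Properties class, splits lexicon.path on ';' as the comments in the file describe, and reports whether each directory exists. The class name LexiconPathCheck and the method splitLexiconPath are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration; it does not use or belong to Jcseg.
class LexiconPathCheck {

    // Splits a lexicon.path value on ';' and returns the trimmed, non-empty entries.
    static List<String> splitLexiconPath(String raw) {
        List<String> paths = new ArrayList<>();
        if (raw == null) {
            return paths;
        }
        for (String part : raw.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location, matching the config folder created above.
        String file = args.length > 0 ? args[0] : "config/jcseg.properties";
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get(file))) {
            props.load(in);
        }
        // Report whether each configured lexicon directory exists on this machine.
        for (String dir : splitLexiconPath(props.getProperty("lexicon.path"))) {
            String status = Files.isDirectory(Paths.get(dir)) ? "OK     " : "MISSING";
            System.out.println(status + " " + dir);
        }
    }
}
```

Run it from the project root; any line marked MISSING means lexicon.path should be changed to wherever you unpacked the lexicon folder.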

2) Next, create a class with the following content:

package jcseg;

import java.io.IOException;
import java.io.StringReader;

import org.lionsoul.jcseg.ASegment;
import org.lionsoul.jcseg.core.ADictionary;
import org.lionsoul.jcseg.core.DictionaryFactory;
import org.lionsoul.jcseg.core.ILexicon;
import org.lionsoul.jcseg.core.IWord;
import org.lionsoul.jcseg.core.JcsegException;
import org.lionsoul.jcseg.core.JcsegTaskConfig;
import org.lionsoul.jcseg.core.SegmentFactory;

public class test {

	public static void main(String[] args) throws IOException, JcsegException {

		// Create the JcsegTaskConfig instance for the segmentation task,
		// i.e. the configuration initialized from the jcseg.properties file
		JcsegTaskConfig config = new JcsegTaskConfig("config/jcseg.properties");
		//config.setAppendCJKPinyin(true);
		// Create the default dictionary (i.e. a com.webssky.jcseg.Dictionary object)
		// and let it load the lexicon according to the given JcsegTaskConfig instance
		ADictionary dic = DictionaryFactory
		.createDefaultDictionary(config,true);
		
		// This path points into the directory where jcseg-1.9.4-src-jar-dict.zip was unpacked.
		// Replace it with the location of lex-main.lex inside your own lexicon folder.
		dic.loadFromLexiconFile("D:/jcseg/lexicon/jcseg-1.9.4-src-jar-dict/jcseg-1.9.4/lexicon/lex-main.lex");
		
		//dic.loadFromLexiconDirectory(config, config.getLexiconPath());
		
//		System.out.println(w);
		
		// Create the ISegment from the given ADictionary and JcsegTaskConfig.
		// SegmentFactory#createJcseg is the usual way to create an ISegment object:
		// pass config and dic to SegmentFactory.createJcseg as an Object array.
		// JcsegTaskConfig.COMPLEX_MODE creates a ComplexSeg (complex-mode ISegment).
		// JcsegTaskConfig.SIMPLE_MODE creates a SimpleSeg (simple-mode ISegment).
		ASegment seg = (ASegment) SegmentFactory.createJcseg(JcsegTaskConfig.COMPLEX_MODE,new Object[]{config, dic});
		// Set the text to segment
		String str = "研究";
		seg.reset(new StringReader(str));
		// Read the segmentation results one word at a time
		IWord word = null;
		while ( (word = seg.next()) != null ) {
			System.out.println(word.getValue());
		}
	}
}
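3. Run the project. The point of this project is simply to get you up and running. When the input is 研究 ("research"), the program prints the near-synonyms of 研究 found in the lexicon, presumably because jcseg.loadsyn=1 tells Jcseg to load the synonyms of each entry.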
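
To build some intuition for what a dictionary-based segmenter does with a setting like jcseg.maxlen, here is a toy sketch of greedy forward maximum matching. It is not Jcseg's algorithm or API (Jcseg describes itself as based on MMSEG, which is more elaborate); the class ToyMaxMatch, its segment method, and the four-word dictionary are all made up for this illustration. The strings are written as \u escapes so the file compiles regardless of source encoding; they spell 研究, 研究生, 生命, 起源 and the input 研究生命起源.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only: NOT Jcseg's algorithm and NOT part of its API.
class ToyMaxMatch {

    // Greedy forward maximum matching: at each position take the longest
    // dictionary word of at most maxLen characters; if none matches,
    // emit a single character and move on.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String match = text.substring(i, i + 1); // fallback: one character
            for (int j = end; j > i + 1; j--) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up dictionary: yanjiu, yanjiusheng, shengming, qiyuan.
        Set<String> dict = new HashSet<>(Arrays.asList(
                "\u7814\u7A76", "\u7814\u7A76\u751F", "\u751F\u547D", "\u8D77\u6E90"));
        String text = "\u7814\u7A76\u751F\u547D\u8D77\u6E90";
        System.out.println(segment(text, dict, 5)); // window of 5, as in jcseg.maxlen=5
        System.out.println(segment(text, dict, 2)); // a shorter window changes the split
    }
}
```

With a window of 5 the toy picks 研究生 first and is left with 命 as a single character before 起源; with a window of 2 it produces 研究, 生命, 起源. This is only meant to show why the maximum match length matters, not to predict what Jcseg itself would output.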

4. To learn more about what Jcseg can do, download the documentation here: http://download.csdn.net/detail/u010310183/8041725

