Carrot2 聚类算法概要说明
一、实验环境:Carrort2
输入数据类型:数组
输入值:
String[][] documents = new String[][] {
{ "Introduction yourSelf", "上海" },// 0
{ "KD Nuggets", "中国上海" },// 1
{ "The Data Mine", "上海" },// 2
{ "DMG", "上海浦东" },// 3
{ "Two Crows: Data mining glossary", "中国上海" },// 4
{ "中国上海浦东",
"http://www-db.stanford.edu/~ullman/mining/mining.html" },// 5
{ "中国上海", "http://www.thearling.com/" },// 6
{ "中国上海浦西",
"http://www.eco.utexas.edu/~norman/BUS.FOR/course.mat/Alex" },// 7
{ "CCSU - Data Mining", "中国上海浦东" },// 8
{
"Data Mining: Practical Machine Learning Tools and Techniques",
"中国上海浦西" },// 19
{ "Data Mining - Monografias.com", "中国上海" } // 10
输出结果为:Clusters:
CL-1 中国上海 (4 documents)
1: dummy://6
2: dummy://10
3: dummy://1
4: dummy://4
CL-2 Data Mining (4 documents)
1: dummy://2
2: dummy://10
3: dummy://8
4: dummy://4
CL-3 中国上海浦西 (2 documents)
1: dummy://7
2: dummy://9
二、算法主要处理的类及方法
LocalControllerBase.queryàLocalProcessBase.queryàLocalInputComponentBase.endProcessing()àLingoLocalFilterComponentàMultilingualClusteringContext.cluster()àMultilingualClusteringContext.extractFeatures()àLsiClusteringStrategy.cluster()
三、算法跟踪
第一部分,抽取数据:目的,得到Feature[] singleTerms
1) 将所有的documents用特定的分隔符组成一个字符串,如下:
introduct yourself . 上海 | KD nugget . 中国上海 | the data mine . 上海 |
DMG . 上海浦东 | two crow data mine glossari . 中国上海 | 中国上海浦东 .
ullman mine | 中国上海 . | 中国上海浦西 . norman | CCSU data mine . 中国
上海浦东 | data mine practic machin learn tool and techniqu . 中国上海浦西
| data mine monografias.com . 中国上海
2) 记录每一个词的code,所有不重复的词,如下:
words = hashMap<String,int>();
{and=23, 中国上海=5, glossari=13, norman=17, techniqu=24,
monografias.com=25, 上海浦东=10, crow=12, nugget=4, tool=22, ccsu=18, kd=3
ullman=15, the=6, practic=19, dmg=9, machin=20, yourself=1, two=11, data=7,
introduct=0, 中国上海浦东=14, mine=8, learn=21, 上海=2, 中国上海浦西=16}
包装关键词,记录它们出现的次数及位置
伪代码如下:
遍历所有的词{
记录每个词的位置、code、及文档的id
}
以下给出运行时变量值:
//所有的词共有63个,其中不重复的是26个,-1表示结束,大的
整数表示标点符号或者分隔符“|”
intData[63] = [0, 1, 2147483647, 2, 2147483646, 3, 4, 2147483645, 5,
2147483644, 6, 7, 8, 2147483643, 2, 2147483642, 9, 2147483641, 10,
2147483640, 11, 12, 7, 8, 13, 2147483639, 5, 2147483638, 14, 2147483637,
15, 8, 2147483636, 5, 2147483635, 2147483634, 16, 2147483633,
17,2147483632, 18, 7, 8, 2147483631, 14, 2147483630, 7, 8, 19, 20, 21, 22,
23, 24, 2147483629, 16, 2147483628, 7, 8, 25, 2147483627, 5, -1]
//词在字串中的位置
wordPositions[63] = [0, 10, 19, 21, 24, 26, 29, 36, 38, 43, 45, 49, 54, 59, 61,
64, 66, 70, 72, 77, 79, 83, 88, 93, 98, 107, 109, 114, 116, 123, 125, 132, 137,
139, 144, 146, 148, 155, 157, 164, 166, 171, 176, 181, 183, 190, 192, 197,
202, 210, 217, 223, 228, 232, 241, 243, 250, 252, 257, 262, 278, 280, 285]
//这些词分别出现的文档
documentIndices[62] = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4,
4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 10, 10, 10, 10, 10, 10]
3)记录每个词出现的文档:下面的值是遍历intData[63]得出来的,记录每个
词出现的文档,比如introduct=0,第0个词,出现在第0个文档中
[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 1,
0, 1, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1], [0, 0,
1, 0, 1, 1, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
4)记录Feature[] singleTerms = new Feature[termCount]; //记录每个词的属性,
共26个词,tf表示在所有的文档中出现的次数。遍历所有的intData[63]得到结果。
[introduct tf=1 tfidf=0.00 ST null, yourself tf=1 tfidf=0.00 ST null, 上海 tf=2
tfidf=0.00 null, KD tf=1 tfidf=0.00 ST null, nugget tf=1 tfidf=0.00 ST null, 中国
上海 tf=4 tfidf=0.00 ST null, the tf=1 tfidf=0.00 ST null, data tf=5 tfidf=0.00
ST null, mine tf=6 tfidf=0.00 ST null, DMG tf=1 tfidf=0.00 ST null, 上海浦东
tf=1 tfidf=0.00 null, two tf=1 tfidf=0.00 ST null, crow tf=1 tfidf=0.00 ST null,
glossari tf=1 tfidf=0.00 ST null, 中国上海浦东 tf=2 tfidf=0.00 ST null, ullman
tf=1 tfidf=0.00 null, 中国上海浦西 tf=2 tfidf=0.00 ST null, norman tf=1
tfidf=0.00 null, CCSU tf=1 tfidf=0.00 ST null, practic tf=1 tfidf=0.00 ST null,
machin tf=1 tfidf=0.00 ST null, learn tf=1 tfidf=0.00 ST null, tool tf=1
tfidf=0.00 ST null, and tf=1 tfidf=0.00 ST null, techniqu tf=1 tfidf=0.00 ST null,
monografias.com tf=1 tfidf=0.00 ST null]
计算tfidf,判断是否是Strongword,如果是加权重。再次计算结果为:
[introduct tf=2 tfidf=4.80 ST null, yourself tf=2 tfidf=4.80 ST null, 上海 tf=2
tfidf=3.41 null, KD tf=2 tfidf=4.80 ST null, nugget tf=2 tfidf=4.80 ST null, 中国
上海 tf=10 tfidf=10.12 ST null, the tf=2 tfidf=4.80 ST null, data tf=12
tfidf=9.46 ST null, mine tf=15 tfidf=9.09 ST null, DMG tf=2 tfidf=4.80 ST null,
上海浦东 tf=1 tfidf=2.40 null, two tf=2 tfidf=4.80 ST null, crow tf=2
tfidf=4.80 ST null, glossari tf=2 tfidf=4.80 ST null, 中国上海浦东 tf=5
tfidf=8.52 ST null, ullman tf=1 tfidf=2.40 null, 中国上海浦西 tf=5 tfidf=8.52
ST null, norman tf=1 tfidf=2.40 null, CCSU tf=2 tfidf=4.80 ST null, practic tf=2 tfidf=4.80 ST null, machin tf=2 tfidf=4.80 ST null, learn tf=2 tfidf=4.80
ST null, tool tf=2 tfidf=4.80 ST null, and tf=2 tfidf=4.80 ST null, techniqu tf=2
tfidf=4.80 ST null, monografias.com tf=2 tfidf=4.80 ST null]
singleTerms最后一步,设置语言,格式化数据,如把glossari格式化为:glossary
[Introduction tf=2 tfidf=4.80 ST en, yourself tf=2 tfidf=4.80 SW ST en, 上海
tf=2 tfidf=3.41 en, KD tf=2 tfidf=4.80 ST en, Nuggets tf=2 tfidf=4.80 ST en, 中
国上海 tf=10 tfidf=10.12 ST en, the tf=2 tfidf=4.80 SW ST en, Data tf=12
tfidf=9.46 ST en, Mining tf=15 tfidf=9.09 ST en, DMG tf=2 tfidf=4.80 ST en, 上
海浦东 tf=1 tfidf=2.40 en, two tf=2 tfidf=4.80 SW ST en, Crows tf=2
tfidf=4.80 ST en, Glossary tf=2 tfidf=4.80 ST en, 中国上海浦东 tf=5
tfidf=8.52 ST en, Ullman tf=1 tfidf=2.40 en, 中国上海浦西 tf=5 tfidf=8.52 ST
en, Norman tf=1 tfidf=2.40 en, CCSU tf=2 tfidf=4.80 ST en, Practical tf=2
tfidf=4.80 ST en, Machine tf=2 tfidf=4.80 ST en, Learning tf=2 tfidf=4.80 ST en,
Tools tf=2 tfidf=4.80 ST en, and tf=2 tfidf=4.80 SW ST en, Techniques tf=2
tfidf=4.80 ST en, Monografias.com tf=2 tfidf=4.80 ST en]
说到这,应该给出Feature的属性及说明了,如下:
/** A unique integer code of this feature */
private int code;
/** String representation of this feature */
private String text;
/** ISO code of the language to which Lingo believes this feature belongs *
private String language;
/** The number of occurrences of this feature in all input snippets */
private int tf;
/**
singleTerms[i].setIdf(Math.log(
(double) 11 / 当前的word在所有的文档中出现的次
数); **/
private double idf;
/** Length (in words) of this feature */
private int length;
/** True if this feature is a stop word */ private boolean stopWord;
/** True if this feature is among query words */
private boolean queryWord;
/** True if this feature appeared in some snippet's title */
private boolean strong;
/**
* Used for phrase features only. An array pointing to features
* representing the phrase's individual words.
*/
private int[] phraseFeatureIndices;
/**
* 记录在哪些文档中出现了,比如“上海”这个词,这个值为:[0, 2]
*/
private int[] snippetIndices;
//即termDocument关于当前word的出现文档的数组,如: [1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],表示这个word在哪些文档中出现过。
private int[] snippetTf;
第二部分:聚类
1)排序,按非停词,tf倒排
[Mining tf=15 tfidf=9.09 ST en, Data tf=12 tfidf=9.46 ST en, 中国上海 tf=10
tfidf=10.12 ST en, 中国上海浦东 tf=5 tfidf=8.52 ST en, 中国上海
浦西 tf=5 tfidf=8.52 ST en, CCSU tf=2 tfidf=4.80 ST en, Crows tf=2 tfidf=4.80
ST en, DMG tf=2 tfidf=4.80 ST en, Glossary tf=2 tfidf=4.80 ST en, Introduction
tf=2 tfidf=4.80 ST en, KD tf=2 tfidf=4.80 ST en, Learning tf=2 tfidf=4.80 ST en,
Machine tf=2 tfidf=4.80 ST en, Monografias.com tf=2 tfidf=4.80 ST en,
Nuggets tf=2 tfidf=4.80 ST en, Practical tf=2 tfidf=4.80 ST en, Techniques tf=2
tfidf=4.80 ST en, Tools tf=2 tfidf=4.80 ST en, 上海 tf=2 tfidf=3.41 en,
Norman tf=1 tfidf=2.40 en, Ullman tf=1 tfidf=2.40 en, 上海浦东 tf=1
tfidf=2.40 en, and tf=2 tfidf=4.80 SW ST en, the tf=2 tfidf=4.80 SW ST en, two
tf=2 tfidf=4.80 SW ST en, yourself tf=2 tfidf=4.80 SW ST en]
2)以上26个关键词中,有的是停词,去除停词后生成矩阵
生成非停词的矩阵,这里二维数组的长度为:double[][] tdMatrix = new double [19][11]
矩阵生成策略:tdMatrix[term][snippetIndices[doc]] =
features[term].getSnippetTf()[snippetIndices[doc]] * features[term].getIdf();
tdMatrix = new Matrix(matrix);//其中Matrix是Jama-1.0.2的包,一个专门生
成矩阵策略的jar工具包。 (TfidfTdMatrixBuildingStrategy. buildTdMatrix)
3)根据矩阵,有N多复杂的计算,有兴趣请参阅源码:LsiClusteringStrategy. createClusters();抛开矩阵内部处理不说,有了这些数据,我们也可以编写自已的算法,得出哪些keyword或keywords集合出现在哪些文档中!达到聚类的效果!