Using word2vec (Java version)

梁华清

2023-12-01

This post only covers how to use word2vec, not how it works (for the theory, see here).

1. Download the Java version of Word2Vec (download address).

2. Prepare a corpus that suits your needs (this walkthrough uses the Sogou 2012 full-web news data).

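3. Process the corpus.
Taking the Sogou 2012 full-web news data as an example:
(1) First convert the file to UTF-8 and keep only the lines that contain <content>, which discards the other markup lines: cat news_tensite_xml.dat | iconv -f gb18030 -t utf-8 -c | grep "<content>" > corpus.txt
(2) Segment the text into words. ANSJ is used here (jar download address):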

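import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// ANSJ classes. This code assumes an ANSJ release in which
// ToAnalysis.parse returns List<Term>, as the original post does.
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

public class Test {
    public static final String TAG_START_CONTENT = "<content>";
    public static final String TAG_END_CONTENT = "</content>";

    public static void main(String[] args) {
        System.out.println("Starting segmentation...");
        // Path of the file before segmentation
        File file = new File("C:/users/xxx/Desktop/xxx");
        // try-with-resources closes both files even if an exception is thrown
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new FileInputStream(file), "UTF-8"));
             // Path of the file after segmentation; written as UTF-8 explicitly
             // so the output does not depend on the platform default encoding
             PrintWriter pw = new PrintWriter("C:/users/xxx/Desktop/xxx", "UTF-8")) {
            long start = System.currentTimeMillis();
            long allCount = 0;  // total characters processed (long to avoid overflow)
            int termcnt = 0;    // total terms written
            Set<String> set = new HashSet<String>();  // distinct terms
            String temp;
            while ((temp = reader.readLine()) != null) {
                temp = temp.trim();
                if (!temp.startsWith(TAG_START_CONTENT)) {
                    continue;
                }
                // If the closing tag is missing, take the rest of the line
                // instead of letting substring() throw
                int close = temp.indexOf(TAG_END_CONTENT);
                if (close < 0) {
                    close = temp.length();
                }
                String content = temp.substring(TAG_START_CONTENT.length(), close);
                if (content.length() > 0) {
                    allCount += content.length();
                    List<Term> result = ToAnalysis.parse(content);
                    // Write one line per article, terms separated by spaces
                    for (Term term : result) {
                        String item = term.getName().trim();
                        if (item.length() > 0) {
                            termcnt++;
                            pw.print(item + " ");
                            set.add(item);
                        }
                    }
                    pw.println();
                }
            }
            long finish = System.currentTimeMillis();
            System.out.println("Done!");
            System.out.println(termcnt + " terms in total, " + set.size()
                    + " distinct words, " + allCount + " characters, characters per second: "
                    + (allCount * 1000.0 / (finish - start)));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}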
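
If iconv and grep are not available (for example on Windows), step (1) above can be approximated in plain Java. The following is only a sketch under that assumption: the class name CorpusFilter and the method extractContentLines are made up for illustration and are not part of Word2Vec or ANSJ. Undecodable bytes are dropped, which roughly mirrors the -c flag of iconv.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of any library mentioned in this post.
class CorpusFilter {
    /**
     * Reads a GB18030 file, keeps the lines containing "<content>",
     * and writes them as UTF-8. Returns the number of lines kept.
     */
    static int extractContentLines(Path in, Path out) throws IOException {
        // Silently drop bytes that cannot be decoded, similar to iconv -c
        CharsetDecoder decoder = Charset.forName("GB18030").newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        int kept = 0;
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(in), decoder));
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("<content>")) {
                    writer.write(line);
                    writer.newLine();
                    kept++;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: raw corpus file, args[1]: filtered UTF-8 output file
        int kept = extractContentLines(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Kept " + kept + " lines");
    }
}
```

It would be run with the raw file and the output file as arguments, for example news_tensite_xml.dat and corpus.txt.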

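4. Start training
Use the Word2Vec project you just downloaded. It contains a Learn class; change the paths in it and run it to start training.
Training can take a long time. You also need to raise the maximum memory allocated to the JVM, otherwise it will run out of memory.
In Eclipse: Run -> Run Configurations -> Arguments, then add -Xmx3072m under VM arguments. Here 3072m is the maximum heap size given to the JVM; set it to whatever value you need.
Here is a model I have already trained (on the Sogou 2012 full-web news data):
Link: http://pan.baidu.com/s/1geEyJnH Password: wbu7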
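
To check that the -Xmx setting from step 4 actually reached the JVM, a few lines like the following can be run with the same run configuration. This is a minimal sketch; HeapCheck is a made-up class name. The reported figure may come out slightly below the value passed to -Xmx, depending on the JVM and garbage collector.

```java
// Hypothetical helper for verifying the JVM heap limit.
class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
    }
}
```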

5. Finally, use the trained model directly:

Word2VEC vec = new Word2VEC();
// Path to the trained model
vec.loadJavaModel("C:/xxx/xxx");
// Query word ("haha")
String str = "哈哈";
System.out.println(vec.distance(str));
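
For intuition about what a nearest-word lookup over word vectors does, here is a small self-contained sketch. It assumes, as is common for word2vec tooling, that neighbours are ranked by cosine similarity; it is not the code of the Word2VEC class, and the class CosineNeighbors and the toy vectors are invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not the library's implementation.
class CosineNeighbors {
    /** Cosine similarity of two vectors of equal length. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns all other words, most similar to the query word first. */
    static List<String> nearest(Map<String, float[]> vectors, String query) {
        float[] q = vectors.get(query);
        if (q == null) {
            throw new IllegalArgumentException("Unknown word: " + query);
        }
        List<String> words = new ArrayList<>(vectors.keySet());
        words.remove(query);
        // Sort by descending similarity to the query vector
        words.sort((x, y) -> Double.compare(
                cosine(q, vectors.get(y)), cosine(q, vectors.get(x))));
        return words;
    }

    public static void main(String[] args) {
        // Toy two-dimensional vectors; real models use hundreds of dimensions
        Map<String, float[]> toy = new LinkedHashMap<>();
        toy.put("king", new float[] {0.9f, 0.1f});
        toy.put("queen", new float[] {0.8f, 0.2f});
        toy.put("apple", new float[] {0.1f, 0.9f});
        System.out.println(nearest(toy, "king")); // prints [queen, apple]
    }
}
```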
