Question:

Using cosine similarity on two text files

叶浩荡

2023-03-14

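I am trying to use cosine similarity to measure how similar two text files are. It works when I pass the text in directly as strings, but I want to get the result after reading the text files stored on my computer.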

// Calculates the cosine similarity between two texts/documents (words separated by spaces).

import java.util.Hashtable;
import java.util.LinkedList;

public class Cosine_Similarity
{
    public class values
    {
        int val1;
        int val2;
        values(int v1, int v2)
        {
            this.val1=v1;
            this.val2=v2;
        }

        public void Update_VAl(int v1, int v2)
        {
            this.val1=v1;
            this.val2=v2;
        }
    }//end of class values

    public double Cosine_Similarity_Score(String Text1, String Text2)
    {
        double sim_score=0.0000000;
        //1. Identify distinct words from both documents
        String [] word_seq_text1 = Text1.split(" ");
        String [] word_seq_text2 = Text2.split(" ");
        Hashtable<String, values> word_freq_vector = new Hashtable<String, values>();
        LinkedList<String> Distinct_words_text_1_2 = new LinkedList<String>();

        //prepare word frequency vector by using Text1
        for(int i=0;i<word_seq_text1.length;i++)
        {
            String tmp_wd = word_seq_text1[i].trim();
            if(tmp_wd.length()>0)
            {
                if(word_freq_vector.containsKey(tmp_wd))
                {
                    values vals1 = word_freq_vector.get(tmp_wd);
                    int freq1 = vals1.val1+1;
                    int freq2 = vals1.val2;
                    vals1.Update_VAl(freq1, freq2);
                    word_freq_vector.put(tmp_wd, vals1);
                }
                else
                {
                    values vals1 = new values(1, 0);
                    word_freq_vector.put(tmp_wd, vals1);
                    Distinct_words_text_1_2.add(tmp_wd);
                }
            }
        }

        //prepare word frequency vector by using Text2
        for(int i=0;i<word_seq_text2.length;i++)
        {
            String tmp_wd = word_seq_text2[i].trim();
            if(tmp_wd.length()>0)
            {
                if(word_freq_vector.containsKey(tmp_wd))
                {
                    values vals1 = word_freq_vector.get(tmp_wd);
                    int freq1 = vals1.val1;
                    int freq2 = vals1.val2+1;
                    vals1.Update_VAl(freq1, freq2);
                    word_freq_vector.put(tmp_wd, vals1);
                }
                else
                {
                    values vals1 = new values(0, 1);
                    word_freq_vector.put(tmp_wd, vals1);
                    Distinct_words_text_1_2.add(tmp_wd);
                }
            }
        }

        //calculate the cosine similarity score.
        double VectAB = 0.0000000;
        double VectA_Sq = 0.0000000;
        double VectB_Sq = 0.0000000;

        for(int i=0;i<Distinct_words_text_1_2.size();i++)
        {
            values vals12 = word_freq_vector.get(Distinct_words_text_1_2.get(i));

            double freq1 = (double)vals12.val1;
            double freq2 = (double)vals12.val2;
            System.out.println(Distinct_words_text_1_2.get(i)+"#"+freq1+"#"+freq2);

            VectAB=VectAB+(freq1*freq2);

            VectA_Sq = VectA_Sq + freq1*freq1;
            VectB_Sq = VectB_Sq + freq2*freq2;
        }

        System.out.println("VectAB "+VectAB+" VectA_Sq "+VectA_Sq+" VectB_Sq "+VectB_Sq);
        sim_score = ((VectAB)/(Math.sqrt(VectA_Sq)*Math.sqrt(VectB_Sq)));

        return(sim_score);
    }

    public static void main(String[] args)
    {
        Cosine_Similarity cs1 = new Cosine_Similarity();

        System.out.println("[Word # VectorA # VectorB]");
        double sim_score = cs1.Cosine_Similarity_Score("this is text file one", "this is text file two");
        System.out.println("Cosine similarity score = "+sim_score);
    }
}

2 answers

俞涵涤

2023-03-14

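When you run the program, you can specify the files you want by giving their paths on the command line, then use them in your code as args. For example, you would run your program as java Cosine_Similarity path_to_text1 path_to_text2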

double sim_score = cs1.Cosine_Similarity_Score(args[0], args[1]);

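At the moment, all you are doing is comparing two strings. For short strings, you can simply pass them as arguments. If you want to use actual files, you need to pass the file paths as arguments, read each file's contents into a single string, and then compare those strings. Take a look at this answer:

Passing a file path as an argument in Java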
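The steps described above could be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the class name FileArgsDemo and the helpers joinLines and readAsSingleString are hypothetical, and it assumes the files are UTF-8 text (the default for Files.readAllLines).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper class: turns file paths given on the command line into single strings.
public class FileArgsDemo {

    // Joins the lines of a file with single spaces so the words stay separated.
    static String joinLines(List<String> lines) {
        return String.join(" ", lines);
    }

    // Reads the whole file at the given path (assumed UTF-8) into one string.
    static String readAsSingleString(String path) throws IOException {
        return joinLines(Files.readAllLines(Paths.get(path)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java FileArgsDemo path_to_text1 path_to_text2");
            return;
        }
        String text1 = readAsSingleString(args[0]);
        String text2 = readAsSingleString(args[1]);

        // With the question's class on the classpath, the two strings could then be scored:
        // double score = new Cosine_Similarity().Cosine_Similarity_Score(text1, text2);
        System.out.println("Read " + text1.length() + " and " + text2.length() + " characters");
    }
}
```

Checking args.length first avoids an ArrayIndexOutOfBoundsException when the program is started without both paths.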

史鸿运

2023-03-14

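Your code compares two text strings, not two files, so you can compare the files by first converting each one into a text string. To do that, read each file line by line and join the lines using a space as the separator.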

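// Requires: java.io.IOException, java.nio.file.Files, java.nio.file.Paths, java.util.stream.Collectors
public static void main(String[] args) throws IOException {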
    Cosine_Similarity cs = new Cosine_Similarity();

    // read file 1 and convert into a String
    String text1 = Files.readAllLines(Paths.get("path/to/file1")).stream().collect(Collectors.joining(" "));
    // read file 2 and convert into a String
    String text2 = Files.readAllLines(Paths.get("path/to/file2")).stream().collect(Collectors.joining(" "));

    double score = cs.Cosine_Similarity_Score(text1, text2);
    System.out.println("Cosine similarity score = " + score);
}

By the way, read up on the Java naming conventions and follow them!

An example:

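import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CosineSimilarity {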

    private static class Values {

        private int val1;
        private int val2;

        private Values(int v1, int v2) {
            this.val1 = v1;
            this.val2 = v2;
        }

        public void updateValues(int v1, int v2) {
            this.val1 = v1;
            this.val2 = v2;
        }
    }//end of class Values

    public double score(String text1, String text2) {
        //1. Identify distinct words from both documents
        String[] text1Words = text1.split(" ");
        String[] text2Words = text2.split(" ");
        Map<String, Values> wordFreqVector = new HashMap<>();
        List<String> distinctWords = new ArrayList<>();

        //prepare word frequency vector by using Text1
        for (String text : text1Words) {
            String word = text.trim();
            if (!word.isEmpty()) {
                if (wordFreqVector.containsKey(word)) {
                    Values vals1 = wordFreqVector.get(word);
                    int freq1 = vals1.val1 + 1;
                    int freq2 = vals1.val2;
                    vals1.updateValues(freq1, freq2);
                    wordFreqVector.put(word, vals1);
                } else {
                    Values vals1 = new Values(1, 0);
                    wordFreqVector.put(word, vals1);
                    distinctWords.add(word);
                }
            }
        }

        //prepare word frequency vector by using Text2
        for (String text : text2Words) {
            String word = text.trim();
            if (!word.isEmpty()) {
                if (wordFreqVector.containsKey(word)) {
                    Values vals1 = wordFreqVector.get(word);
                    int freq1 = vals1.val1;
                    int freq2 = vals1.val2 + 1;
                    vals1.updateValues(freq1, freq2);
                    wordFreqVector.put(word, vals1);
                } else {
                    Values vals1 = new Values(0, 1);
                    wordFreqVector.put(word, vals1);
                    distinctWords.add(word);
                }
            }
        }

        //calculate the cosine similarity score.
        double vectAB = 0.0000000;
        double vectA = 0.0000000;
        double vectB = 0.0000000;
        for (int i = 0; i < distinctWords.size(); i++) {
            Values vals12 = wordFreqVector.get(distinctWords.get(i));
            double freq1 = vals12.val1;
            double freq2 = vals12.val2;
            System.out.println(distinctWords.get(i) + "#" + freq1 + "#" + freq2);
            vectAB = vectAB + freq1 * freq2;
            vectA = vectA + freq1 * freq1;
            vectB = vectB + freq2 * freq2;
        }

        System.out.println("VectAB " + vectAB + " VectA_Sq " + vectA + " VectB_Sq " + vectB);
        return ((vectAB) / (Math.sqrt(vectA) * Math.sqrt(vectB)));
    }

    public static void main(String[] args) throws IOException {
        CosineSimilarity cs = new CosineSimilarity();

        String text1 = Files.readAllLines(Paths.get("path/to/file1")).stream().collect(Collectors.joining(" "));
        String text2 = Files.readAllLines(Paths.get("path/to/file2")).stream().collect(Collectors.joining(" "));

        double score = cs.score(text1, text2);
        System.out.println("Cosine similarity score = " + score);
    }

}
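For comparison, the same word-count approach can be written more compactly. The following is only a sketch under stated assumptions, not part of the original answer: the class name CompactCosine is hypothetical, words are split on any run of whitespace rather than on a single space, and a text with no words is assumed to score 0.0 instead of producing NaN from a division by zero.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical compact variant of the word-frequency cosine similarity shown above.
public class CompactCosine {

    // Counts whitespace-separated words; index 0 is the first text, index 1 the second.
    private static void count(String text, int index, Map<String, int[]> freq) {
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                freq.computeIfAbsent(word, k -> new int[2])[index]++;
            }
        }
    }

    public static double cosine(String text1, String text2) {
        Map<String, int[]> freq = new HashMap<>();
        count(text1, 0, freq);
        count(text2, 1, freq);

        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int[] f : freq.values()) {
            dot += (double) f[0] * f[1];
            normA += (double) f[0] * f[0];
            normB += (double) f[1] * f[1];
        }
        // Assumption: treat an empty text as having zero similarity to anything.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Four of the five words match, so the score is approximately 0.8.
        System.out.println(cosine("this is text file one", "this is text file two"));
    }
}
```

Splitting on "\\s+" also copes with tabs and repeated spaces, which can matter once the input comes from real files rather than hand-written strings.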
