问题：

利用汉明距离算法建立模糊连接索引

公西良骏

2023-03-14

我是新来的Java和地图减少，我试图写一个地图减少程序，读取程序中称为“字典”的列表单词，并使用汉明距离算法生成列表中距离为1的所有单词。我能够生成输出，但问题是它似乎非常低效，因为我需要将整个列表加载到ArrayList中，并且对于每个单词，我在Map方法中调用汉明距，所以我阅读整个列表两次，并运行汉明距离算法n*n次，其中n是列表中的字数。

请给我一些有效的方法。

这是代码。到目前为止，它还没有减速器。

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoin {


    public static class MyMapper extends Mapper<LongWritable,Text, Text, Text> {


        private List<String> Lst = new ArrayList<String>();


        protected void setup(Context context) throws java.io.IOException, InterruptedException{
            Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());


            for (Path p : files) {
                if (p.getName().equals("dictionary.txt")) {
                    BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
                    String line = reader.readLine();
                    while(line != null) {
                        String tokens = line.toString() ;

                        Lst.add(tokens);
                        line = reader.readLine();
                    }
                    reader.close();
                }
            }
            if (Lst.isEmpty()) {
                throw new IOException("Unable to load Abbrevation data.");
            }
        }


        public void map(LongWritable key, Text val, Context con)
                throws IOException,InterruptedException {

              String line1 = val.toString();
              StringTokenizer itr = new StringTokenizer(line1.toLowerCase()) ;
              while (itr.hasMoreTokens()) {

                  String key1 = itr.nextToken() ;  
                  String fnlstr = HammingDist(key1) ;

                      con.write(new Text(key1), new Text(fnlstr));

              }
            }


        private String HammingDist(String ky)
          {
              String result = "" ;
              for(String x :Lst)
              {
                  char[] s1 = ky.toCharArray();
                    char[] s2 = x.toCharArray();

                    int shorter = Math.min(s1.length, s2.length);
                    int longer = Math.max(s1.length, s2.length);

                    int distance = 0;
                    for (int i=0; i<shorter; i++) {
                        if (s1[i] != s2[i]) distance++;
                    }

                    distance += longer - shorter;

                    if (distance <2)
                    {
                        result = result +","+x ;
                    }
              }
              if(result == null) 
                  {
                  return "" ;
                  }
              else
              return result ;
          }
    }

  public static void main(String[] args) 
                  throws IOException, ClassNotFoundException, InterruptedException {

    Job job = new Job();
    job.setJarByClass(MapJoin.class);
    job.setJobName("MapJoin");
    job.setNumReduceTasks(0);

    try{
    DistributedCache.addCacheFile(new URI("/Input/dictionary.txt"), job.getConfiguration());
    }catch(Exception e){
        System.out.println(e);
    }

    job.setMapperClass(MyMapper.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);


  }
}

共有1个答案

叶福

2023-03-14

对于您在map中找到的每个令牌，您将调用HammingDist，这将迭代List

类似资料：

hammingDistance - 汉明距离

计算两个值之间的汉明距离。使用XOR运算符( ^ )查找这两个数字之间的位差，使用 toString(2) 转换为二进制字符串。使用 match(/1/g) 计算并返回字符串中 1 的数量。 const hammingDistance = (num1, num2) => ((num1 ^ num2).toString(2).match(/1/g) || '').length; hammingD
利用postgis计算旅行距离

我正在尝试写我的第一次postgis查询我的桌子如下所示这是一辆id为1的车辆的gps数据。我需要计算车辆行驶的总距离，比方说2017-05-20年，以米为单位。可以有其他具有不同ID的视图。参考（如）。https://gis.stackexchange.com/Questions/268776/Finding-Total-Distance of-Path-Along-Post-PointG
SQL中二进制字符串的汉明距离

问题内容：我在数据库中有一个表，其中将SHA256哈希存储在BINARY（32）列中。我正在寻找一种计算列中条目到提供值的汉明距离的方法，例如：（如果您想知道，字符串A和B的汉明距离定义为，其中^是按位XOR运算符，而BIT_COUNT返回二进制字符串中1的数目）。现在，我知道^运算符和BIT_COUNT函数都只能在INTEGER上使用，所以我想说，唯一的方法是将子字符串中的二进制字符串分解
计算行驶距离（不是距离）

我需要计算汽车行驶的距离！不是距离，不是距离到否。如果我们通过谷歌提供的API计算，距离可以完全不同。谷歌可以提供从一个点到另一个点的1公里距离，但汽车可以按照骑手想要的方式行驶800米。使用加速计没有帮助。它适用于步行，但绝不适用于更快的速度。我尝试过使用Google的位置API：距离到或距离之间根本不是一个选项。它可以给出与IN REAL截然不同的结果。在真实的汽车中，可以通过非常短的地方并
无法建立新连接Discord.py

你好在我的代码它打破了我的请求，我尝试了几次，但1-2小时后bot状态不再改变。我在ftp服务器上托管这些文件。以后从未检索到任务异常：异常=ConnectionError（MaxRetryError（“HTTPConnectionPool（host='username.mydomain'，port=80”）：url超过最大重试次数：/project/total_visit/count.txt
建立连接-引擎

SQLAlchemy 1.4 / 2.0 Tutorial 此页是 SQLAlchemy 1.4/2.0教程 . 上一页： SQLAlchemy 1.4/2.0教程 |下一步： |next| 建立连接-引擎任何SQLAlchemy应用程序的开始都是一个名为 Engine . 此对象充当连接到特定数据库的中心源，提供工厂和称为 connection pool 对于这些数据库连接。引擎通常是一个只为

利用汉明距离算法建立模糊连接索引

共有1个答案

相关问答

相关文章

相关阅读

相关工具

相关文档