Handling Chinese Text in HBase

小牛编辑

2023-12-01

1. Environment: HBase hbase-0.20.5, Hadoop hadoop-0.20.2, JDK 1.6.

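2. Once a table has been created in HBase, trying to write data that contains Chinese characters by typing commands in the console does not work.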

3. Chinese characters can be written from code instead. I use Java here:

	/**
	 * Inserts a single Put into the given table.
	 *
	 * @param tableName name of the target table
	 * @param conf      HBase configuration
	 */
	public static void insertData(String tableName, HBaseConfiguration conf) {
		HTable table = null;
		try {
			table = new HTable(conf, tableName);
			// Row key: time in seconds plus a 6-digit random number, so keys do not repeat
			String rowname = System.currentTimeMillis() / 1000 + "" + CommUtil.getSixRadom();
			System.out.println("rowname = " + rowname);
			Put p = new Put(Bytes.toBytes(rowname));
			long ts = System.currentTimeMillis();
			// String.getBytes() encodes with the platform default charset
			// (GBK on a Chinese-locale Windows), not necessarily UTF-8
			p.add("acc".getBytes(), ts, "大绝招".getBytes());
			p.add("pwd".getBytes(), ts, "123456".getBytes());
			p.add("sex".getBytes(), ts, "1".getBytes());
			p.add("age".getBytes(), ts, "2222".getBytes());
			table.put(p);
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			CommUtil.HBaseClose(table);
		}
	}

4. The Chinese content can be read back with a scan:

	/**
	 * Scans the table and prints the content of the "acc" column.
	 *
	 * @param tableName name of the table to scan
	 * @param conf      HBase configuration
	 */
	public static void scanData(String tableName, HBaseConfiguration conf) {
		HTable table = null;
		ResultScanner scanner = null;
		try {
			table = new HTable(conf, tableName);
			Scan s = new Scan();
			s.addColumn(Bytes.toBytes("acc"));
			scanner = table.getScanner(s);
			for (Result r = scanner.next(); r != null; r = scanner.next()) {
				byte[] value = r.getValue("acc".getBytes());
				// new String(byte[]) decodes with the platform default charset
				String m = new String(value);
				System.out.println("Found row: " + m);
			}
		} catch (Exception e) {
			e.printStackTrace();
		} finally {
			// scanner is still null if the table could not be opened
			if (scanner != null) {
				scanner.close();
			}
			CommUtil.HBaseClose(table);
		}
	}

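5. The result comes back as proper Chinese characters. This refers to a Windows environment, where the Eclipse console prints the Chinese text. If you want to print Chinese on the console under Linux, the outcome depends on the operating system's encoding.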
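
The dependence on the operating system's encoding comes from `String.getBytes()` and `new String(byte[])`, which use the JVM's default charset. The following is a minimal sketch, not part of the original code, of naming the charset explicitly so that writing and reading agree on every platform. The class and method names are made up for illustration, and `StandardCharsets` needs Java 7 or later, which is newer than the JDK 1.6 used in this post.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class CharsetRoundTrip {
    // Encode text with an explicit charset instead of the platform default.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decode bytes with an explicit charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String text = "大绝招";

        byte[] utf8Bytes = encode(text, StandardCharsets.UTF_8);
        byte[] gbkBytes = encode(text, gbk);
        System.out.println(utf8Bytes.length); // 9: three bytes per character
        System.out.println(gbkBytes.length);  // 6: two bytes per character

        // Decoding with the charset used for encoding restores the text.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8).equals(text)); // true
        // Decoding GBK bytes as UTF-8 does not.
        System.out.println(decode(gbkBytes, StandardCharsets.UTF_8).equals(text));  // false
    }
}
```

In HBase client code, `Bytes.toBytes(String)` and `Bytes.toString(byte[])` play the same role, since both always use UTF-8.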

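6. In the steps above the input is entered from code, and I only wrote a single record. If you want to use the content of a file as input, the code has to be changed, packaged as a jar, and run on Linux. This means using MapReduce (MR) code to load data into HBase. I will cover the configuration briefly in a separate article; here I assume it is already done.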

7. Write the MR code that reads the file data and writes it into the HBase table:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InsertDataToHBase {
	public static class InsertDataToHBaseMapper extends Mapper<Object, Text, NullWritable, NullWritable> {
		// Column families of each supported table
		public static String table1[] = { "field1", "field2", "field3"};
		public static String table2[] = { "field1", "field2", "field3"};
		public static String table3[] = { "field1", "field2", "field3"};
		public static HTable table = null;

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			HBaseConfiguration conf = new HBaseConfiguration();
			String table_name = context.getConfiguration().get("table_name");
			if (table == null) {
				table = new HTable(conf, table_name);
			}
		}

		@Override
		public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
			// Each input line holds one record with tab-separated fields
			String arr_value[] = value.toString().split("\t");
			String table_name = context.getConfiguration().get("table_name");
			String temp_arr[] = table1;
			int temp_value_length = 0;
			if (table_name.trim().equals("table1")) {
				temp_arr = table1;
				temp_value_length = 3;
			} else if (table_name.trim().equals("table2")) {
				temp_arr = table2;
				temp_value_length = 3;
			} else if (table_name.trim().equals("table3")) {
				temp_arr = table3;
				temp_value_length = 3;
			}
			// Skip lines whose field count does not match the table layout
			if (arr_value.length == temp_value_length) {
				String rowname = System.currentTimeMillis() / 1000 + "" + CommUtil.getSixRadom();
				Put p = new Put(Bytes.toBytes(rowname));
				for (int i = 0; i < temp_arr.length; i++) {
					// Bytes.toBytes(String) always encodes as UTF-8, whatever the JVM default charset is
					p.add(Bytes.toBytes(temp_arr[i]), Bytes.toBytes(""), Bytes.toBytes(arr_value[i]));
				}
				table.put(p);
				table.flushCommits();
			}
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
		if (otherArgs.length != 3) {
			System.err.println("Usage: InsertDataToHBase <inpath> <outpath> <tablename>");
			System.exit(2);
		}
		conf.set("table_name", otherArgs[2]);
		Job job = new Job(conf, "InsertDataToHBase");
		// Map-only job: the mapper writes to HBase directly
		job.setNumReduceTasks(0);
		job.setJarByClass(InsertDataToHBase.class);
		job.setMapperClass(InsertDataToHBaseMapper.class);
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
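
The line-handling part of the mapper can be tried without a cluster. The sketch below is my own illustration, assuming tab-separated input lines; `TabLineParser` and `parse` are hypothetical names, not part of the job above. It splits a line and rejects it when the field count does not match.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class TabLineParser {
    // Returns the fields of a tab-separated line,
    // or null when the number of fields is not the expected one.
    static String[] parse(String line, int expectedFields) {
        // The limit -1 keeps trailing empty fields, so "a\t\t" yields 3 fields.
        String[] fields = line.split("\t", -1);
        return fields.length == expectedFields ? fields : null;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("大绝招\t123456\t1", 3)));
        System.out.println(parse("only\ttwo", 3) == null);
    }
}
```

One difference from the mapper is deliberate: `split("\t")` without a limit drops trailing empty fields, so a record whose last column is empty would be skipped there, while the `-1` limit used here keeps it.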

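8. If the data file contains Chinese characters, it must be UTF-8 encoded; only then does the content read back as Chinese in the code. One thing to watch out for: if the data file is on Windows and you convert it to UTF-8 with "Save As", a 3-byte header is written in front of the text (this is the UTF-8 byte order mark, EF BB BF). If such a file is loaded, a "?" shows up in front of the Chinese text when it is queried, because Windows and Linux handle the encoding differently. To deal with this, copy the file to Linux and convert it there: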

iconv -f GBK -t UTF-8 gbk.txt -o utf8.bcp

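Also note that if the file has a .txt extension, then even after the conversion, opening it again on Windows makes the system add the 3-byte header back by default. So it is best not to give data files a .txt extension.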
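
If a file with the 3-byte header does reach the job, the header can also be dropped in code. This is a minimal sketch under my own assumptions, not something the original setup does: Java's UTF-8 decoder keeps the header as the character U+FEFF at the start of the first line, and the hypothetical `stripBom` helper removes it.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class BomStripper {
    // The UTF-8 bytes EF BB BF decode to this single character.
    static final char BOM = '\uFEFF';

    // Removes a leading byte order mark, if there is one.
    static String stripBom(String line) {
        if (!line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        byte[] text = "大绝招".getBytes(utf8);

        // Simulate the first line of a file saved with the 3-byte header.
        byte[] withBom = new byte[text.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(text, 0, withBom, 3, text.length);

        String firstLine = new String(withBom, utf8);
        System.out.println(firstLine.length());           // 4: the header plus 3 characters
        System.out.println(stripBom(firstLine).length()); // 3
    }
}
```

In a mapper this would only matter for the very first line of each file, so the check is cheap.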

9. After the Chinese text has been written to HBase, viewing it with a command shows:

1279677870714210 column=acc:, timestamp=1279677870656, value=\xB4\xF3\xBE\xF8\xD5\xD0

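The value is displayed in ASCII, as escaped hexadecimal bytes rather than as characters. You can copy it into UE (UltraEdit), switch to the hex view, and enter the bytes accordingly; the text then displays as the Chinese characters "大绝招".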
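
The same check can be done in code instead of a hex editor. The sketch below is my own, assuming the value is printed with each non-printable byte as a `\xNN` escape and that the stored bytes are GBK (B4 F3 is the GBK encoding of "大"); `ShellValueDecoder` and `unescape` are hypothetical names. It does no input validation, so a malformed escape makes `Integer.parseInt` throw.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class ShellValueDecoder {
    // Turns text such as \xB4\xF3 back into the raw bytes it stands for.
    static byte[] unescape(String shown) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < shown.length()) {
            if (shown.startsWith("\\x", i) && i + 4 <= shown.length()) {
                // Two hex digits after \x make one byte.
                out.write(Integer.parseInt(shown.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                // Printable ASCII characters are shown as themselves.
                out.write(shown.charAt(i));
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String shown = "\\xB4\\xF3\\xBE\\xF8\\xD5\\xD0";
        byte[] raw = unescape(shown);
        // Decode with the charset the bytes were written in (assumed GBK here).
        System.out.println(new String(raw, Charset.forName("GBK")));
    }
}
```

Bytes written as UTF-8 would show up as three escapes per Chinese character instead of two, and would need `UTF-8` in place of `GBK` when decoding.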

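10. To sum up: there is not much text here and mostly pasted code. Have a look, and if you think something is wrong or have questions, you can email me at dajuezhao@gmail.com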