当前位置: 首页 > 知识库问答 >
问题:

无法使用elephant-bird查询配置单元中的示例AddressBook protobuf数据

鄢松
2023-03-14
add jar /home/foo/downloads/elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar;
add jar /home/foo/downloads/elephant-bird/core/target/elephant-bird-core-4.6-SNAPSHOT.jar;
add jar /home/foo/downloads/elephant-bird/hive/target/elephant-bird-hive-4.6-SNAPSHOT.jar;

create external table addresses
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.twitter.data.proto.tutorial.AddressBookProtos$AddressBook")
STORED AS
-- elephant-bird provides an input format for use with hive
INPUTFORMAT "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
-- placeholder as we will not be writing to this table
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/user/foo/data/elephant-bird/addressbooks/';


describe formatted addresses;

OK
# col_name              data_type               comment

person array{ struct{  name:string, id:int, email:string, phone:array {struct {number:string, type:string}}}}  from deserializer
byteData                binary                  from deserializer

# Detailed Table Information
Database:               default
Owner:                  foo
CreateTime:             Tue Oct 28 13:49:53 PDT 2014
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://foo:8020/user/foo/data/elephant-bird/addressbooks
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        transient_lastDdlTime   1414529393

# Storage Information
SerDe Library:          com.twitter.elephantbird.hive.serde.ProtobufDeserializer
InputFormat:            com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.class     com.twitter.data.proto.tutorial.AddressBookProtos$AddressBook
        serialization.format    1
Time taken: 0.421 seconds, Fetched: 29 row(s)
select count(*) from addresses;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_1413311929339_0061, Tracking URL = http://foo:8088/proxy/application_1413311929339_0061/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1413311929339_0061
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
2014-10-28 13:50:37,674 Stage-1 map = 0%,  reduce = 0%
2014-10-28 13:50:51,055 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU 2.14 sec
2014-10-28 13:50:52,152 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU 2.14 sec
MapReduce Total cumulative CPU time: 2 seconds 140 msec
Ended Job = job_1413311929339_0061
MapReduce Jobs Launched:
Job 0: Reduce: 1   Cumulative CPU: 2.14 sec   HDFS Read: 0 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 140 msec
OK
0
Time taken: 37.519 seconds, Fetched: 1 row(s)
Thrift 0.7
protobuf: libprotoc 2.5.0
hadoop:
Hadoop 2.5.0-cdh5.2.0
Subversion http://github.com/cloudera/hadoop -r e1f20a08bde76a33b79df026d00a0c91b2298387
Compiled by jenkins on 2014-10-11T21:00Z
Compiled with protoc 2.5.0
From source with checksum 309bccd135b199bdfdd6df5f3f4153d

    Error: java.io.IOException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:346)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.(HadoopShimsSecure.java:293)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:407)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:560)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:168)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:332)
    ... 11 more
    Caused by: java.io.IOException: No codec for file hdfs://foo:8020/user/foo/data/elephantbird/addressbooks/1000AddressBooks-1684394246.bin found
    at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.determineFileFormat(MultiInputFormat.java:176)
    at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.createRecordReader(MultiInputFormat.java:88)
    at com.twitter.elephantbird.mapreduce.input.RawMultiInputFormat.createRecordReader(RawMultiInputFormat.java:36)
    at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.(DeprecatedInputFormatWrapper.java:256)
    at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getRecordReader(DeprecatedInputFormatWrapper.java:121)
    at com.twitter.elephantbird.mapred.input.DeprecatedFileInputFormatWrapper.getRecordReader(DeprecatedFileInputFormatWrapper.java:55)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:65)
    ... 16 more

共有1个答案

范承望
2023-03-14

你解决这个问题了吗?

我遇到了和你描述的一样的问题。

是的。你是对的,我发现原始的二进制protobuf不能直接读取。

 类似资料:
  • Elephant Bird 是 Twitter 上 LZO、Hadoop 缓存相关协议、Pig、Hive 和 HBase 代码的集合。(library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, Hive SerDe, HBa

  • 我只是切换到CDH 5.9.0(一个全新的安装,而不是升级,在一个新的集群上)。我有一个这样的表(稍微复杂一点,但我也用这个例子复制了): 如果我这样做: ==>查询成功并返回所有产品头(JSON格式) 但如果我这么做了:

  • 我在执行配置单元查询时遇到异常。我关注以下链接:http://www.thecloudavenue.com/2013/03/analysis-tweets-using-flume-hadoop-and.html 终端数据在这里:

  • 我已经创建了一个Hive托管表,并使用hadoop commnad在托管表位置复制数据。这样做之后,每当我从表中选择*时,它都不会显示任何数据。我也尝试过msck修复命令。 但我仍然无法看到任何数据使用选择逗号我有检查在托管表位置文件是可用的,但使用选择命令我不能数据。 有人能告诉我为什么我不能使用选择命令查看数据吗?注意:我的hive表是在月份列上分区的。在复制数据之前,我已经启用了下面的属性。

  • 无法通过jupyter笔记本使用pyspark将数据写入hive。 给我下面的错误 Py4JJavaError:调用o99.saveAsTable时发生错误。:org.apache.spark.sql.分析异常:java.lang.运行时异常:java.lang.运行时异常:无法实例化org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreCl