问题：

Sqoop导入为Avro数据文件在Hive中创建外部Avro表时将所有值设置为NULL

冷正信

2023-03-14

我试图使用Sqoop import-free-form查询将Oracle DB数据导入HDFS，方法是使用Oozie调度器使用“-as avrodatafile”连接两个表。以下是我的工作流程的内容。xml：

<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="sqoop-freeform-wf">
    <start to="sqoop-freeform-node"/>

    <action name="sqoop-freeform-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="/apps/hive/warehouse/loc_avro_import"/>
            </prepare>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:oracle:thin:@connection-string:1521:ORCL</arg>
            <arg>--username</arg>
            <arg>comcast</arg>
            <arg>--password</arg>
            <arg>comcast123</arg>
            <arg>--query</arg>
            <arg>select location.location_id, location.street1,location_meta.display_name from location join location_meta on location.location_id=location_meta.location_id WHERE $CONDITIONS</arg>
            <arg>--target-dir</arg>
            <arg>/apps/hive/warehouse/loc_avro_import</arg>
            <arg>--split-by</arg>
            <arg>location.location_id</arg>
            <arg>--as-avrodatafile</arg>
            <arg>-m</arg>
            <arg>1</arg>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Sqoop free form failed</message>
    </kill>
    <end name="end"/>
</workflow-app>

Oozie作业成功运行，并在HDFS上的目录 /apps/hive/warehouse/loc_avro_import下创建Avro文件以及_SUCCESS标志。然后我使用以下Hive脚本在此路径上创建一个外部表：

CREATE external TABLE avro_location(LOCATION_ID string, STREET1 string, DISPLAY_NAME string)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  location
  '/apps/hive/warehouse/loc_avro_import';

该表也已成功创建，但当我尝试使用配置单元shell重试记录时，它返回的行数与我在Oracle中执行自由格式查询时返回的行数相同。但所有行的数据都为空。我还尝试使用以下命令查看配置单元的信息日志：

hive --hiveconf hive.root.logger=INFO,console

下面是我得到的输出：

hive> select * from avro_location;
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO parse.ParseDriver: Parsing command: select * from avro_location
15/10/23 15:12:02 [main]: INFO parse.ParseDriver: Parse Completed
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=parse start=1445627522004 end=1445627522004 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Starting Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed phase 1 of Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for source tables
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for subqueries
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for destination tables
15/10/23 15:12:02 [main]: INFO ql.Context: New scratch dir is hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed getting MetaData in Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.BaseSemanticAnalyzer: Not invoking CBO because the statement has too few joins
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: columnComments is 
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: Avro schema is {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1/-mr-10000/.hive-staging_hive_2015-10-23_15-12-02_004_2341151357389322335-1
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Set stats collection dir : hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1/-mr-10000/.hive-staging_hive_2015-10-23_15-12-02_004_2341151357389322335-1/-ext-10002
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for FS(2)
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for SEL(1)
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for TS(0)
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed plan generation
15/10/23 15:12:02 [main]: INFO ql.Driver: Semantic Analysis Completed
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=semanticAnalyze start=1445627522005 end=1445627522040 duration=35 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: columnComments is 
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: Avro schema is {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initializing operator TS[0]
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initialization Done 0 TS done is reset.
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Operator 0 TS initialized
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initializing children of 0 TS
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing child 1 SEL
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing operator SEL[1]
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: SELECT struct<location_id:string,street1:string,display_name:string>
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initialization Done 1 SEL done is reset.
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Operator 1 SEL initialized
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing children of 1 SEL
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initializing child 3 OP
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initializing operator OP[3]
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initialization Done 3 OP done is reset.
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Operator 3 OP initialized
15/10/23 15:12:02 [main]: INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:avro_location.location_id, type:string, comment:null), FieldSchema(name:avro_location.street1, type:string, comment:null), FieldSchema(name:avro_location.display_name, type:string, comment:null)], properties:null)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=compile start=1445627522003 end=1445627522041 duration=38 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=acquireReadWriteLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO lockmgr.DbTxnManager: Setting lock request transaction to txnid:0 for queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012
15/10/23 15:12:02 [main]: INFO lockmgr.DbLockManager: Requesting: queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012 LockRequest(component:[LockComponent(type:SHARED_READ, level:TABLE, dbname:default, tablename:avro_location)], txnid:0, user:root, hostname:ip-sandbox.hortonworks.com)
15/10/23 15:12:02 [main]: INFO lockmgr.DbLockManager: Response to queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012 LockResponse(lockid:78, state:ACQUIRED)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=acquireReadWriteLocks start=1445627522041 end=1445627522050 duration=9 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO ql.Driver: Starting command(queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012): select * from avro_location
15/10/23 15:12:02 [main]: INFO hooks.ATSHook: Created ATS Hook
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=PreHook.org.apache.hadoop.hive.ql.hooks.ATSHook from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=PreHook.org.apache.hadoop.hive.ql.hooks.ATSHook start=1445627522050 end=1445627522050 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=TimeToSubmit start=1445627522003 end=1445627522050 duration=47 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=runTasks start=1445627522051 end=1445627522051 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO hooks.ATSHook: Created ATS Hook
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=PostHook.org.apache.hadoop.hive.ql.hooks.ATSHook from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=PostHook.org.apache.hadoop.hive.ql.hooks.ATSHook start=1445627522051 end=1445627522051 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=Driver.execute start=1445627522050 end=1445627522051 duration=1 from=org.apache.hadoop.hive.ql.Driver>
OK
15/10/23 15:12:02 [main]: INFO ql.Driver: OK
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1445627522052 end=1445627522118 duration=66 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=Driver.run start=1445627522003 end=1445627522118 duration=115 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO mapred.FileInputFormat: Total input paths to process : 1
15/10/23 15:12:02 [main]: INFO avro.AvroGenericRecordReader: Found the avro schema in the job: {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO avro.AvroDeserializer: Adding new valid RRID :678ef2a1:150961947b5:-7fff
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: 0 finished. closing... 
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: 1 finished. closing... 
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: 3 finished. closing... 
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: 3 Close done
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: 1 Close done
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: 0 Close done
Time taken: 0.115 seconds, Fetched: 30 row(s)
15/10/23 15:12:02 [main]: INFO CliDriver: Time taken: 0.115 seconds, Fetched: 30 row(s)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1445627522136 end=1445627522136 duration=0 from=org.apache.hadoop.hive.ql.Driver>

如上所述，它获取30行，但所有值都为空。请任何人帮助我如何解决这个问题。

共有2个答案

姬弘文

2023-03-14

SQOOP导入生成本地的Avsc文件...将其复制到模式文件夹。查询您的表。

龙永思

2023-03-14

我认为需要像这样为表指定avro架构位置：TBLPROPERTIES（'avro.schema.url=“”/apps/hive/warehouse/loc\u avro\u import/schemaname.avsc”）；

Sqoop导入为Avro数据文件在Hive中创建外部Avro表时将所有值设置为NULL

共有2个答案

相关问答

相关文章

相关阅读

相关工具

相关文档