Question:

How to read ORC files in pyspark 2.0 without using the metastore

公冶嘉茂
2023-03-14

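I want to read some ORC files with pyspark 2.0 without using the metastore. In theory this should be possible, because the data schema is embedded in the ORC files. But this is what I get: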

[me@hostname ~]$/usr/local/spark-2.0.0-bin-hadoop2.6/bin/pyspark
Python 2.7.11 (default, Feb 18 2016, 13:54:48)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.11 (default, Feb 18 2016 13:54:48)
SparkSession available as 'spark'.
>>> df=spark.read.orc('/my/orc/file')
16/08/21 22:29:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/21 22:30:00 ERROR metastore.RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
    at com.sun.proxy.$Proxy21.create_database(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:644)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
    at com.sun.proxy.$Proxy22.createDatabase(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:306)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply$mcV$sp(HiveClientImpl.scala:291)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:291)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createDatabase$1.apply(HiveClientImpl.scala:291)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:262)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:209)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:208)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:251)
    at org.apache.spark.sql.hive.client.HiveClientImpl.createDatabase(HiveClientImpl.scala:290)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply$mcV$sp(HiveExternalCatalog.scala:99)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:99)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createDatabase$1.apply(HiveExternalCatalog.scala:99)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72)
    at org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:98)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:147)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
    at org.apache.spark.sql.hive.HiveSessionCatalog.<init>(HiveSessionCatalog.scala:51)
    at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:49)
    at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
    at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
    at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
    at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
    at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
    at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:450)
    at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:439)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)

>>>

What is the correct way to read ORC files?

1 answer

边翔宇
2023-03-14

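I solved the problem. Although pyspark reports an ERROR, loading the data from the ORC file into a DataFrame does not actually fail. Despite the error message, the returned DataFrame can be used without any problems.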
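One way to confirm this is to force Spark to actually scan the files and then look at what comes back. Judging from the stack trace, the ERROR seems to be logged while the Hive session catalog initializes and tries to create a `default` database that already exists, so it is probably unrelated to the ORC read itself. The sketch below is only an illustration: `load_orc` and `summarize` are hypothetical helper names, and `spark` is assumed to be the SparkSession that the pyspark shell provides.

```python
def load_orc(spark, path):
    """Read ORC files with the same call used in the question."""
    return spark.read.orc(path)


def summarize(df):
    """Trigger a real read and return a small summary of the DataFrame."""
    return {
        "columns": df.columns,    # column names of the loaded DataFrame
        "row_count": df.count(),  # an action, so Spark has to scan the data
    }


# Hypothetical usage inside the pyspark shell, where `spark` already exists:
#   df = load_orc(spark, '/my/orc/file')
#   df.printSchema()
#   print(summarize(df))
```

If `printSchema()` shows the expected columns and `count()` returns without raising, the load worked and the logged ERROR can most likely be ignored.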
