Question:

AWS S3: Spark - java.lang.IllegalArgumentException: URI is not absolute ... when saving a DataFrame as JSON to an S3 location

羊舌子瑜
2023-03-14

I'm running into a strange error when saving a DataFrame to AWS S3:

 df.coalesce(1).write.mode(SaveMode.Overwrite)
      .json(s"s3://myawsacc/results/")

From the spark-shell, I can write data to the same location, and it works...

 spark.sparkContext.parallelize(1 to 4).toDF.write.mode(SaveMode.Overwrite)
          .format("com.databricks.spark.csv")
          .save(s"s3://myawsacc/results/")

My question is: why does this work in spark-shell but not via spark-submit? Is there any logic/property/configuration that explains this?

Exception in thread "main" java.lang.ExceptionInInitializerError
       at com.amazon.ws.emr.hadoop.fs.s3n.S3Credentials.initialize(S3Credentials.java:45)
       at com.amazon.ws.emr.hadoop.fs.HadoopConfigurationAWSCredentialsProvider.<init>(HadoopConfigurationAWSCredentialsProvider.java:26)
       at com.amazon.ws.emr.hadoop.fs.guice.DefaultAWSCredentialsProviderFactory.getAwsCredentialsProviderChain(DefaultAWSCredentialsProviderFactory.java:44)
       at com.amazon.ws.emr.hadoop.fs.guice.DefaultAWSCredentialsProviderFactory.getAwsCredentialsProvider(DefaultAWSCredentialsProviderFactory.java:28)
       at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.getAwsCredentialsProvider(EmrFSProdModule.java:70)
       at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.createS3Configuration(EmrFSProdModule.java:86)
       at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.createAmazonS3LiteClient(EmrFSProdModule.java:80)
       at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.createAmazonS3Lite(EmrFSProdModule.java:120)
       at com.amazon.ws.emr.hadoop.fs.guice.EmrFSBaseModule.provideAmazonS3Lite(EmrFSBaseModule.java:99)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.ProviderMethod.get(ProviderMethod.java:104)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:40)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:46)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1031)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.Scopes$1$1.get(Scopes.java:65)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:40)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.SingleFieldInjector.inject(SingleFieldInjector.java:53)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.MembersInjectorImpl.injectMembers(MembersInjectorImpl.java:110)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:94)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:254)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.FactoryProxy.get(FactoryProxy.java:54)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InjectorImpl$4$1.call(InjectorImpl.java:978)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1024)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:974)
       at com.amazon.ws.emr.hadoop.fs.shaded.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1009)
       at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:103)
       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
       at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
       at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
       at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:394)
       at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
       at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
       at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
       at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
       at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
       at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
       at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
       at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
       at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
       at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
       at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
       at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
       at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
       at org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)
       at com.org.ComparatorUtil$.writeLogNError(ComparatorUtil.scala:245)
       at com.org.ComparatorUtil$.writeToJson(ComparatorUtil.scala:161)
       at com.org.comparator.SnowFlakeTableComparator$.mainExecutor(SnowFlakeTableComparator.scala:98)
       at com.org.executor.myclass$$anonfun$main$4$$anonfun$apply$1.apply(myclass.scala:232)
       at com.org.executor.myclass$$anonfun$main$4$$anonfun$apply$1.apply(myclass.scala:153)
       at scala.collection.Iterator$class.foreach(Iterator.scala:893)
       at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
       at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
       at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
       at com.org.executor.myclass$$anonfun$main$4.apply(myclass.scala:153)
       at com.org.executor.myclass$$anonfun$main$4.apply(myclass.scala:134)
       at scala.collection.immutable.List.foreach(List.scala:381)
       at com.org.executor.myclass$.main(myclass.scala:134)
       at com.org.executor.myclass.main(myclass.scala)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
       at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
       at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   Caused by: java.lang.IllegalArgumentException: URI is not absolute
           at java.net.URI.toURL(URI.java:1088)
           at org.apache.hadoop.fs.http.AbstractHttpFileSystem.open(AbstractHttpFileSystem.java:60)
           at org.apache.hadoop.fs.http.HttpFileSystem.open(HttpFileSystem.java:23)
           at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
           at org.apache.hadoop.fs.FsUrlConnection.connect(FsUrlConnection.java:50)
           at org.apache.hadoop.fs.FsUrlConnection.getInputStream(FsUrlConnection.java:59)
           at java.net.URL.openStream(URL.java:1045)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.fasterxml.jackson.core.JsonFactory._optimizedStreamFromURL(JsonFactory.java:1479)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:779)
           at com.amazon.ws.emr.hadoop.fs.shaded.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2679)
           at com.amazon.ws.emr.hadoop.fs.util.PlatformInfo.getClusterIdFromConfigurationEndpoint(PlatformInfo.java:39)
           at com.amazon.ws.emr.hadoop.fs.util.PlatformInfo.getJobFlowId(PlatformInfo.java:53)
           at com.amazon.ws.emr.hadoop.fs.util.EmrFsUtils.getJobFlowId(EmrFsUtils.java:384)
            at com.amazon.ws.emr.hadoop.fs.util.EmrFsUtils.<clinit>(EmrFsUtils.java:60)
           ... 77 more
           

1 answer

傅俊德
2023-03-14
import java.net.URI

import org.apache.spark.sql.SaveMode  // needed for SaveMode.Overwrite
import spark.implicits._

    spark.sparkContext.parallelize(1 to 4).toDF
      .coalesce(1)
      .write.mode(SaveMode.Overwrite)
      .json(new URI("s3://myawsacc/results/").toString)

    spark.sparkContext.parallelize(1 to 4).toDF
      .coalesce(1)
      .write.mode(SaveMode.Overwrite)
      .json(URI.create("s3://myawsacc/results/").toString)

Works fine for me.

It looks like spark-shell implicitly applies new URI or URI.create to the path, which is why it works there.
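For illustration, here is the same workaround factored into a small helper (a minimal sketch only: writeJsonTo and the example path are hypothetical names, assuming a standard SparkSession-backed DataFrame):

    import java.net.URI

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Hypothetical helper: normalize the output path through java.net.URI
    // before handing it to the writer, mirroring what spark-shell appears
    // to do implicitly. URI.create fails fast if the path is malformed.
    def writeJsonTo(df: DataFrame, rawPath: String): Unit = {
      val normalized = URI.create(rawPath).toString
      df.coalesce(1)
        .write.mode(SaveMode.Overwrite)
        .json(normalized)
    }

    // Usage (path is illustrative):
    // writeJsonTo(df, "s3://myawsacc/results/")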
