Question:

Py4JJavaError: An error occurred while calling o389, when trying to write an RDD DataFrame to a parquet file in a local directory

段干兴业
2023-03-14

I am trying to write a DataFrame to a parquet file in a local directory, using the following code in a Jupyter notebook:

from pyspark.sql.types import (StructType, StructField, DateType,
                               FloatType, StringType)

rdd1 = rdd.coalesce(partitions)

schema1 = StructType([StructField('date', DateType()), StructField('open', FloatType()),
                      StructField('high', FloatType()), StructField('low', FloatType()),
                      StructField('close', FloatType()), StructField('adj_close', FloatType()),
                      StructField('volume', FloatType()), StructField('stock', StringType())])

rddDF = spark.createDataFrame(rdd1, schema=schema1)

spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

rddDF.write.parquet("C:/Users/User/Documents/File/Output/rddDF")

I get the following error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-11-7b2aeb627267> in <module>
    16 
    17 #rddDF.to_parquet("C:/Users/Sabihah/Documents/6. Processing Big Data/Output/rddDF")
---> 18 rddDF.write.parquet("C:/Users/Sabihah/Documents/6. Processing Big Data/Output/rddDF")
    19 #rddDF.write.format("parquet").save("C:/Users/Sabihah/Documents/6. Processing Big Data/Output/rddDF")

~\anaconda3\lib\site-packages\pyspark\sql\readwriter.py in parquet(self, path, mode, partitionBy, compression)
   883             self.partitionBy(partitionBy)
   884         self._set_opts(compression=compression)
--> 885         self._jwrite.parquet(path)
   886 
   887     def text(self, path, compression=None, lineSep=None):

~\anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
  1307 
  1308         answer = self.gateway_client.send_command(command)
-> 1309         return_value = get_return_value(
  1310             answer, self.gateway_client, self.target_id, self.name)
  1311 

~\anaconda3\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
   109     def deco(*a, **kw):
   110         try:
--> 111             return f(*a, **kw)
   112         except py4j.protocol.Py4JJavaError as e:
   113             converted = convert_exception(e.java_exception)

~\anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
   324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
   325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
   327                     "An error occurred while calling {0}{1}{2}.\n".
   328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o48.parquet.
: org.apache.spark.SparkException: Job aborted.
   at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
   at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
   at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
   at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
   at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
   at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
   at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
   at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
   at java.lang.reflect.Method.invoke(Unknown Source)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   at py4j.Gateway.invoke(Gateway.java:282)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
   at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1) (DESKTOP-JBUENQG executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
 File "C:\Spark\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 619, in main
 File "C:\Spark\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 611, in process
 File "C:\Spark\Spark\python\lib\pyspark.zip\pyspark\serializers.py", line 259, in dump_stream
   vs = list(itertools.islice(iterator, batch))
 File "C:\Spark\Spark\python\lib\pyspark.zip\pyspark\util.py", line 74, in wrapper
   return f(*args, **kwargs)
 File "C:\Users\Sabihah\anaconda3\lib\site-packages\pyspark\sql\session.py", line 682, in prepare
   verify_func(obj)
 File "C:\Users\Sabihah\anaconda3\lib\site-packages\pyspark\sql\types.py", line 1411, in verify
   verify_value(obj)
 File "C:\Users\Sabihah\anaconda3\lib\site-packages\pyspark\sql\types.py", line 1398, in verify_struct
   raise TypeError(new_msg("StructType can not accept object %r in type %s"
TypeError: StructType can not accept object 'close' in type <class 'str'>

   at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
   at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:703)
   at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:685)
   at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286)
   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   at org.apache.spark.scheduler.Task.run(Task.scala:131)
   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   at java.lang.Thread.run(Unknown Source)

Driver stacktrace:
   at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2403)
   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2352)
   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2351)
   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2351)
   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1109)
   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1109)
   at scala.Option.foreach(Option.scala:407)
   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1109)
   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2591)
   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2533)
   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:898)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:218)
   ... 42 more
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
 File "C:\Spark\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 619, in main
 File "C:\Spark\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 611, in process
 File "C:\Spark\Spark\python\lib\pyspark.zip\pyspark\serializers.py", line 259, in dump_stream
   vs = list(itertools.islice(iterator, batch))
 File "C:\Spark\Spark\python\lib\pyspark.zip\pyspark\util.py", line 74, in wrapper
   return f(*args, **kwargs)
 File "C:\Users\Sabihah\anaconda3\lib\site-packages\pyspark\sql\session.py", line 682, in prepare
   verify_func(obj)
 File "C:\Users\Sabihah\anaconda3\lib\site-packages\pyspark\sql\types.py", line 1411, in verify
   verify_value(obj)
 File "C:\Users\Sabihah\anaconda3\lib\site-packages\pyspark\sql\types.py", line 1398, in verify_struct
   raise TypeError(new_msg("StructType can not accept object %r in type %s"
TypeError: StructType can not accept object 'close' in type <class 'str'>

   at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:545)
   at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:703)
   at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:685)
   at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:498)
   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286)
   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$16(FileFormatWriter.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   at org.apache.spark.scheduler.Task.run(Task.scala:131)
   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   ... 1 more

I checked all the system variables: HADOOP_HOME, JAVA_HOME, SPARK_HOME, SCALA_HOME, PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON.

I installed Spark v3.2 with Hadoop v2.7 and Scala 2.12.4, later updated to Scala v2.12.10. I am using Python 3.8 in the notebook.

I tried downgrading to Python 3.7, but that did not fix the problem.

I don't know what else to try to fix this error. Any help would be appreciated.

Update: I tried fixing the data types, but the error persisted.

I then made the following change to how the DataFrame is created:

rddDF = spark.createDataFrame([rdd1],schema=schema1)

This removed the TypeError:

StructType can not accept object 'close' in type <class 'str'>

Now the error reads:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-23-ec6844f98c97> in <module>
     16 
     17 #rddDF.to_parquet("C:/Users/Sabihah/Documents/6. Processing Big Data/Output/rddDF")
---> 18 rddDF.write.parquet("C:/Users/Sabihah/Documents/6. Processing Big Data/Output/rddDF")
     19 #rddDF.write.format("parquet").save("C:/Users/Sabihah/Documents/6. Processing Big Data/Output/rddDF")

~\anaconda3\lib\site-packages\pyspark\sql\readwriter.py in parquet(self, path, mode, partitionBy, compression)
    883             self.partitionBy(partitionBy)
    884         self._set_opts(compression=compression)
--> 885         self._jwrite.parquet(path)
    886 
    887     def text(self, path, compression=None, lineSep=None):

~\anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1307 
   1308         answer = self.gateway_client.send_command(command)
-> 1309         return_value = get_return_value(
   1310             answer, self.gateway_client, self.target_id, self.name)
   1311 

~\anaconda3\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~\anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o389.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1215)
    at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1420)
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
    at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:182)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:240)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:605)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:240)
    ... 42 more

2 Answers

轩辕煜
2023-03-14

Regarding the new error:

Py4JJavaError: An error occurred while calling o389.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)

The way I solved this on a Windows machine was to add Hadoop's "bin" directory (inside the directory that HADOOP_HOME points to) to the PATH variable.

I got a different error (which I can't recall now) when my HADOOP_HOME didn't exist, or didn't point to the right location. After I fixed that, I ran into the one you are seeing. I only hit the PATH problem after trying seven different versions of the Hadoop files (winutils.exe and hadoop.dll). It's the kind of mess you get when an OSS project depends on a pile of other projects that each have their own way of resolving dependencies.

I ran the following in an elevated PowerShell prompt, restarted my VS Code instance (which is where I run my Jupyter notebooks), and was then able to save the parquet file.

[Environment]::SetEnvironmentVariable("PATH", "$env:PATH;$env:HADOOP_HOME\bin", 'Machine')
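As a sanity check before retrying the write, the same preconditions can also be verified from Python. This is a minimal sketch; the helper name and the exact checks are illustrative, not part of Spark or Hadoop:

```python
import os

def check_hadoop_setup(env=os.environ):
    """Return a list of problems with the Windows Hadoop setup; empty means OK."""
    problems = []
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
        return problems
    bin_dir = os.path.join(hadoop_home, "bin")
    # winutils.exe and hadoop.dll must both live in %HADOOP_HOME%\bin
    for name in ("winutils.exe", "hadoop.dll"):
        if not os.path.isfile(os.path.join(bin_dir, name)):
            problems.append(f"{name} missing from {bin_dir}")
    # the bin dir must also be on PATH, or hadoop.dll cannot be loaded
    path_entries = env.get("PATH", "").split(os.pathsep)
    if bin_dir not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems

print(check_hadoop_setup())
```

If this prints an empty list in the notebook's own kernel, the UnsatisfiedLinkError should be gone after a kernel restart.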

柳绪
2023-03-14

Based on the error:

TypeError: StructType can not accept object 'close' in type <class 'str'>

it seems you should define the close column as StringType, rather than as FloatType the way you have it.
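If the underlying values really are numeric and it is only the CSV header row that is still present in the RDD, the opposite fix also works: drop the header and cast each field before createDataFrame. A minimal sketch of that parsing step in plain Python, without Spark; the column order follows schema1, and the "%Y-%m-%d" date format is an assumption about the input file:

```python
from datetime import datetime

HEADER = ("date", "open", "high", "low", "close", "adj_close", "volume", "stock")

def parse_row(fields):
    """Cast one CSV record to the types in schema1; return None for the header row."""
    if tuple(fields) == HEADER:
        return None  # skip the header, so 'close' is never verified as a float
    date, open_, high, low, close, adj_close, volume, stock = fields
    return (datetime.strptime(date, "%Y-%m-%d").date(),
            float(open_), float(high), float(low), float(close),
            float(adj_close), float(volume), stock)
```

With Spark this would be wired in as something like rdd.map(parse_row).filter(lambda r: r is not None) before calling spark.createDataFrame(..., schema=schema1).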
