我正在由Glue Crawler生成的CSV可数据上运行Glue ETL作业。爬虫点击具有以下结构的目录
s3
-> aggregated output
->datafile1.csv
->datafile2.csv
->datafile3.csv
这些文件被聚合到一个“聚合输出”表中,该表可以在athena中成功查询。
我正在尝试使用AWS Glue ETL作业将其转换为拼花地板文件。作业失败
"py4j.protocol.Py4JJavaError: An error occurred while calling
o92.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: parquet-output/_temporary"
我很难找到根本原因
我尝试了多种方式修改Glue作业。我确保分配给该作业的IAM角色有权删除相关存储桶上的文件夹。现在我正在使用AWS提供的默认临时/脚本文件夹。我尝试在我的s3存储桶上使用文件夹,但看到类似的错误
s3://aws-glue-temporary-256967298135-us-east-2/admin
s3://aws-glue-scripts-256967298135-us-east-2/admin/rt-5/13-ETL-Tosat
共享下面的完整堆栈跟踪
19/05/10 17:57:59 INFO client.RMProxy: Connecting to ResourceManager at ip-172-32-29-36.us-east-2.compute.internal/172.32.29.36:8032
Container: container_1557510304861_0001_01_000002 on ip-172-32-1-101.us-east-2.compute.internal_8041
LogType:stdout
Log Upload Time:Fri May 10 17:57:53 +0000 2019
LogLength:0
Log Contents:
End of LogType:stdout
Container: container_1557510304861_0001_01_000001 on ip-172-32-26-232.us-east-2.compute.internal_8041
LogType:stdout
Log Upload Time:Fri May 10 17:57:54 +0000 2019
LogLength:9048
Log Contents:
null_fields []
Traceback (most recent call last):
File "script_2019-05-10-17-47-48.py", line 40, in <module>
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options =
{
"path": "s3://changed-s3-path-parquet-output/parquet-output"
}
, format = "parquet", transformation_ctx = "datasink4")
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/dynamicframe.py", line 585, in from_options
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/context.py", line 193, in write_dynamic_frame_from_options
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/context.py", line 216, in write_from_options
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 32, in write
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1557510304861_0001/container_1557510304861_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o92.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: parquet-output/_temporary
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:689)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:296)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.cleanupJob(FileOutputCommitter.java:463)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.abortJob(FileOutputCommitter.java:482)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:156)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:212)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at com.amazonaws.services.glue.SparkSQLDataSink.writeDynamicFrame(DataSink.scala:123)
at com.amazonaws.services.glue.DataSink.pyWriteDynamicFrame(DataSink.scala:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: 1 exceptions thrown from 1 batch deletes
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:375)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy56.deleteAll(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1369)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:687)
... 50 more
Caused by: java.io.IOException: MultiObjectDeleteException thrown with 26 keys in error: parquet-output/_temporary/0/_temporary/attempt_20190510164114_0000_m_000001_1/part-00001-7288591a-dd6f-404a-9c24-1dd70b3ff4ff-c000.snappy.parquet, parquet-output/_temporary/0/_temporary/attempt_20190510164315_0000_m_000001_2/part-00001-7288591a-dd6f-404a-9c24-1dd70b3ff4ff-c000.snappy.parquet, parquet-output/_temporary/0/_temporary_$folder$
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:360)
... 59 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: 071747B6992918B9), S3 Extended Request ID: oTBHx76MMI70zuD86AWeq4+rXa3GalW6ptQFM91ceQwhB9SBJV9Z6qw1yG2Ar2DBOr06soLL5dE=
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2084)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:25)
at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:11)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:80)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:125)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:355)
... 59 more
End of LogType:stdout
今天遇到同样的问题,结果表明,不仅需要对用于粘合作业的IAM角色策略中的相应存储桶/路径具有读/写权限,还需要s3:DeleteObject,以便作业可以清理临时目录。错误消息基本上说,它不是在写入时失败,而是在删除步骤失败。
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::target-bucket/*"
]
}
]
}
通过创建具有S3和AWSGlueServiceRole权限的新IAM角色修复了此问题
我想在阿兹卡班经营蜂巢工作
我目前正在尝试为一个项目设置Elasticsearch。我已经安装了,还安装了Java,即。 但是当我尝试使用以下命令启动Elasticsearch时 我得到以下错误 loaded:loaded(/usr/lib/systemd/system/elasticsearch.service;disabled;vend 活动:自世界协调时2019-11-01 06:09:54开始失败(结果:退出-代码)
在MySQL中,我想删除一个表。 我尝试了很多方法,但我总是得到一个错误,即不能删除名为的表。这是我得到的错误: #1217-无法删除或更新父行:外键约束失败 我怎么放下这张桌子?
我正在通过rest上传视频到我们的Azure媒体服务器,但编码工作失败,以下例外: 我可以看到它声明不支持文件类型,但是如果我手动上传它就没有问题了。 这就是我发布视频的方式 该文件存在于Azure服务器上,但无法播放。 谁能给我指个方向吗
我正在写一个简单的流媒体地图减少工作使用Python在亚马逊电子病历上运行。它基本上是用户记录的聚合器,将每个用户标识的条目分组在一起。 制图器 减速机: 此作业应在包含五个文本文件的目录上运行。EMR作业的参数包括: 输入:[桶名]/[输入文件夹名] 输出:[存储桶名称]/Output 映射器:[Bucket name]/Mapper.py Reducer:[存储桶名称]/Reducer.py
我得到了“ExecutorLostFailure(Executor1 lost)”。 我已经尝试了大部分的Spark调优配置。我已经减少到一个执行人失去,因为最初我得到了像6个执行人失败。 以下是我的配置(我的spark-submit):