当前位置: 首页 > 知识库问答 >
问题:

pyspark Cassandra:写入语句失败

公羊兴文
2023-03-14

我正在尝试通过PySpark向cassandra表写入两行。我使用datastax cassandra连接器,方法是使用以下命令启动PySpark2 shell:

pyspark2 --num-executors 1 --executor-cores 1 --packages datastax:spark-cassandra-connector:2.0.1-s_2.10 --conf spark.cassandra.connection.host=192.168.0.1

我使用以下代码创建了一个dataframe:

rdd = sc.parallelize([('Peter',1), ('Sam',2)])
df = sqlContext.createDataFrame(rdd, ["user", "year"])
df.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="users", keyspace="excelsior").save()
    Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/cloudera/parcels/SPARK2-2.0.0.cloudera2-1.cdh5.7.0.p0.118100/lib/spark2/python/pyspark/sql/readwriter.py", line 545, in save
    self._jwrite.save()
  File "/opt/cloudera/parcels/SPARK2-2.0.0.cloudera2-1.cdh5.7.0.p0.118100/lib/spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/cloudera/parcels/SPARK2-2.0.0.cloudera2-1.cdh5.7.0.p0.118100/lib/spark2/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/cloudera/parcels/SPARK2-2.0.0.cloudera2-1.cdh5.7.0.p0.118100/lib/spark2/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o449.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 4 times, most recent failure: Lost task 0.3 in stage 13.0 (TID 56, node6.agatha-cluster, executor 8): java.io.IOException: Failed to write statements to excelsior.users.
        at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:207)
        at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:175)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
        at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:175)
        at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
        at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
        at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
        at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1669)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1624)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1613)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1893)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1906)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1926)
        at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:36)
        at org.apache.spark.sql.cassandra.CassandraSourceRelation.insert(CassandraSourceRelation.scala:65)
        at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:457)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to write statements to excelsior.users.
        at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:207)
        at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:175)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
        at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
        at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
        at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
        at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:175)
        at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
        at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
        at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
        at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more

>>> 17/06/07 14:47:10 WARN scheduler.TaskSetManager: Lost task 1.3 in stage 13.0 (TID 57, node6.agatha-cluster, executor 8): org.apache.spark.TaskKilledException
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:264)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

共有1个答案

方嘉言
2023-03-14

这可能是由于当前表的复制因素造成的。考虑将它从3减少到1并尝试运行它。如果它有助于检查您的机架、网络超时和集群间通信。

 类似资料:
  • 我试图联合所有大约20个表合并成一个单一的视图。我不断得到一个错误,该错误声明: 如果我将该列注释掉,可以很好地工作。 我尝试将所有列强制转换为相同的数据类型,但没有成功。 我试着注释掉有问题的专栏,这很有效,但我需要那个专栏。 我只尝试了几个表的联合,这取决于文档类型,有时有效,有时无效。

  • 问题内容: 我正在编写一个一次性Java程序,以将CSV文件中的一堆行添加到MySQL数据库。是否有任何Java类/工具包可以帮助您解决此问题?会逃脱必要字符等的东西吗?(例如,准备好的陈述) 还是我应该自己写语句,像这样: 问题答案: 如果使用的是JDBC,请使用PreparedStatement。此类将为您节省手动转义输入的麻烦。 该代码基本上看起来像这样(完全是从内存中开始的-希望我不要忽略

  • 我正试图为我的mysqli连接编写一个非常小的抽象层,但遇到了一个问题。由于我维护的是较旧的代码,我需要从我的查询中获得一个关联数组,因为这是代码设置的方式,因此一旦这样做了,我的工作就少了...这个函数可以处理各种查询(不仅仅是选择)... 我写的函数是这样的: 添加的问题 仅仅为了保留关联数组返回,这样的开销是不是太大了?是否应该改用?

  • null 但这也不起作用。我如何添加'X'作为这个switch语句的默认值,什么可以帮助我防止自己再次犯这个错误?

  • 问题内容: 我知道有一种写简短形式的Java 语句的方法。 有谁知道如何将上述5行的缩写写成一行? 问题答案: 使用三元运算符: 我认为您的条件倒退了-如果为空,则希望该值为“ N / A”。 如果城市为空怎么办?在这种情况下,您的代码*打了床。我还要添加另一张支票:

  • 问题内容: 就像我们在C ++中有预处理器指令用于条件包含。 同样,如何在QML中进行条件设置? 问题答案: 根据您要实现的目标,可能的解决方法是使用装载程序。但是它不导入模块,而只是允许动态选择要使用的QML组件。