Question:

Spark Streaming checkpointing throws a NotSerializableException

南门向荣

2023-03-14

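We are using the receiver-based approach in Spark Streaming, and we have just enabled checkpointing to fix a data-loss problem.

The Spark version is 1.6.1, and we are receiving messages from a Kafka topic.

I am using ssc inside the foreachRDD method of the DStream, and that throws a not-serializable exception.

I tried making the class extend Serializable, but I still get the same error. It happens only when checkpointing is enabled.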

def main(args: Array[String]): Unit = {
  val checkPointLocation = "/path/to/wal"
  val ssc = StreamingContext.getOrCreate(checkPointLocation, () => createContext(checkPointLocation))
  ssc.start()
  ssc.awaitTermination()
}

def createContext(checkPointLocation: String): StreamingContext = {
  val sparkConf = new SparkConf().setAppName("Test")
  sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val ssc = new StreamingContext(sparkConf, Seconds(40))
  ssc.checkpoint(checkPointLocation)
  val sc = ssc.sparkContext
  val sqlContext: SQLContext = new HiveContext(sc)
  val kafkaParams = Map(
    "group.id" -> groupId,
    CommonClientConfigs.SECURITY_PROTOCOL_CONFIG -> sasl,
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
    "metadata.broker.list" -> brokerList,
    "zookeeper.connect" -> zookeeperURL)
  val dStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
  dStream.foreachRDD(rdd => {
    // using sparkContext / sqlContext to do any operation throws error.
    // convert RDD[String] to RDD[Row]
    // Create Schema for the RDD.
    sqlContext.createDataFrame(rdd, schema)
  })
  ssc
}

Error log:

2017-02-08 22:53:53,250 ERROR [Driver] streaming.StreamingContext: Error starting the context, marking it as stopped
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.SparkContext
Serialization stack:
    - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@1c5e3677)
    - field (class: com.x.payments.RemedyDriver$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
    - object (class com.x.payments.RemedyDriver$$anonfun$main$1, <function1>)
    - field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
    - object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, <function2>)
    - writeObject data (class: org.apache.spark.streaming.dstream.DStream)
    - object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@68866c5)
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 16)
    - field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
    - object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@68866c5))
    - writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
    - object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [
0 checkpoint files

])
    - writeObject data (class: org.apache.spark.streaming.dstream.DStream)
    - object (class org.apache.spark.streaming.kafka.KafkaInputDStream, org.apache.spark.streaming.kafka.KafkaInputDStream@acd8e32)
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 16)
    - field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
    - object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.kafka.KafkaInputDStream@acd8e32))
    - writeObject data (class: org.apache.spark.streaming.DStreamGraph)
    - object (class org.apache.spark.streaming.DStreamGraph, org.apache.spark.streaming.DStreamGraph@6935641e)
    - field (class: org.apache.spark.streaming.Checkpoint, name: graph, type: class org.apache.spark.streaming.DStreamGraph)
    - object (class org.apache.spark.streaming.Checkpoint, org.apache.spark.streaming.Checkpoint@484bf033)
    at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:557)
    at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
    at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
    at com.x.payments.RemedyDriver$.main(RemedyDriver.scala:104)
    at com.x.payments.RemedyDriver.main(RemedyDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
2017-02-08 22:53:53,250 ERROR [Driver] payments.RemedyDriver$: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.SparkContext
Serialization stack:
    - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@1c5e3677)
    - field (class: com.x.payments.RemedyDriver$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
    - object (class com.x.payments.RemedyDriver$$anonfun$main$1, <function1>)
    - field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
    - object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, <function2>)
    - writeObject data (class: org.apache.spark.streaming.dstream.DStream)
    - object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@68866c5)
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 16)
    - field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
    - object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@68866c5))
    - writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
    - object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [
0 checkpoint files

])
    - writeObject data (class: org.apache.spark.streaming.dstream.DStream)
    - object (class org.apache.spark.streaming.kafka.KafkaInputDStream, org.apache.spark.streaming.kafka.KafkaInputDStream@acd8e32)
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 16)
    - field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
    - object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.kafka.KafkaInputDStream@acd8e32))
    - writeObject data (class: org.apache.spark.streaming.DStreamGraph)
    - object (class org.apache.spark.streaming.DStreamGraph, org.apache.spark.streaming.DStreamGraph@6935641e)
    - field (class: org.apache.spark.streaming.Checkpoint, name: graph, type: class org.apache.spark.streaming.DStreamGraph)
    - object (class org.apache.spark.streaming.Checkpoint, org.apache.spark.streaming.Checkpoint@484bf033)
2017-02-08 22:53:53,255 INFO [Driver] yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0

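Update:

Basically, what we are trying to do is convert the RDD to a DataFrame (inside the DStream's foreachRDD method), apply the DataFrame API on top of it, and finally store the data in Cassandra. So we use sqlContext to convert the RDD to a DataFrame, and that is the point at which the error is thrown.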

杜河

2023-03-14

If you need access to the SparkContext, get it from the rdd value:

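dStream.foreachRDD(rdd => {
  // rdd.context is the SparkContext of the running application,
  // so nothing from the enclosing scope is captured by this closure.
  val sqlContext = new HiveContext(rdd.context)
  val dataFrameSchema = sqlContext.createDataFrame(rdd, schema)
})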
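As a variation on the snippet above, here is a rough sketch that reuses one HiveContext across batches instead of constructing a new one each time. It assumes Spark 1.6.x with the spark-hive module on the classpath. `HiveContextHolder`, `CheckpointSafeSink`, `attach`, and the single-column `schema` are hypothetical names introduced only for illustration; they are not part of Spark or of the original question.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper (not a Spark class): lazily creates a single
// HiveContext per JVM from the SparkContext carried by the batch RDD.
object HiveContextHolder {
  @transient private var instance: HiveContext = _

  def get(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

object CheckpointSafeSink {
  // Assumed schema for illustration: one string column named "value".
  val schema: StructType =
    StructType(Seq(StructField("value", StringType, nullable = true)))

  def attach(dStream: DStream[String]): Unit = {
    dStream.foreachRDD { rdd =>
      // The context is looked up from the RDD at batch time, so the closure
      // holds no reference to a SparkContext or SQLContext.
      val sqlContext = HiveContextHolder.get(rdd.sparkContext)
      val rows: RDD[Row] = rdd.map(line => Row(line))
      val df = sqlContext.createDataFrame(rows, schema)
      df.show()
    }
  }
}
```

Calling `CheckpointSafeSink.attach(dStream)` inside `createContext` would take the place of the original `foreachRDD` block. Because the context is created on first use rather than when the DStream graph is built, the same code path should also apply when the StreamingContext is restored from a checkpoint.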

By contrast, this:

dStream.foreachRDD(rdd => {
  // using sparkContext / sqlContext to do any operation throws error.
  val numRDD = sc.parallelize(1 to 10, 2)
  log.info("NUM RDD COUNT:" + numRDD.count())
})

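causes the SparkContext to be captured by the closure and serialized along with it, which cannot work because SparkContext is not serializable.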
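The capture effect can be reproduced without Spark. The following is a minimal sketch using plain Java serialization; `FakeContext` and `FakeRdd` are hypothetical stand-ins for SparkContext and RDD, and the example assumes a Scala version in which function literals are serializable (as with standard Scala 2.11+ compilers).

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureDemo {
  // Hypothetical stand-ins; neither class implements Serializable.
  final class FakeContext { def appName: String = "Test" }
  final class FakeRdd(val context: FakeContext)

  // Returns true if plain Java serialization accepts the object.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(obj) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Captures `ctx` from the enclosing scope, so serializing the function
  // also attempts to serialize the FakeContext.
  def capturing(ctx: FakeContext): FakeRdd => String = _ => ctx.appName

  // Captures nothing: the context is reached through the argument at call time.
  val viaArgument: FakeRdd => String = rdd => rdd.context.appName

  def main(args: Array[String]): Unit = {
    val ctx = new FakeContext
    println(s"capturing closure serializable: ${isSerializable(capturing(ctx))}")
    println(s"argument-only closure serializable: ${isSerializable(viaArgument)}")
  }
}
```

The first function mirrors the `sc.parallelize` case: the context becomes a field of the function object. The second mirrors `rdd.context`: the function object carries no state, so writing it out involves no context at all.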
