Question:

Parsing complex types in Spark SQL

祁鸿哲

2023-03-14

Spark DataFrame Schema

root
|-- _promotion-id: string (nullable = true)
|-- custom-attributes: struct (nullable = true)
|    |-- custom-attribute: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- _VALUE: string (nullable = true)
|    |    |    |-- _attribute-id: string (nullable = true)
|    |    |    |-- value: array (nullable = true)
|    |    |    |    |-- element: string (containsNull = true)

Sample Input Data

+-------------------------------+----------------------------------------------------------- 
|_promotion-id                  |custom-attribute                                                                                                         
+-------------------------------+----------------------------------------------------------- 
|10-off-selected-appliances-wk39|[[false, geDoNotConvert,], [false, geLoyaltyPromotion,]]
|grewards_wk38_100_prize_draw   |[[,georgeClubAnswers,[Ed, Prof, Sam]]]


Sample output data

promotion_id                     geDoNotConvert  geLoyaltyPromotion  georgeClubAnswers
10-off-selected-appliances-wk39  false           false               null
grewards_wk38_100_prize_draw     null            null                [Ed, Prof, Sam]


Sample Code

import org.apache.spark.sql.functions.{col, first}

val df1 = df.selectExpr("*", "inline(`custom-attributes`.`custom-attribute`)")
df1.groupBy("`_promotion-id`").pivot("_attribute-id").agg(first(col("`_VALUE`")))

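Data - I receive this kind of data, along with many additional columns, from XML, and I use the com.databricks spark-xml_2.11 library to convert the XML data into a DataFrame.

Requirement - I have to transform the data in the array(struct) column custom-attributes.custom-attribute as shown in the sample output. My struct has three fields, named "_VALUE", "_attribute-id" and "value". I need to turn each _attribute-id into a column name. For the data, check whether "_VALUE" is non-null: if it is, take the data from that field; otherwise take the data from the "value" field. Note that these two fields may have different data types.

Also, I know the list of attribute IDs that are required.

Approach 1

Since I know the attribute IDs, can I iterate over the array(struct) to find the struct with the matching attribute ID and pick the value from the "_VALUE"/"value" field?

Approach 2

Flatten the DataFrame with the inline function and pivot on the _attribute-id column, taking "_VALUE"/"value" as the value.

Questions:

Approach 1 - Can we implement it with a UDF? Any example would help.

Approach 2 - What if I have multiple columns of type array(struct)? Also, in the pivot and aggregation step I need a ternary operation over the "_VALUE"/"value" fields. How can we implement that? Any example would help.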
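For the ternary step in approach 2, one possible shape is to compute the chosen value with when/otherwise before pivoting. The following is a minimal sketch under assumptions, not a tested answer: the case classes Attr and Promo, the object PivotSketch, the function widen and the column name picked are all hypothetical, and the field names are simplified (attrValue, attributeId, values) to avoid backtick quoting. Because "_VALUE" is a string and "value" is an array of strings, the scalar is wrapped in a one-element array so that both branches share a type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, first, when}

// Hypothetical, simplified stand-ins for the XML-derived schema.
final case class Attr(attrValue: String, attributeId: String, values: Seq[String])
final case class Promo(promotionId: String, attrs: Seq[Attr])

object PivotSketch {
  // The attribute ids are known up front, so they are passed to pivot explicitly.
  val attributeIds: Seq[String] =
    Seq("geDoNotConvert", "geLoyaltyPromotion", "georgeClubAnswers")

  def widen(df: DataFrame): DataFrame = {
    // One row per (promotion, attribute); the struct fields become columns.
    val flat = df.selectExpr("promotionId", "inline(attrs)")
    // Ternary: prefer the scalar (wrapped in an array), otherwise use the array field.
    val picked = flat.withColumn(
      "picked",
      when(col("attrValue").isNotNull, array(col("attrValue"))).otherwise(col("values")))
    picked.groupBy("promotionId").pivot("attributeId", attributeIds).agg(first(col("picked")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("pivot-sketch").getOrCreate()
    import spark.implicits._
    val df = Seq(
      Promo("10-off-selected-appliances-wk39",
        Seq(Attr("false", "geDoNotConvert", null), Attr("false", "geLoyaltyPromotion", null))),
      Promo("grewards_wk38_100_prize_draw",
        Seq(Attr(null, "georgeClubAnswers", Seq("Ed", "Prof", "Sam"))))
    ).toDF()
    widen(df).show(false)
    spark.stop()
  }
}
```

With several array(struct) columns, the same function could presumably be applied to each column in turn and the results joined on the promotion id, though that is an assumption rather than something stated in the question.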

袁凌

2023-03-14

I will answer approach 1.

Assuming each attribute ID is unique within custom-attribute.

Define case classes (optional)

  case class Attribute(attributeId:String,_Value: String,value: Seq[String]) // Just for readable purpose.
  case class CustomAttributes(geDoNotConvert:String,geLoyaltyPromotion:String,georgeClubAnswers:Seq[String]) // Define required custom attributes

Define the UDF

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

def parseXml: UserDefinedFunction = udf((customAttribute: Seq[Row]) => {
    // Guard against a null array, then convert each struct into an Attribute
    val attributes = Option(customAttribute).getOrElse(Seq.empty[Row]).map { row =>
      val _VALUE = row.getAs[String]("_VALUE") // extracting "_VALUE"
      val _attribute_id = row.getAs[String]("_attribute-id") // extracting "_attribute-id"
      val value = row.getAs[Seq[String]]("value") // extracting "value"
      Attribute(_attribute_id, _VALUE, value) // wrapping the three fields in the case class "Attribute"
    } // all attributes as a Seq[Attribute]

    // find returns the first match; orNull yields null when the attribute is absent
    val geDoNotConvert = attributes.find(_.attributeId == "geDoNotConvert").map(_._Value).orNull
    val geLoyaltyPromotion = attributes.find(_.attributeId == "geLoyaltyPromotion").map(_._Value).orNull
    val georgeClubAnswers = attributes.find(_.attributeId == "georgeClubAnswers").map(_.value).orNull

    CustomAttributes(geDoNotConvert, geLoyaltyPromotion, georgeClubAnswers) // returning the case class
  })
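The UDF above hardcodes which field is read for each attribute. If the rule from the question is wanted instead (use "_VALUE" when it is non-null, otherwise "value"), the lookup could be factored into a plain function that can be checked without Spark. A minimal sketch: AttributeLookup and pick are hypothetical names, the Attribute case class is repeated only to keep the block self-contained, and wrapping _Value in a one-element Seq is an assumption made so that both branches return the same type.

```scala
// Same shape as the Attribute case class defined earlier in the answer.
final case class Attribute(attributeId: String, _Value: String, value: Seq[String])

object AttributeLookup {
  // Finds the first attribute with the given id and applies the fallback rule:
  // a non-null _Value wins (as a one-element Seq), otherwise value is used.
  // Returns None when the id is absent or both fields are null.
  def pick(attributes: Seq[Attribute], id: String): Option[Seq[String]] =
    attributes.find(_.attributeId == id).flatMap { a =>
      Option(a._Value).map(v => Seq(v)).orElse(Option(a.value))
    }
}
```

Inside the UDF, each field of the returned case class could then be filled with AttributeLookup.pick(attributes, "...").orNull, provided the target fields are all typed as Seq[String].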

Read the XML file and parse the required columns with the UDF

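import com.databricks.spark.xml._ // adds .xml(...) to DataFrameReader
import spark.implicits._          // enables the $"column" syntax

val df = spark.read.option("rowTag", "promotion").xml(xmlPath)
    .select($"_promotion-id", $"custom-attributes.*")
    .withColumn("customAttribute", parseXml($"custom-attribute"))
    .select("_promotion-id", "customAttribute.*")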

Printing the schema - df.printSchema()

root
 |-- _promotion-id: string (nullable = true)
 |-- geDoNotConvert: string (nullable = true)
 |-- geLoyaltyPromotion: string (nullable = true)
 |-- georgeClubAnswers: array (nullable = true)
 |    |-- element: string (containsNull = true)

Final output - df.show(false)

+-------------------------------+--------------+------------------+----------------------------------------+
|_promotion-id                  |geDoNotConvert|geLoyaltyPromotion|georgeClubAnswers                       |
+-------------------------------+--------------+------------------+----------------------------------------+
|grewards_wk38_100_prize_draw   |false         |false             |[Ed Sheeran, Professor Green, Sam Smith]|
|10-off-selected-appliances-wk39|false         |false             |null                                    |
+-------------------------------+--------------+------------------+----------------------------------------+

Execution time for two records

Approach 1 - time taken: 4698 ms

Approach 2 - time taken: 8529 ms
