How do I cast an array of structs in a Spark DataFrame using selectExpr?

陈琪

2023-03-14

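Question:

How do I cast an array of structs in a Spark DataFrame?

Let me illustrate what I am trying to do with an example. We start by creating a DataFrame that contains rows and an array of nested rows. My integers have not been cast yet in the DataFrame; they were created as strings: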

import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rows1 = Seq(
  Row("1", Row("a", "b"), "8.00", Seq(Row("1","2"), Row("12","22"))),
  Row("2", Row("c", "d"), "9.00", Seq(Row("3","4"), Row("33","44")))
)

val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)

val schema1 = StructType(
  Seq(
    StructField("id", StringType, true),
    StructField("s1", StructType(
      Seq(
        StructField("x", StringType, true),
        StructField("y", StringType, true)
      )
    ), true),
    StructField("d", StringType, true),
    StructField("s2", ArrayType(StructType(
      Seq(
        StructField("u", StringType, true),
        StructField("v", StringType, true)
      )
    )), true)
  )
)

val df1 = spark.createDataFrame(rows1Rdd, schema1)

This is the schema of the DataFrame that was created:

       df1.printSchema
       root
       |-- id: string (nullable = true)
       |-- s1: struct (nullable = true)
       |    |-- x: string (nullable = true)
       |    |-- y: string (nullable = true)
       |-- d: string (nullable = true)
       |-- s2: array (nullable = true)
       |    |-- element: struct (containsNull = true)
       |    |    |-- u: string (nullable = true)
       |    |    |-- v: string (nullable = true)

What I want to do is cast every string that can be an integer to an integer. I tried the following, but it did not work:

df1.selectExpr("CAST (id AS INTEGER) as id",
  "STRUCT (s1.x, s1.y) AS s1",
  "CAST (d AS DECIMAL) as d",
  "Array (Struct(CAST (s2.u AS INTEGER), CAST (s2.v AS INTEGER))) as s2").show()

I get the following exception:

cannot resolve 'CAST(`s2`.`u` AS INT)' due to data type mismatch: cannot cast array<string> to int; line 1 pos 14;

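Does anyone have the right query to cast all the values to INTEGER? I would be grateful.

Many thanks,

Answer:

You should cast the whole column to the complete target type, an array of structs, rather than casting its fields one at a time: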

val result = df1.selectExpr(
  "CAST(id AS integer) id",
  "s1",
  "CAST(d AS decimal) d",
  "CAST(s2 AS array<struct<u:integer,v:integer>>) s2"
)
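The cast targets the whole s2 column because accessing a field through an array column returns an array of that field, which is what the error message in the question reports (array<string> cannot be cast to int). A quick check, assuming the same spark-shell session and the df1 defined in the question (the name extracted is only illustrative):

```scala
// Assumes the spark-shell session (`spark`) and `df1` from the question.
// Accessing a struct field through an array column extracts that field
// from every element, so the result is itself an array.
val extracted = df1.selectExpr("s2.u AS u")
extracted.printSchema()
// `u` is reported as an array of strings, not a single string,
// so CAST(s2.u AS INTEGER) has nothing scalar to convert.
```

Casting s2 as a whole, as result does above, avoids this problem.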

result should then have the following schema:

result.printSchema



root
 |-- id: integer (nullable = true)
 |-- s1: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: string (nullable = true)
 |-- d: decimal(10,0) (nullable = true)
 |-- s2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- u: integer (nullable = true)
 |    |    |-- v: integer (nullable = true)

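And the data (d loses its fractional digits because a bare decimal defaults to decimal(10,0); cast to, for example, decimal(10,2) to keep them):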

result.show



+---+-----+---+----------------+
| id|   s1|  d|              s2|
+---+-----+---+----------------+
|  1|[a,b]|  8|[[1,2], [12,22]]|
|  2|[c,d]|  9|[[3,4], [33,44]]|
+---+-----+---+----------------+
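If you would rather cast element by element, closer to the original attempt, a possible alternative is the transform higher-order function. This is only a sketch under assumptions: it requires Spark 2.4 or later, it reuses df1 from the question, and the names perElement and e are illustrative.

```scala
// Assumes Spark 2.4+ (for `transform`) and `df1` from the question.
// `transform` applies the lambda to every element of the array;
// `named_struct` rebuilds each struct with integer fields u and v.
val perElement = df1.selectExpr(
  "id",
  "transform(s2, e -> named_struct('u', CAST(e.u AS integer), 'v', CAST(e.v AS integer))) AS s2"
)
perElement.printSchema()
```

This form is more verbose than the whole-column cast, but it may be useful when each element needs more than a plain type change.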
