Spark SQL: Working with JSON

轩辕炎彬
2023-12-01

Working with JSON

  • JSON data is most easily read in as line-delimited JSON
  • The schema is inferred upon load
  • If you want to flatten your JSON data, use the explode method
  • Access nested objects with dot syntax

scala> val json = spark.read.format("json").load("file:///home/hadoop/data/test.json")
json: org.apache.spark.sql.DataFrame = [name: string, nums: array<bigint>]

scala> json.show
+--------+----------------+
|    name|            nums|
+--------+----------------+
|zhangsan| [1, 2, 3, 4, 5]|
|    lisi|[6, 7, 8, 9, 10]|
+--------+----------------+
scala> json.createOrReplaceTempView("json_table")


scala> spark.sql("select * from json_table").show
+--------+----------------+
|    name|            nums|
+--------+----------------+
|zhangsan| [1, 2, 3, 4, 5]|
|    lisi|[6, 7, 8, 9, 10]|
+--------+----------------+
scala> spark.sql("select name,nums[1] from json_table").show
+--------+-------+
|    name|nums[1]|
+--------+-------+
|zhangsan|      2|
|    lisi|      7|
+--------+-------+
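Note that the bracket index is zero-based: nums[1] returns the second element of each array (2 for zhangsan, 7 for lisi). A minimal plain-Scala sketch of the same zero-based indexing (no Spark required):

```scala
// Scala Seq indexing is zero-based, just like Spark SQL's nums[1]:
// index 1 selects the second element.
val nums = Seq(1, 2, 3, 4, 5)
val second = nums(1)
println(second) // 2
```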
# If you want to flatten your JSON data, use the explode method
scala> spark.sql("select name,explode(nums) from json_table").show
+--------+---+
|    name|col|
+--------+---+
|zhangsan|  1|
|zhangsan|  2|
|zhangsan|  3|
|zhangsan|  4|
|zhangsan|  5|
|    lisi|  6|
|    lisi|  7|
|    lisi|  8|
|    lisi|  9|
|    lisi| 10|
+--------+---+
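Conceptually, explode turns each array element into its own row while repeating the other columns. As a plain-Scala analogy (no Spark required; the Rec case class is illustrative only), this is just a flatMap over the rows:

```scala
// Illustrative record type standing in for a DataFrame row.
case class Rec(name: String, nums: Seq[Int])

val rows = Seq(
  Rec("zhangsan", Seq(1, 2, 3, 4, 5)),
  Rec("lisi", Seq(6, 7, 8, 9, 10))
)

// One output tuple per array element, like SELECT name, explode(nums).
val exploded = rows.flatMap(r => r.nums.map(n => (r.name, n)))
println(exploded.take(2).mkString(", ")) // (zhangsan,1), (zhangsan,2)
```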
Contents of test2.json:
cat test2.json
{"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}
scala> val json = spark.read.format("json").load("file:///home/hadoop/data/test2.json")
json: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, name: string]


scala> json.show
+-----------------+-------+
|          address|   name|
+-----------------+-------+
|  [Columbus,Ohio]|    Yin|
|[null,California]|Michael|
+-----------------+-------+


scala> json.createOrReplaceTempView("json_table")


scala> spark.sql("select * from json_table").show
+-----------------+-------+
|          address|   name|
+-----------------+-------+
|  [Columbus,Ohio]|    Yin|
|[null,California]|Michael|
+-----------------+-------+
# Access nested objects with dot syntax

scala> spark.sql("select name,address.city,address.state from json_table").show
+-------+--------+----------+
|   name|    city|     state|
+-------+--------+----------+
|    Yin|Columbus|      Ohio|
|Michael|    null|California|
+-------+--------+----------+
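The dot syntax mirrors ordinary field access on nested records. A rough plain-Scala analogy (no Spark; these case classes are illustrative only), with the null city modeled as an empty Option:

```scala
// Illustrative nested record types standing in for the struct column.
case class Address(city: Option[String], state: String)
case class Person(name: String, address: Address)

val people = Seq(
  Person("Yin", Address(Some("Columbus"), "Ohio")),
  Person("Michael", Address(None, "California"))
)

// Equivalent of SELECT name, address.city, address.state
val flat = people.map(p => (p.name, p.address.city, p.address.state))
```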
