Question:

Reading a Hive view created with a CTE (WITH clause) from Spark

强志学
2023-03-14

I have a Hive view, created with a CTE (WITH clause), that unions two tables and then keeps only the latest record per id. In my environment we have a tool for browsing the Hive database (DBeaver, a must-have for non-datalake devs to browse the data).

The view's code:

    CREATE VIEW IF NOT EXISTS db.test_cte_view AS
      with cte as (select * from db.test_cte union select * from db.test_cte_2),
           tmp as (SELECT id, idate, ROW_NUMBER() over(PARTITION BY id ORDER BY idate desc) AS row_num from cte)
      SELECT cte.* from cte
        join (SELECT * from tmp where tmp.row_num = 1) tmp_2
          on cte.id = tmp_2.id
         and cte.idate = tmp_2.idate
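
For clarity, here is roughly the same "latest record per id" logic in the DataFrame API, reading the base tables directly. This is only a sketch (the variable names, and the direct filter on row_num instead of the view's self-join, are mine, not from the question):

    ##pyspark
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # UNION (with duplicate elimination) of the two base tables
    cte = spark.table("db.test_cte").union(spark.table("db.test_cte_2")).distinct()

    # rank the rows of each id by idate, newest first
    w = Window.partitionBy("id").orderBy(F.col("idate").desc())
    latest = (cte.withColumn("row_num", F.row_number().over(w))
                 .where(F.col("row_num") == 1)
                 .drop("row_num"))
    latest.show()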

The problem is:

(This is our main way of creating tables and views in Hive.)
I can browse it easily in DBeaver, but when a Spark process reads it, it fails with the following error:

    ##pyspark
    spark.sql("select * from db.test_cte_view").show()

    'Table or view not found: cte; line 3 pos 56'
    Traceback (most recent call last):
    File "DATA/fs3/hadoop/yarn/local/usercache/ingouagn/appcache/application_1552132357519_15102/container_e378_1552132357519_15102_01_000001/pyspark.zip/pyspark/sql/session.py", line 545, in sql
      return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
    File "/DATA/fs3/hadoop/yarn/local/usercache/ingouagn/appcache/application_1552132357519_15102/container_e378_1552132357519_15102_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
      answer, self.gateway_client, self.target_id, self.name)
    File "/DATA/fs3/hadoop/yarn/local/usercache/ingouagn/appcache/application_1552132357519_15102/container_e378_1552132357519_15102_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
      raise AnalysisException(s.split(': ', 1)[1], stackTrace)
    pyspark.sql.utils.AnalysisException: 'Table or view not found: cte; line 3 pos 56'

On the other hand, when I create the view through spark.sql instead of beeline, I can read it from Spark just fine (see below).
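
Here is a minimal sketch of that creation path (hypothetical, since the question does not show the exact call; the DDL is the same as above):

    ##pyspark
    # create the view from Spark instead of beeline (same DDL as above)
    spark.sql("""
      CREATE VIEW IF NOT EXISTS db.test_cte_view AS
      with cte as (select * from db.test_cte union select * from db.test_cte_2),
           tmp as (SELECT id, idate, ROW_NUMBER() over(PARTITION BY id ORDER BY idate desc) AS row_num from cte)
      SELECT cte.* from cte
        join (SELECT * from tmp where tmp.row_num = 1) tmp_2
          on cte.id = tmp_2.id and cte.idate = tmp_2.idate
    """)

With the view created this way, the read succeeds: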

    ##pyspark
    spark.sql("select * from db.test_cte_view").show()

But then, when trying to browse it with DBeaver, it fails with an error like the following:

    Query execution failed

    Reason:
    SQL Error [40000] [42000]: Error while compiling statement: FAILED: SemanticException line 1:330 Failed to recognize predicate 'UNION'. Failed rule: 'identifier' in subquery source in definition of VIEW test_cte_view [
    SELECT `gen_attr_0` AS `id`, `gen_attr_1` AS `status`, `gen_attr_2` AS `idate` FROM (SELECT `gen_attr_0`, `gen_attr_1`, `gen_attr_2` FROM ((SELECT `gen_attr_0`, `gen_attr_1`, `gen_attr_2` FROM (SELECT `id` AS `gen_attr_0`, `status` AS `gen_attr_1`, `idate` AS `gen_attr_2` FROM `db`.`test_cte`) AS gen_subquery_0) UNION DISTINCT (SELECT `gen_attr_5`, `gen_attr_6`, `gen_attr_7` FROM (SELECT `id` AS `gen_attr_5`, `status` AS `gen_attr_6`, `idate` AS `gen_attr_7` FROM `db`.`test_cte_2`) AS gen_subquery_1)) AS cte INNER JOIN (SELECT `gen_attr_3`, `gen_attr_4`, `gen_attr_8` FROM (SELECT `gen_attr_3`, `gen_attr_4`, `gen_attr_8` FROM (SELECT gen_subquery_4.`gen_attr_3`, gen_subquery_4.`gen_attr_4`, row_number() OVER (PARTITION BY `gen_attr_3` ORDER BY `gen_attr_4` DESC NULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS `gen_attr_8` FROM (SELECT `gen_attr_3`, `gen_attr_4` FROM ((SELECT `gen_attr_3`, `gen_attr_9`, `gen_attr_4` FROM (SELECT `id` AS `gen_attr_3`, `status` AS `gen_attr_9`, `idate` AS `gen_attr_4` FROM `db`.`test_cte`) AS gen_subquery_2) UNION DISTINCT (SELECT `gen_attr_5`, `gen_attr_6`, `gen_attr_7` FROM (SELECT `id` AS `gen_attr_5`, `status` AS `gen_attr_6`, `idate` AS `gen_attr_7` FROM `db`.`test_cte_2`) AS gen_subquery_3)) AS cte) AS gen_subquery_4) AS gen_subquery_5) AS tmp WHERE (`gen_attr_8` = 1)) AS tmp_2 ON ((`gen_attr_0` = `gen_attr_3`) AND (`gen_attr_2` = `gen_attr_4`))) AS cte
    ] used as test_cte_view at Line 1:14

It looks like the SQL generated for the view differs between one way of creating it and the other.

Is there a way to make the first scenario (creating the view through beeline and accessing it through spark.sql) work?

Thanks.

Spark: 2.1.1, Hive: 1.2.1

The tables are created with:

     CREATE TABLE db.test_cte(
      id int, 
      status string, 
      idate date )
  
     CREATE TABLE db.test_cte_2(
      id int, 
      status string, 
      idate date )

And populated with:

      insert into db.test_cte values
      (1,"green","2019-03-08"),
      (2,"green","2019-03-08"),
      (3,"green","2019-03-08"),
      (1,"red","2019-03-09"),
      (1,"yellow","2019-03-10"),
      (2,"gray","2019-03-09")
  
      insert into db.test_cte_2 values
      (10,"green","2019-03-08"),
      (20,"green","2019-03-08"),
      (30,"green","2019-03-08"),
      (10,"red","2019-03-09"),
      (10,"yellow","2019-03-10"),
      (20,"gray","2019-03-09")

EDIT:
For anyone interested, I have created a Spark JIRA issue about this: https://issues.apache.org/jira/browse/SPARK-27203

1 answer

童宏富
2023-03-14

I ran into the same issue with Spark 2.1.1.2.6.1.0-129. Upgrading to Spark 2.4 solved the problem.

If upgrading is not an option, this workaround worked for me on 2.1:

spark.table("db.my_view_with_ctes").registerTempTable("tmp")
spark.sql("select * from tmp")

This runs much slower than reading the view through spark.sql on Spark 2.4 (more than 10x the runtime for my use case), but it works.
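
As a side note, registerTempTable has been deprecated since Spark 2.0; assuming it behaves the same here, the workaround can also be written with the non-deprecated call createOrReplaceTempView:

    ##pyspark
    # the same workaround with the non-deprecated API (Spark 2.0+)
    spark.table("db.my_view_with_ctes").createOrReplaceTempView("tmp")
    spark.sql("select * from tmp").show()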
