Question:

PySpark/Spark window functions: first/last issue

胡星汉

2023-03-14

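from pyspark.sql.window import Window
from pyspark.sql.functions import first, last

# df1 is the asker's existing DataFrame with columns Age, Dept, ID, Name
AgeWindow = Window.partitionBy('Dept').orderBy('Age')
df1 = df1.withColumn('first(ID)', first('ID').over(AgeWindow))\
        .withColumn('last(ID)', last('ID').over(AgeWindow))
df1.show()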

+---+----------+---+--------+--------------------------+-------------------------+
|Age|      Dept| ID|    Name|                 first(ID)|                 last(ID)|
+---+----------+---+--------+--------------------------+-------------------------+
| 38|  medicine|  4|   harry|                         4|                        4|
| 41|  medicine|  5|hermione|                         4|                        5|
| 55|  medicine|  7| gandalf|                         4|                        7|
| 15|technology|  6|  sirius|                         6|                        6|
| 49|technology|  9|     sam|                         6|                        9|
| 88|technology|  1|     sam|                         6|                        2|
| 88|technology|  2|     nik|                         6|                        2|
| 75|       mba|  8|   ginny|                         8|                       11|
| 75|       mba| 10|     sam|                         8|                       11|
| 75|       mba|  3|     ron|                         8|                       11|
| 75|       mba| 11|     ron|                         8|                       11|
+---+----------+---+--------+--------------------------+-------------------------+

1 answer

江温书

2023-03-14

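This is not incorrect. Your window definition is simply not what you think it is.

If an ORDER BY clause is provided, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, as the physical plan shows: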

from pyspark.sql.window import Window
from pyspark.sql.functions import first, last

w = Window.partitionBy('Dept').orderBy('Age')

df = spark.createDataFrame(
    [(38, "medicine", 4), (41, "medicine", 5), (55, "medicine", 7)],
    ("Age", "Dept", "ID")
)

df.select(
    "*",
    first('ID').over(w).alias("first_id"), 
    last('ID').over(w).alias("last_id")
).explain()

== Physical Plan ==
Window [first(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS first_id#38L, last(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS last_id#40L], [Dept#23], [Age#22L ASC NULLS FIRST]
+- *(1) Sort [Dept#23 ASC NULLS FIRST, Age#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(Dept#23, 200)
      +- Scan ExistingRDD[Age#22L,Dept#23,ID#24L]
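
To make the frame in this plan concrete, here is a minimal plain-Python sketch that emulates `RangeFrame, unboundedpreceding$(), currentrow$()` on the `technology` rows from the question. The helper name `default_frame_first_last` is hypothetical and this is only an illustration of the frame rule, not how Spark evaluates windows internally.

```python
def default_frame_first_last(rows):
    """Emulate first/last over RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

    rows: list of (age, id) tuples belonging to a single partition.
    Returns a list of (age, id, first_id, last_id) in Age order.
    """
    # Python's sort is stable, so tied ages keep their input order here.
    # Spark itself gives no such guarantee for ties.
    ordered = sorted(rows, key=lambda r: r[0])
    out = []
    for age, id_ in ordered:
        # A RANGE frame ends at the last peer of the current row:
        # every row whose Age is <= the current Age, ties included.
        frame = [r for r in ordered if r[0] <= age]
        out.append((age, id_, frame[0][1], frame[-1][1]))
    return out


# (Age, ID) pairs of the "technology" department from the question
technology = [(15, 6), (49, 9), (88, 1), (88, 2)]
emulated = default_frame_first_last(technology)
for row in emulated:
    print(row)
```

The frame only ever grows up to the current Age value, so `last_id` tracks the current row, except that the two Age 88 rows are peers and therefore share the same frame and the same `last_id`.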

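This means that the window function never looks ahead: the frame ends at the current row or, because this is a RANGE frame, at the last row that shares the current row's Age. That is why both Age 88 rows in the question show last(ID) = 2. To get the first and last ID of the whole partition, redefine the window with an explicit frame: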

w_uf = (Window
   .partitionBy('Dept')
   .orderBy('Age')
   .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

result = df.select(
    "*", 
    first('ID').over(w_uf).alias("first_id"),
    last('ID').over(w_uf).alias("last_id")
)

result.explain()

== Physical Plan ==
Window [first(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS first_id#56L, last(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS last_id#58L], [Dept#23], [Age#22L ASC NULLS FIRST]
+- *(1) Sort [Dept#23 ASC NULLS FIRST, Age#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(Dept#23, 200)
      +- Scan ExistingRDD[Age#22L,Dept#23,ID#24L]

result.show()

+---+--------+---+--------+-------+
|Age|    Dept| ID|first_id|last_id|
+---+--------+---+--------+-------+
| 38|medicine|  4|       4|      7|
| 41|medicine|  5|       4|      7|
| 55|medicine|  7|       4|      7|
+---+--------+---+--------+-------+
