Question:

PySpark UDF that takes several columns as input and returns one column based on conditions

梁勇

2023-03-14

I am using Spark 2.1 and scripting in PySpark.

Problem statement: I have a scenario where I need to pass multiple columns as input and return one column as output.

a   b   c
S   S   S
S   NS  NS
S   NS  S
S   S   NS
NS  S   NS

My output must look like this:

a   b   c   d
S   S   S   S
S   NS  NS  NS
S   NS  S   S
S   S   NS  NS
NS  S   NS  NS

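I tried to register a UDF that takes these three columns [a, b, c] as input and returns column d as output, where a, b, c, and d are column names.

I am finding it hard to get the output. Below is the syntax I used: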

def return_string(x):
      if [x.a=='s' & x.b=='S' & x.c=='s']
          return 'S'
      else if[x.a=='s' & x.b=='NS' & x.c=='s']
          return 'S'
      else if[x.a=='s' & x.b=='S' & x.c=='NS']
          return 'NS;

func= udf(returnstring,types.StringType())

Can anyone help me complete this logic?

2 answers

许俊贤

2023-03-14

It should be:

from pyspark.sql.functions import udf

@udf
def return_string(a, b, c):
    if a == 's' and b == 'S' and c == 's':
        return 'S'
    if a == 's' and b == 'NS' and c == 's':
        return 'S'
    if a == 's' and b == 'S' and c == 'NS':
        return 'NS'

df = sc.parallelize([('s', 'S', 'NS'), ('?', '?', '?')]).toDF(['a', 'b', 'c'])

df.withColumn('result', return_string('a', 'b', 'c')).show()
## +---+---+---+------+
## |  a|  b|  c|result|
## +---+---+---+------+
## |  s|  S| NS|    NS|
## |  ?|  ?|  ?|  null|
## +---+---+---+------+

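All arguments should be listed (unless you pass the data as a struct).
You should use the Python keyword and rather than the & operator (the UDF body evaluates plain Python logical expressions, not SQL column expressions).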
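The struct alternative mentioned above can be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name add_result_column and the Record namedtuple are hypothetical, and the rule values are copied from the answer's own example. The rule itself is plain Python, so it can be checked without a Spark session; the pyspark imports are deferred until the helper is called.

```python
from collections import namedtuple


def return_string(row):
    """Plain-Python rule; `row` only needs attributes a, b and c."""
    if row.a == 's' and row.b == 'S' and row.c == 's':
        return 'S'
    if row.a == 's' and row.b == 'NS' and row.c == 's':
        return 'S'
    if row.a == 's' and row.b == 'S' and row.c == 'NS':
        return 'NS'
    return None  # no rule matched; Spark displays this as null


def add_result_column(df):
    """Hypothetical helper: wrap the rule as a UDF and feed it one struct column."""
    # Imported here so the rule above can be tried without pyspark installed
    from pyspark.sql.functions import struct, udf
    from pyspark.sql.types import StringType

    return_string_udf = udf(return_string, StringType())
    # struct('a', 'b', 'c') packs the three columns into a single argument,
    # which reaches the Python function as a Row with attributes a, b, c
    return df.withColumn('result', return_string_udf(struct('a', 'b', 'c')))


# Quick local check with a stand-in for a Spark Row
Record = namedtuple('Record', ['a', 'b', 'c'])
print(return_string(Record('s', 'S', 'NS')))  # NS
```

With a DataFrame df that has columns a, b, and c, add_result_column(df).show() would be the Spark-side call.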

Personally, I would skip all the ifs and use a simple dict:

@udf
def return_string(a, b, c):
    mapping = {
        ('s', 'S', 's'): 'S',
        ('s', 'NS', 's'): 'S',
        ('s', 'S', 'NS'): 'NS',
    }
    return mapping.get((a, b, c))

Adjust the conditions according to your requirements.
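As one possible adjustment, the mapping could be filled in from the five rows of the expected output in the question. This is a sketch, not a definitive implementation: the names MAPPING, lookup_d, and make_lookup_udf are hypothetical, and it assumes the values are stored in upper case exactly as shown in the question's tables.

```python
# (a, b, c) -> d, copied row by row from the expected output in the question
MAPPING = {
    ('S', 'S', 'S'): 'S',
    ('S', 'NS', 'NS'): 'NS',
    ('S', 'NS', 'S'): 'S',
    ('S', 'S', 'NS'): 'NS',
    ('NS', 'S', 'NS'): 'NS',
}


def lookup_d(a, b, c):
    """Return d for a known combination, or None (null in Spark) otherwise."""
    return MAPPING.get((a, b, c))


def make_lookup_udf():
    """Hypothetical helper: wrap lookup_d as a string-typed Spark UDF."""
    # Imported here so lookup_d can be tried without pyspark installed
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(lookup_d, StringType())
```

On the Spark side this would be used as df.withColumn('d', make_lookup_udf()('a', 'b', 'c')). Combinations that are not in the dict come back as null, so any missing case is easy to spot.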

In general, you should prefer SQL expressions, as shown in the excellent answer provided by Steven Laan (you can chain multiple conditions using when(..., ...).when(..., ...)).
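A rough sketch of what such a chained when(...).when(...) expression might look like is given below. The names RULES, first_match, and with_d_column are hypothetical, and the rule values are an assumption based on the question's tables. The first_match function is only a plain-Python mirror of the chain's behaviour (the first matching condition wins), included so the rules can be sanity-checked without Spark.

```python
# Ordered rules: ((a, b, c), d). Values are assumed from the question's tables.
RULES = [
    (('S', 'S', 'S'), 'S'),
    (('S', 'NS', 'S'), 'S'),
    (('S', 'S', 'NS'), 'NS'),
]


def first_match(a, b, c, rules=RULES):
    """Plain-Python mirror of the chain: first matching rule wins, else None."""
    for key, value in rules:
        if key == (a, b, c):
            return value
    return None


def with_d_column(df, rules=RULES):
    """Hypothetical helper: build when(...).when(...) from `rules` and add column d."""
    # Imported here so first_match can be tried without pyspark installed
    from pyspark.sql.functions import col, lit, when

    def cond(key):
        a, b, c = key
        # Column comparisons are combined with &, and each needs parentheses
        return (col('a') == a) & (col('b') == b) & (col('c') == c)

    (first_key, first_value), rest = rules[0], rules[1:]
    expr = when(cond(first_key), lit(first_value))
    for key, value in rest:
        expr = expr.when(cond(key), lit(value))
    # Without otherwise(), rows that match no rule get null
    return df.withColumn('d', expr)
```

Unlike a Python UDF, this expression is evaluated by Spark itself, so rows do not have to be serialized to a Python worker and back.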

柴飞扬

2023-03-14

I tried to implement this with the built-in withColumn and when functions:

from pyspark.sql.functions import col, when, lit

df.withColumn('d', when(
     ((col('A') == 'S') & (col('B') == 'S') & (col('C')=='S'))
   | ((col('A') == 'S') & (col('B') == 'NS') & (col('C')=='S'))
 , lit('S')
 ).otherwise(lit('NS'))
).show()

This also assumes that the two values are mutually exclusive (hence the otherwise).
