Question:

PySpark DataFrame: how to pass a string value row by row from a DataFrame column to rlike [duplicate]

邢高澹

2023-03-14

I get the following error message at runtime:

    df.withColumn("match_str", df.text1.rlike(df.match)).show(truncate=False)

        Py4JError: An error occurred while calling o2165.rlike. Trace:
        py4j.Py4JException: Method rlike([class org.apache.spark.sql.Column]) does not exist

Do you know of any workaround or solution?

    df = spark.createDataFrame([
        (1, 'test1 test1_0|test1 test0', 'This is a test1 test1_0'),
        (2, 'test2 test2_0|test1 test0', None),
        (3, 'Nan', 'Nan'),
        (4, 'test4 test4_0|test1 test0', 'This is a test4 test4_0'),
       ], ['id', 'match', 'text1'])



    +---+-------------------------+-----------------------+
    |id |match                    |text1                  |
    +---+-------------------------+-----------------------+
    |1  |test1 test1_0|test1 test0|This is a test1 test1_0|
    |2  |test2 test2_0|test1 test0|null                   |
    |3  |Nan                      |Nan                    |
    |4  |test4 test4_0|test1 test0|This is a test4 test4_0|
    +---+-------------------------+-----------------------+

    root
     |-- id: long (nullable = true)
     |-- match: string (nullable = true)
     |-- text1: string (nullable = true)


    df.withColumn("match_str", df.text1.rlike(df.select(df.match).head()["match"])).show(truncate=False)

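Note: df.select(df.match).head()["match"] takes the match value of the first row, here "test1 test1_0|test1 test0", and applies it to every row, as the output below shows. I want to pass the rlike pattern row by row instead, as the failing call repeated after the table attempts to do.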

    +---+-------------------------+-----------------------+---------+
    |id |match                    |text1                  |match_str|
    +---+-------------------------+-----------------------+---------+
    |1  |test1 test1_0|test1 test0|This is a test1 test1_0|true     |
    |2  |test2 test2_0|test1 test0|null                   |null     |
    |3  |Nan                      |Nan                    |false    |
    |4  |test4 test4_0|test1 test0|This is a test4 test4_0|false    |
    +---+-------------------------+-----------------------+---------+

    df.withColumn("match_str", df.text1.rlike(df.match)).show(truncate=False)

        Py4JError: An error occurred while calling o2165.rlike. Trace:
        py4j.Py4JException: Method rlike([class org.apache.spark.sql.Column]) does not exist

Expected result:

    +---+-------------------------+-----------------------+---------+
    |id |match                    |text1                  |match_str|
    +---+-------------------------+-----------------------+---------+
    |1  |test1 test1_0|test1 test0|This is a test1 test1_0|true     |
    |2  |test2 test2_0|test1 test0|null                   |false    |
    |3  |Nan                      |Nan                    |true     |
    |4  |test4 test4_0|test1 test0|This is a test4 test4_0|true     |
    +---+-------------------------+-----------------------+---------+

1 answer

米项禹

2023-03-14

Unfortunately, the pyspark.sql.Column.rlike() method accepts only a text pattern, not another column, as the pattern (although you can adapt it to your needs with UDFs).
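The UDF route mentioned above is not shown in the answer. The following is only a minimal sketch of what it might look like, assuming the column names text1 and match from the question. The helper names regex_find and add_match_column_udf are hypothetical, and Python's re.search stands in for the Java regex engine that rlike uses, so the two can differ on more exotic patterns.

```python
import re
from typing import Optional


def regex_find(text: Optional[str], pattern: Optional[str]) -> Optional[bool]:
    """Return True if `pattern` is found anywhere in `text`.

    Returns None when either input is None, mirroring SQL null semantics.
    """
    if text is None or pattern is None:
        return None
    return re.search(pattern, text) is not None


def add_match_column_udf(df, text_col="text1", pattern_col="match",
                         out_col="match_str"):
    """Hypothetical helper: add a boolean column using a per-row pattern."""
    # Imported here so regex_find can be tried out without a Spark install.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType

    match_udf = udf(regex_find, BooleanType())
    return df.withColumn(out_col, match_udf(col(text_col), col(pattern_col)))
```

Calling add_match_column_udf(df) on the question's DataFrame would add match_str row by row. A Python UDF is typically slower than a built-in expression, which is one reason to prefer the SQL expression approach when it is enough.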

A quick fix for your problem is to use rlike through PySpark SQL (just like rlike in regular SQL):

    >>> from pyspark.sql import *
    >>> from pyspark.sql.functions import *
    >>> df = sqlContext.createDataFrame([
    ...     (1, 'test1 test1_0|test1 test0', 'This is a test1 test1_0'),
    ...     (2, 'test2 test2_0|test1 test0', None),
    ...     (3, 'Nan', 'Nan'),
    ...     (4, 'test4 test4_0|test1 test0', 'This is a test4 test4_0')
    ...    ], ['id', 'match', 'text1'])
    >>> df.select("id", "match", "text1", expr("text1 rlike concat('(', match, ')$') as match_str")).show()
    +---+--------------------+--------------------+---------+
    | id|               match|               text1|match_str|
    +---+--------------------+--------------------+---------+
    |  1|test1 test1_0|tes...|This is a test1 t...|     true|
    |  2|test2 test2_0|tes...|                null|     null|
    |  3|                 Nan|                 Nan|     true|
    |  4|test4 test4_0|tes...|This is a test4 t...|     true|
    +---+--------------------+--------------------+---------+
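The same SQL expression can also be used with withColumn, which is closer to the call in the question. This is a sketch under assumptions rather than part of the original answer: the helpers rlike_expr and add_match_column are hypothetical names, and the column names are assumed to contain no backtick characters.

```python
def rlike_expr(text_col: str, pattern_col: str, anchor_end: bool = False) -> str:
    """Build a Spark SQL expression string that applies rlike per row.

    With anchor_end=True the pattern is wrapped as (pattern)$, as in the
    answer above; otherwise the pattern column is used unchanged.
    """
    pattern = f"`{pattern_col}`"
    if anchor_end:
        pattern = f"concat('(', {pattern}, ')$')"
    return f"`{text_col}` rlike {pattern}"


def add_match_column(df, text_col="text1", pattern_col="match",
                     out_col="match_str", anchor_end=False):
    """Hypothetical helper: add the rlike result as a new column."""
    # Imported here so rlike_expr can be inspected without a Spark install.
    from pyspark.sql.functions import expr

    return df.withColumn(out_col, expr(rlike_expr(text_col, pattern_col, anchor_end)))
```

For example, add_match_column(df, anchor_end=True).show(truncate=False) should evaluate the same expression as the select call above, while keeping every existing column without listing them.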

I modified your example slightly, because you are working with strings there, and the string "Nan" is equal to the other string "Nan":

    >>> df2 = sqlContext.createDataFrame([
    ...     (1, 'test1 test1_0|test1 test0', 'This is a test1 test1_0x'),
    ...     (2, 'test2 test2_0|test1 test0', None),
    ...     (3, 'NanA', 'Nan'),
    ...     (4, 'test4 test4_0|test1 test0', 'This is a test4 test4_0')
    ...    ], ['id', 'match', 'text1'])
    >>> df2.select("id", "match", "text1", expr("text1 rlike concat('(', match, ')$') as match_str")).show()
    +---+--------------------+--------------------+---------+
    | id|               match|               text1|match_str|
    +---+--------------------+--------------------+---------+
    |  1|test1 test1_0|tes...|This is a test1 t...|    false|
    |  2|test2 test2_0|tes...|                null|     null|
    |  3|                NanA|                 Nan|    false|
    |  4|test4 test4_0|tes...|This is a test4 t...|     true|
    +---+--------------------+--------------------+---------+
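The concat('(', match, ')$') wrapper is what makes rows 1 and 3 come out false here: it requires one of the alternatives to sit at the very end of text1. The sketch below reproduces that logic in plain Python, with re.search as an assumed stand-in for Spark's Java regex engine and ends_with_pattern as a hypothetical helper name. It also shows why the parentheses matter: without them, the $ would bind only to the last alternative.

```python
import re


def ends_with_pattern(text: str, pattern: str) -> bool:
    """Mimic text rlike concat('(', pattern, ')$') for non-null inputs."""
    anchored = "(" + pattern + ")$"
    return re.search(anchored, text) is not None


# Rows 1, 3 and 4 of df2 above (row 2 has a null text1 and is skipped).
rows = [
    (1, 'test1 test1_0|test1 test0', 'This is a test1 test1_0x'),
    (3, 'NanA', 'Nan'),
    (4, 'test4 test4_0|test1 test0', 'This is a test4 test4_0'),
]
results = {row_id: ends_with_pattern(text, pat) for row_id, pat, text in rows}

# Without the group, '$' applies only to 'test1 test0', so the first
# alternative can still match in the middle of the text.
ungrouped = re.search('test1 test1_0|test1 test0' + '$',
                      'This is a test1 test1_0x') is not None
```

If matching anywhere in the text is what you want, as the expected result in the question suggests, the wrapper can be dropped and the expression reduced to text1 rlike match.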
