I have a dataframe with the following schema:
root
|-- SOURCE: string (nullable = true)
|-- SYSTEM_NAME: string (nullable = true)
|-- BUCKET_NAME: string (nullable = true)
|-- LOCATION: string (nullable = true)
|-- FILE_NAME: string (nullable = true)
|-- LAST_MOD_DATE: string (nullable = true)
|-- FILE_SIZE: string (nullable = true)
example 1: prod/docs/Folder1/AA160039/Folder2/XXX.pdf
example 2: prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx
1. Extract the token of 2 uppercase letters followed by 6 digits that sits between slashes. Output is "AA160039" from example 1. This expression or mask will not change: it is always 2 letters followed by 6 digits.
2. Extract digits only if they are between slashes. Output is "355" from example 2. The number could be a single digit such as "8", double digits "55", triple "444", up to 5 digits "12345". As long as they are between slashes, they need to be extracted into the new column.
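A quick way to sanity-check the two masks before wiring them into Spark is plain Python `re` (Spark's `regexp_extract` uses Java regex, but these particular patterns behave the same in both engines). The paths are the two examples above:

```python
import re

# Example paths from the question
p1 = "prod/docs/Folder1/AA160039/Folder2/XXX.pdf"
p2 = "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"

# Mask 1: two uppercase letters followed by six digits, between slashes
mask1 = re.compile(r"/([A-Z]{2}[0-9]{6})/")

# Mask 2: 1 to 5 digits between slashes
mask2 = re.compile(r"/([0-9]{1,5})/")

print(mask1.search(p1).group(1))  # AA160039
print(mask2.search(p2).group(1))  # 355
```

Note that mask 2 does not false-positive on `p1`: the digits in "AA160039" are preceded by a letter, not a slash, so only "355" in `p2` matches.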
from pyspark.sql.functions import col, lit, regexp_extract, trim, when

df1 = df0.withColumn("LOCATION", trim(col("LOCATION")))
if LOCATION like '%/[A-Z]{2}[0-9]{6}/%'  -- extract the value into a new derived column
if LOCATION like '%/[0-9]{1,5}/%'        -- extract the value into a new derived column
df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
    .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
    .withColumn("LOCATION", trim(col("LOCATION")))\
    .withColumn("FOLDER_NUM",
                when(regexp_extract(col("FILE_NAME"), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""),
                     regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1)))
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|SOURCE|SYSTEM_NAME|BUCKET_NAME|            LOCATION|          FILE_NAME|      LAST_MOD_DATE|FILE_SIZE|FOLDER_NUM|
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|    s3|        xxx|    bucket1|production/Notifi...|AA120068_Letter.pdf|2020-07-20 15:51:21|    13124|          |
|    s3|        xxx|    bucket1|production/Notifi...|ZZ120093_Letter.pdf|2020-07-20 15:51:21|    61290|          |
|    s3|        xxx|    bucket1|production/Notifi...|XC120101_Letter.pdf|2020-07-20 15:51:21|    61700|          |
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
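Spark's `regexp_extract` returns an empty string when the pattern does not occur in the input, which is consistent with the blank FOLDER_NUM column above. One way to express the same logic without the `when`/`otherwise` branch is to try the strict mask on LOCATION first and fall back to the digits-only mask. This is a minimal sketch in plain Python `re` (the helper name `folder_num` is mine, not from the original code; the Spark equivalent would chain the same two patterns):

```python
import re

def folder_num(location: str) -> str:
    """Try the letters-plus-digits mask first, then fall back to
    1-5 digits between slashes; return "" when neither matches
    (mirroring Spark's regexp_extract, which returns "" on no match)."""
    location = location.strip()
    for pattern in (r"/([A-Z]{2}[0-9]{6})/", r"/([0-9]{1,5})/"):
        m = re.search(pattern, location)
        if m:
            return m.group(1)
    return ""

print(folder_num("prod/docs/Folder1/AA160039/Folder2/XXX.pdf"))              # AA160039
print(folder_num("prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"))  # 355
print(repr(folder_num("production/Notifications/misc")))                     # ''
```

The fallback order matters: trying the stricter mask first means a path containing both kinds of folder resolves to the letter-digit token.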
You're on the right track:
from pyspark.sql.functions import regexp_extract, trim
df = spark.createDataFrame([{"old_column": "ex@mple trimed"}], 'old_column string')
df.withColumn('new_column', regexp_extract(trim('old_column'), '(e.*@)', 1)).show()
This will trim the column and extract the pattern matched by group 1 of the regex expression.
But how do I get rid of these as well?
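For reference, the trim-then-extract behaviour of the snippet above can be reproduced outside Spark with Python's `re` module. The pattern `(e.*@)` is the one from the answer; since `.*` is greedy, group 1 runs from the first `e` up to the last `@` in the string (the surrounding whitespace in `raw` is my addition, to give `strip()` something to do):

```python
import re

raw = "  ex@mple trimed  "         # padded input; strip() plays the role of Spark's trim()
trimmed = raw.strip()

m = re.search(r"(e.*@)", trimmed)  # group 1, as in regexp_extract(..., 1)
print(m.group(1))                  # ex@
```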