I have a dataframe with the following schema:
root
|-- SOURCE: string (nullable = true)
|-- SYSTEM_NAME: string (nullable = true)
|-- BUCKET_NAME: string (nullable = true)
|-- LOCATION: string (nullable = true)
|-- FILE_NAME: string (nullable = true)
|-- LAST_MOD_DATE: string (nullable = true)
|-- FILE_SIZE: string (nullable = true)
example 1: prod/docs/Folder1/AA160039/Folder2/XXX.pdf
example 2: prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx
1. Extract the token of 2 uppercase letters followed by 6 digits that sits between slashes. Output is "AA160039" from example 1. This expression or mask will not change: it is always 2 letters followed by 6 digits.
2. Extract digits only if they are between slashes. Output is "355" from example 2. The number could be a single digit such as "8", double digits "55", triple "444", up to 5 digits "12345". As long as they are between slashes, they need to be extracted into the new column.
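A quick way to sanity-check the two masks before wiring them into Spark is plain Python `re` (Spark's `regexp_extract` uses Java regex, but these particular patterns behave the same in both engines). The paths are the two examples above:

```python
import re

# Example paths from the question
p1 = "prod/docs/Folder1/AA160039/Folder2/XXX.pdf"
p2 = "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"

# Mask 1: two uppercase letters followed by six digits, between slashes
mask1 = re.compile(r"/([A-Z]{2}[0-9]{6})/")

# Mask 2: 1 to 5 digits between slashes
mask2 = re.compile(r"/([0-9]{1,5})/")

print(mask1.search(p1).group(1))  # AA160039
print(mask2.search(p2).group(1))  # 355
```

Note that mask 2 does not false-positive on `p1`: the digits in "AA160039" are preceded by a letter, not a slash, so only "355" in `p2` matches.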
from pyspark.sql.functions import col, lit, regexp_extract, trim, when

df1 = df0.withColumn("LOCATION", trim(col("LOCATION")))
if LOCATION like '%/[A-Z]{2}[0-9]{6}/%'  -- extract the value into a new derived column
if LOCATION like '%/[0-9]{1,5}/%'        -- extract the value into a new derived column
df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
    .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
    .withColumn("LOCATION", trim(col("LOCATION")))\
    .withColumn("FOLDER_NUM",
                when(regexp_extract(col("FILE_NAME"), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""),
                     regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1)))
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|SOURCE|SYSTEM_NAME|BUCKET_NAME|            LOCATION|          FILE_NAME|      LAST_MOD_DATE|FILE_SIZE|FOLDER_NUM|
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|    s3|        xxx|    bucket1|production/Notifi...|AA120068_Letter.pdf|2020-07-20 15:51:21|    13124|          |
|    s3|        xxx|    bucket1|production/Notifi...|ZZ120093_Letter.pdf|2020-07-20 15:51:21|    61290|          |
|    s3|        xxx|    bucket1|production/Notifi...|XC120101_Letter.pdf|2020-07-20 15:51:21|    61700|          |
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
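Spark's `regexp_extract` returns an empty string when the pattern does not occur in the input, which is consistent with the blank FOLDER_NUM column above. One way to express the same logic without the `when`/`otherwise` branch is to try the strict mask on LOCATION first and fall back to the digits-only mask. This is a minimal sketch in plain Python `re` (the helper name `folder_num` is mine, not from the original code; the Spark equivalent would chain the same two patterns):

```python
import re

def folder_num(location: str) -> str:
    """Try the letters-plus-digits mask first, then fall back to
    1-5 digits between slashes; return "" when neither matches
    (mirroring Spark's regexp_extract, which returns "" on no match)."""
    location = location.strip()
    for pattern in (r"/([A-Z]{2}[0-9]{6})/", r"/([0-9]{1,5})/"):
        m = re.search(pattern, location)
        if m:
            return m.group(1)
    return ""

print(folder_num("prod/docs/Folder1/AA160039/Folder2/XXX.pdf"))              # AA160039
print(folder_num("prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"))  # 355
print(repr(folder_num("production/Notifications/misc")))                     # ''
```

The fallback order matters: trying the stricter mask first means a path containing both kinds of folder resolves to the letter-digit token.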
You're on the right track:
from pyspark.sql.functions import regexp_extract, trim
df = spark.createDataFrame([{"old_column": "ex@mple trimed"}], 'old_column string')
df.withColumn('new_column', regexp_extract(trim('old_column'), '(e.*@)', 1)).show()
This will trim the column and extract the pattern matched by group 1 of the regex expression.
But how do I get rid of these as well?
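For reference, the trim-then-extract behaviour of the snippet above can be reproduced outside Spark with Python's `re` module. The pattern `(e.*@)` is the one from the answer; since `.*` is greedy, group 1 runs from the first `e` up to the last `@` in the string (the surrounding whitespace in `raw` is my addition, to give `strip()` something to do):

```python
import re

raw = "  ex@mple trimed  "         # padded input; strip() plays the role of Spark's trim()
trimmed = raw.strip()

m = re.search(r"(e.*@)", trimmed)  # group 1, as in regexp_extract(..., 1)
print(m.group(1))                  # ex@
```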