问题：

将StringIndexer应用于PySpark数据帧中的多列

谷梁淇

2023-03-14

我有一个派斯帕克数据帧

+-------+--------------+----+----+
|address|          date|name|food|
+-------+--------------+----+----+
|1111111|20151122045510| Yin|gre |
|1111111|20151122045501| Yin|gre |
|1111111|20151122045500| Yln|gra |
|1111112|20151122065832| Yun|ddd |
|1111113|20160101003221| Yan|fdf |
|1111111|20160703045231| Yin|gre |
|1111114|20150419134543| Yin|fdf |
|1111115|20151123174302| Yen|ddd |
|2111115|      20123192| Yen|gre |
+-------+--------------+----+----+

我想将其转换为与 pyspark.ml 一起使用。我可以使用字符串索引器将名称列转换为数字类别：

indexer = StringIndexer(inputCol="name", outputCol="name_index").fit(df)
df_ind = indexer.transform(df)
df_ind.show()
+-------+--------------+----+----------+----+
|address|          date|name|name_index|food|
+-------+--------------+----+----------+----+
|1111111|20151122045510| Yin|       0.0|gre |
|1111111|20151122045501| Yin|       0.0|gre |
|1111111|20151122045500| Yln|       2.0|gra |
|1111112|20151122065832| Yun|       4.0|ddd |
|1111113|20160101003221| Yan|       3.0|fdf |
|1111111|20160703045231| Yin|       0.0|gre |
|1111114|20150419134543| Yin|       0.0|fdf |
|1111115|20151123174302| Yen|       1.0|ddd |
|2111115|      20123192| Yen|       1.0|gre |
+-------+--------------+----+----------+----+

如何用StringIndexer(例如< code>name和< code>food，每个列都有自己的< code>StringIndexer)转换几个列，然后用VectorAssembler生成一个特征向量？还是必须为每一列创建一个< code>StringIndexer？

**编辑**：这不是一个重复，因为我需要通过编程为具有不同列名的多个数据帧设置此选项。我不能使用向量索引器或矢量汇编程序，因为列不是数字。

**编辑2**:暂定解决方案是

indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df).transform(df) for column in df.columns ]

我现在创建一个包含三个数据帧的列表，每个数据帧都与原始数据帧和转换后的列相同。现在我需要加入以形成最终的数据帧，但这非常低效。

共有3个答案

乌灿

2023-03-14

我可以给你提供以下解决方案。对于大型数据集的这种转换，最好使用管道。它们也使你的代码更容易理解。如果需要，可以向管道中添加更多阶段。例如添加一个编码器。

#create a list of the columns that are string typed
categoricalColumns = [item[0] for item in df.dtypes if item[1].startswith('string') ]

#define a list of stages in your pipeline. The string indexer will be one stage
stages = []

#iterate through all categorical values
for categoricalCol in categoricalColumns:
    #create a string indexer for those categorical values and assign a new name including the word 'Index'
    stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + 'Index')

    #append the string Indexer to our list of stages
    stages += [stringIndexer]

#Create the pipeline. Assign the satges list to the pipeline key word stages
pipeline = Pipeline(stages = stages)
#fit the pipeline to our dataframe
pipelineModel = pipeline.fit(df)
#transform the dataframe
df= pipelineModel.transform(df)

请看一下我的推荐信

令狐翰

2023-03-14

使用 PySpark 3.0，现在这更加容易，您可以使用输入单和输出单选项：https://spark.apache.org/docs/latest/ml-features#stringindexer

class pyspark.ml.feature.StringIndexer(
    inputCol=..., 
    outputCol=..., 
    inputCols=..., 
    outputCols=..., 
    handleInvalid='error', 
    stringOrderType='frequencyDesc'
)

徐知

2023-03-14

我发现最好的方法是将几个< code>StringIndex组合在一个列表中，并使用< code >管道来执行它们:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df) for column in list(set(df.columns)-set(['date'])) ]


pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)

df_r.show()
+-------+--------------+----+----+----------+----------+-------------+
|address|          date|food|name|food_index|name_index|address_index|
+-------+--------------+----+----+----------+----------+-------------+
|1111111|20151122045510| gre| Yin|       0.0|       0.0|          0.0|
|1111111|20151122045501| gra| Yin|       2.0|       0.0|          0.0|
|1111111|20151122045500| gre| Yln|       0.0|       2.0|          0.0|
|1111112|20151122065832| gre| Yun|       0.0|       4.0|          3.0|
|1111113|20160101003221| gre| Yan|       0.0|       3.0|          1.0|
|1111111|20160703045231| gre| Yin|       0.0|       0.0|          0.0|
|1111114|20150419134543| gre| Yin|       0.0|       0.0|          5.0|
|1111115|20151123174302| ddd| Yen|       1.0|       1.0|          2.0|
|2111115|      20123192| ddd| Yen|       1.0|       1.0|          4.0|
+-------+--------------+----+----+----------+----------+-------------+

类似资料：

将函数应用于数据帧的列列表？

我从这个URL刮取了这个表： "https://www.patriotsoftware.com/blog/accounting/average-cost-living-by-state/" 看起来像这样：然后我编写了这个函数来帮助我将字符串转换成整数：当我只将函数应用于一列时，它就会工作。我在这里找到了关于在多个列上使用的答案：如何将函数应用于多个列但我下面的代码不起作用，也不会产生错误：
将函数应用于火花数据帧列

并将其应用于数据表的一列--这是我希望这样做的：我还没有找到任何简单的方法，正在努力找出如何做到这一点。一定有一个更简单的方法，比将数据rame转换为和RDD，然后从RDD中选择行来获得正确的字段，并将函数映射到所有的值，是吗？创建一个SQL表，然后用一个sparkSQL UDF来完成这个任务，这更简洁吗？
将函数按行应用于数据帧

我必须从二维坐标计算希尔伯特曲线上的距离。使用hilbertcurve包，我构建了自己的“hilbert”函数。坐标存储在数据帧（列1和列2）中。如您所见，我的函数在应用于两个值（test）时有效。然而，它只是不工作时，应用行明智通过应用函数！这是为什么呢？我到底做错了什么？我需要一个额外的列“希尔伯特”，希尔伯特距离在列“col_1”和“col_2”中给出。最后一个命令以错误结束：谢谢你的
将两个pyspark数据帧相乘

我有一个PySpark数据帧，df1，看起来像: 我有第二个PySpark数据帧，df2 我想将df1的所有列（我有两列以上）与客户ID上的df2连接值相乘
如何将函数应用于Pandas数据帧的两列

怎么办？ **添加详细示例如下***
将 pyspark 中的两个数据帧合并为一列

我有两个数据帧，我需要连接一列，如果id包含在第二个数据帧的同一列中，则只从第一个数据帧中获取行： df1：断续器：期望输出：我已经用df1.join(df2("id ")，" left ")试过了，但是给我错误:“Dataframe”对象是不可调用的。

将StringIndexer应用于PySpark数据帧中的多列

共有3个答案

相关问答

相关文章

相关阅读

相关工具

相关文档