
PySpark Feature Engineering -- Tokenizer

陶瀚玥
2023-12-01

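Tokenizer: a word tokenizer

Tokenization is the process of splitting text, such as a sentence, into individual words. Spark ML provides the Tokenizer class for this; it lowercases the input and splits it on whitespace. RegexTokenizer provides more advanced tokenization based on regular expression matching.

By default, RegexTokenizer works as follows:

The pattern parameter (default regular expression: "\s+") is used as the delimiter that splits the input text.

Alternatively, set the gaps parameter to false so that pattern describes the tokens themselves rather than the delimiters; the tokenization result is then every match found in the text.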
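To make the difference between the two modes concrete, here is a minimal pure-Python sketch of the behavior described above. It is only an approximation for illustration, not Spark code: the function `regex_tokenize` and its parameter names are hypothetical, and Python's `re` syntax differs from Java's regular expressions in some details.

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True,
                   min_token_length=1, to_lowercase=True):
    """Hypothetical helper that approximates RegexTokenizer's behavior."""
    if to_lowercase:
        text = text.lower()
    if gaps:
        # pattern describes the separators: split on every match
        tokens = re.split(pattern, text)
    else:
        # pattern describes the tokens: keep every match
        tokens = [m.group(0) for m in re.finditer(pattern, text)]
    # Drop tokens shorter than the minimum length (this also removes "")
    return [t for t in tokens if len(t) >= min_token_length]

# gaps=True (default): whitespace is the delimiter
print(regex_tokenize("Simple is better than complex"))
# gaps=False: each run of word characters is a token, punctuation is dropped
print(regex_tokenize("Flat is better, than nested!", pattern=r"\w+", gaps=False))
```

With gaps left at its default, punctuation stays attached to the neighboring word; switching to gaps=False with a pattern such as `\w+` is one way to discard it.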

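01. Import the module and create the SparkSession

from pyspark.sql import SparkSession

# Create a local SparkSession. The driver host below belongs to the
# original environment; adjust or drop it when running elsewhere.
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
    .config("spark.ui.showConsoleProgress","false")\
    .appName("Tokenizer").master("local[*]").getOrCreate()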

02. Use a few lines from the Zen of Python (import this) as the text

# One sentence per row, in a single string column named python_This
data = spark.createDataFrame([
    ("Beautiful is better than ugly",),
    ("Explicit is better than implicit",),
    ("Simple is better than complex",),
    ("Complex is better than complicated",),
    ("Flat is better than nested",),
    ("Sparse is better than dense",)
],["python_This"])
data.show()

Output:

+--------------------+
|         python_This|
+--------------------+
|Beautiful is bett...|
|Explicit is bette...|
|Simple is better ...|
|Complex is better...|
|Flat is better th...|
|Sparse is better ...|
+--------------------+

03. Transform the data with Tokenizer

from pyspark.ml.feature import Tokenizer

# Read sentences from python_This and write the token arrays to res
tokenizer = Tokenizer(inputCol="python_This",outputCol="res")
data = tokenizer.transform(data)
data.show()

Output:

+--------------------+--------------------+
|         python_This|                 res|
+--------------------+--------------------+
|Beautiful is bett...|[beautiful, is, b...|
|Explicit is bette...|[explicit, is, be...|
|Simple is better ...|[simple, is, bett...|
|Complex is better...|[complex, is, bet...|
|Flat is better th...|[flat, is, better...|
|Sparse is better ...|[sparse, is, bett...|
+--------------------+--------------------+

04. Look at one row in detail:

# head(1) returns a list containing the first Row
data.head(1)

Output:

[Row(python_This='Beautiful is better than ugly', 
res=['beautiful', 'is', 'better', 'than', 'ugly'])]

05. Inspect the schema of the data

data.printSchema()

Output:

root
 |-- python_This: string (nullable = true)
 |-- res: array (nullable = true)
 |    |-- element: string (containsNull = true)