Tokenizer: a word splitter
Tokenization is the process of splitting text, such as a sentence, into individual words. Spark ML provides Tokenizer for this, while RegexTokenizer offers more advanced splitting based on regular-expression matching.
By default:
the pattern parameter (default regex: "\s+") is used as the delimiter to split the input text.
Alternatively, setting the gaps parameter to false makes the regex pattern describe the tokens themselves rather than the delimiters, so the tokenization result consists of all matches found.
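For contrast, here is a minimal RegexTokenizer sketch (it assumes the SparkSession and the python_This column created in steps 01 and 02 below; the "\\w+" pattern is an illustrative choice, not part of the original example):
from pyspark.ml.feature import RegexTokenizer

# gaps=True (the default): pattern "\s+" is treated as the delimiter,
# so this behaves like the plain whitespace Tokenizer
gap_tokenizer = RegexTokenizer(inputCol="python_This", outputCol="res",
                               pattern="\\s+", gaps=True)

# gaps=False: pattern describes the tokens themselves; "\w+" keeps each
# maximal run of word characters as one token (punctuation is dropped)
word_tokenizer = RegexTokenizer(inputCol="python_This", outputCol="res",
                                pattern="\\w+", gaps=False)

# either transformer is applied exactly like Tokenizer in step 03:
# data = word_tokenizer.transform(data)
Note that RegexTokenizer also lowercases its output by default (toLowercase=True).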
01. Import the module and create a SparkSession
from pyspark.sql import SparkSession
# spark.driver.host is pinned to this machine's LAN address; adjust for your environment
spark = SparkSession.builder.config("spark.driver.host", "192.168.1.4") \
    .config("spark.ui.showConsoleProgress", "false") \
    .appName("Tokenizer").master("local[*]").getOrCreate()
02. Use a few lines of the Zen of Python as the text (import this)
# each tuple is one row; the trailing comma makes it a single-element tuple
data = spark.createDataFrame([
    ("Beautiful is better than ugly",),
    ("Explicit is better than implicit",),
    ("Simple is better than complex",),
    ("Complex is better than complicated",),
    ("Flat is better than nested",),
    ("Sparse is better than dense",)
], ["python_This"])
data.show()
Output:
+--------------------+
| python_This|
+--------------------+
|Beautiful is bett...|
|Explicit is bette...|
|Simple is better ...|
|Complex is better...|
|Flat is better th...|
|Sparse is better ...|
+--------------------+
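show() truncates cell contents longer than 20 characters by default; pass truncate=False to print the full sentences:
data.show(truncate=False)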
03. Tokenize the text with Tokenizer and transform the data
from pyspark.ml.feature import Tokenizer
# Tokenizer lowercases the input string and then splits it on whitespace
tokenizer = Tokenizer(inputCol="python_This", outputCol="res")
data = tokenizer.transform(data)
data.show()
Output:
+--------------------+--------------------+
| python_This| res|
+--------------------+--------------------+
|Beautiful is bett...|[beautiful, is, b...|
|Explicit is bette...|[explicit, is, be...|
|Simple is better ...|[simple, is, bett...|
|Complex is better...|[complex, is, bet...|
|Flat is better th...|[flat, is, better...|
|Sparse is better ...|[sparse, is, bett...|
+--------------------+--------------------+
04. Look at one row in detail:
data.head(1)
Output:
[Row(python_This='Beautiful is better than ugly',
res=['beautiful', 'is', 'better', 'than', 'ugly'])]
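As a quick check on the new array column, here is a small sketch that counts the tokens in each row (the count_tokens helper and the n_tokens column name are illustrative, not from the original post):
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# wrap len() as a UDF so it can run on the res array column
count_tokens = udf(lambda words: len(words), IntegerType())
data.withColumn("n_tokens", count_tokens(data["res"])).show()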
05. Inspect the schema of the data
data.printSchema()
Output:
root
|-- python_This: string (nullable = true)
|-- res: array (nullable = true)
| |-- element: string (containsNull = true)
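Because res is an array<string> column, individual tokens can also be addressed by index; a minimal sketch (the first_token alias is illustrative):
# getItem(0) pulls the first token out of each row's array
data.select(data["res"].getItem(0).alias("first_token")).show()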