
SparkML-note-Kmeans

司寇苗宣
2023-12-01

(This post summarizes my own study and work. If it resembles other material, that is coincidental; contact me and I will promptly revise or delete it.)

Implementing K-means with Spark ML

Question: What is the K-means algorithm? What is it good for? How is it used?

Answer

Analysis

Definition of clustering

(1) Clustering partitions a large, unlabeled dataset into groups according to the intrinsic similarity of the data;
(2) The data carry no class labels, so there is no training set and no training phase: clustering is unsupervised learning;
(3) A good clustering makes similarity within each cluster as high as possible, and similarity between clusters as low as possible.
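To make the definition concrete, here is a minimal sketch of one Lloyd-style K-means iteration in plain Scala: assign each point to its nearest center, then recompute each center as the mean of its members. The 1-D toy data and k = 2 are illustrative only; this is not Spark code.

```scala
// One K-means (Lloyd) iteration on 1-D points.
object KMeansStep {
  // Assignment step: index of the nearest center for each point.
  def assign(points: Seq[Double], centers: Seq[Double]): Seq[Int] =
    points.map(p => centers.indices.minBy(i => math.abs(p - centers(i))))

  // Update step: each center becomes the mean of its assigned points.
  def update(points: Seq[Double], labels: Seq[Int], k: Int): Seq[Double] =
    (0 until k).map { c =>
      val members = points.zip(labels).collect { case (p, l) if l == c => p }
      if (members.isEmpty) 0.0 else members.sum / members.size
    }

  def main(args: Array[String]): Unit = {
    val points  = Seq(1.0, 2.0, 10.0, 11.0)
    val centers = Seq(0.0, 12.0)
    val labels  = assign(points, centers)   // 0,0,1,1
    val updated = update(points, labels, 2) // 1.5, 10.5
    println(labels.mkString(","))
    println(updated.mkString(","))
  }
}
```

Iterating these two steps until the centers stop moving (or a maximum iteration count is hit) is exactly what Spark's KMeans does, distributed over partitions.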

Applications

1. Clustering fish by multiple measured dimensions
2. Clustering multi-dimensional daily log data from a cluster of machines
3. Clustering products by user preference

Code implementation

// Imports (only what this example actually uses; note that ml.clustering.KMeans
// requires ml.linalg vectors, not the older mllib.linalg ones)
import org.apache.spark.sql.Row
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
// Load the data and pack the feature columns into a dense vector
val labeledRdd = spark.sql("select channel, region, fresh, milk, grocery, frozen, detergents_paper, delicassen from default.kmeansdemo")
  .rdd
  .map {
    case Row(channel: Double, region: Double, fresh: Double, milk: Double,
             grocery: Double, frozen: Double, detergents_paper: Double, delicassen: Double) =>
      (1, Vectors.dense(channel, region, fresh, milk, grocery, frozen, detergents_paper, delicassen))
  }
// Split the data 80/20 into training and test sets
val Array(trainingDF, testDF) =
  spark.createDataFrame(labeledRdd).toDF("label", "features").randomSplit(Array(0.8, 0.2))
// Build the model: k = 3 clusters, random initialization
// (setInitSteps only matters for the default k-means|| init mode)
val method = new KMeans()
  .setK(3)
  .setInitMode("random")
  .setSeed(10)
  .setInitSteps(2)
  .setMaxIter(3)
  .setTol(3.0)
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
// Fit the model and persist it
val model = method.fit(trainingDF)
model.write.overwrite().save("/model/KMeans_model")
// Within-cluster sum of squared distances on the test set
val WCSS = model.computeCost(testDF)
// Save the result to a table via SQL
spark.sql("drop table if exists KMeansReportTable")
spark.sql("create table if not exists KMeansReportTable (WCSS double)")
spark.sql(s"insert into KMeansReportTable select t.* from (select $WCSS) t")
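The WCSS computed above is the within-cluster sum of squared distances: for each point, the squared distance to its nearest center, summed over all points. As a minimal sketch of what that number measures, here it is computed by hand in plain Scala over toy 2-D points (the points and centers are illustrative, not from the dataset above):

```scala
// WCSS by hand: sum over points of the squared distance to the nearest center.
object Wcss {
  def wcss(points: Seq[Seq[Double]], centers: Seq[Seq[Double]]): Double =
    points.map { p =>
      // Squared Euclidean distance to each center; keep the smallest.
      centers.map(c => p.zip(c).map { case (a, b) => (a - b) * (a - b) }.sum).min
    }.sum

  def main(args: Array[String]): Unit = {
    val pts = Seq(Seq(1.0, 1.0), Seq(2.0, 2.0), Seq(9.0, 9.0))
    val ctr = Seq(Seq(1.5, 1.5), Seq(9.0, 9.0))
    println(wcss(pts, ctr)) // 0.5 + 0.5 + 0.0 = 1.0
  }
}
```

Because WCSS sums raw squared distances, its magnitude depends on the scale of the features: large-valued, unscaled columns (as in this example) produce a large WCSS even for a reasonable clustering, which is why standardizing features before clustering is usually advisable.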

Output

// The query runs, but the error is large on these raw (unscaled) features
spark.sql("select * from KMeansReportTable").show()
+-------------------+
|               wcss|
+-------------------+
|7.607988627877612E8|
+-------------------+

References

1.http://spark.apache.org/docs/latest/ml-clustering.html#k-means
