(This post is a summary of my own study and work; any resemblance to other material is coincidental. Feel free to contact me and I will revise or delete it right away.)
(1) Clustering partitions a large unlabeled dataset into several groups according to the intrinsic similarity of the data;
(2) The data carry no class labels, so there is no training set and no training phase; clustering is unsupervised learning;
(3) A good clustering makes the data within a cluster as similar as possible and the data across clusters as dissimilar as possible (made concrete by the sketch below).
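To make the similarity goal in (3) concrete, here is a minimal plain-Scala sketch, with hypothetical 2-D points and centroids, of the quantity K-Means minimizes: the within-cluster sum of squared distances (WCSS).
// Each point is assigned to its nearest centroid; WCSS sums the squared
// distances from every point to its assigned centroid. K-Means searches
// for centroids that make this sum as small as possible.
val points    = Seq((1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (8.2, 7.9))
val centroids = Seq((1.25, 1.9), (8.1, 7.95))
def sqDist(a: (Double, Double), b: (Double, Double)): Double = {
  val dx = a._1 - b._1; val dy = a._2 - b._2
  dx * dx + dy * dy
}
val wcss = points.map(p => centroids.map(c => sqDist(p, c)).min).sum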
Typical application scenarios:
1. Clustering fish by measurements along several dimensions
2. Per-day, multi-dimensional clustering of log data from a multi-machine cluster
3. Clustering products by user preference
// Imports. This walkthrough assumes a Spark 2.x spark-shell (which auto-imports
// spark.implicits._ and the sql function); only the classes actually used below
// are kept. Note that ml.clustering.KMeans expects ml.linalg vectors, not the
// older mllib.linalg ones.
import org.apache.spark.sql.Row
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
// Load the data: each row of the Hive table becomes a (label, feature-vector) pair.
// The column is spelled "delicassen" consistently here (the original mixed
// "delicasen" and "delicassen"; adjust to whatever the table actually uses).
val labeledRdd = sql("select channel,region,fresh,milk,grocery,frozen,detergents_paper,delicassen from default.kmeansdemo")
  .rdd
  .map { case Row(channel: Double, region: Double, fresh: Double, milk: Double,
                  grocery: Double, frozen: Double, detergents_paper: Double, delicassen: Double) =>
    (1, Vectors.dense(channel, region, fresh, milk, grocery, frozen, detergents_paper, delicassen))
  }
val dataDF = sqlContext.createDataFrame(labeledRdd).toDF("label", "features").randomSplit(Array(0.8, 0.2))
// Split into training (80%) and test (20%) sets
val trainingDF = dataDF(0)
val testDF = dataDF(1)
// Build the model: k = 3 clusters with random initialization.
// (setInitSteps only applies to the "k-means||" init mode, so it is inert here;
// setMaxIter caps the iterations and setTol is the convergence tolerance.)
val method = new KMeans()
  .setK(3)
  .setInitMode("random")
  .setSeed(10)
  .setInitSteps(2)
  .setMaxIter(3)
  .setTol(3.0)
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
// Fit the model on the training data, then persist it
val model = method.fit(trainingDF)
model.write.overwrite().save("/model/KMeans_model")
// WCSS: within-cluster sum of squared errors on the held-out test data
val WCSS = model.computeCost(testDF)
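As a quick sanity check, the saved model can be loaded back and used to tag each test row with its assigned cluster; a minimal sketch, assuming the path from the save call above:
// Round trip: load the persisted model, then add a "prediction" column
// (the cluster index) to the test data and peek at a few rows.
val loadedModel = KMeansModel.load("/model/KMeans_model")
loadedModel.transform(testDF).select("features", "prediction").show(5)
// The learned cluster centers can also be inspected directly:
loadedModel.clusterCenters.foreach(println)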
// Save the metric to a Hive table via SQL
sql("drop table if exists KMeansReportTable")
sql("create table if not exists KMeansReportTable (WCSS double)")
sql(s"insert into KMeansReportTable select t.* from (select $WCSS) t")
Output:
// The query succeeds, but the error (WCSS) is fairly large; see the k-scan sketch after the table
sql("select * from kmeansreporttable").show()
+-------------------+
| wcss|
+-------------------+
|7.607988627877612E8|
+-------------------+
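A single WCSS value says little on its own; a common heuristic is to refit the model for several values of k and look for an "elbow" where the cost stops falling quickly. A minimal sketch reusing trainingDF and testDF from above (computeCost is deprecated from Spark 2.4 in favor of ClusteringEvaluator):
// Scan a few values of k and print the held-out cost; an "elbow" in these
// numbers is a common heuristic for choosing the number of clusters.
for (k <- 2 to 8) {
  val m = new KMeans().setK(k).setSeed(10).setFeaturesCol("features").fit(trainingDF)
  println(s"k=$k WCSS=${m.computeCost(testDF)}")
}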
Reference:
1. http://spark.apache.org/docs/latest/ml-clustering.html#k-means