前言:要学习spark程序开发,建议先学习spark-shell交互式学习,加深对spark程序开发的理解。spark-shell提供了一种学习API的简单方式,以及一个能够进行交互式分析数据的强大工具,可以使用scala编写(scala运行与Java虚拟机可以使用现有的Java库)或使用Python编写。
1.启动spark-shell
spark-shell的本质是在后台调用了spark-submit脚本来启动应用程序的,在spark-shell中已经创建了一个名为sc的SparkContext对象,在4个CPU核运行spark-shell命令如下:
spark-shell --master local[4]
如果指定Jar包路径,则命令如下:
spark-shell --master local[4] --jars xxx.jar,yyy,jar
--master用来设置context将要连接并使用的资源主节点,master的值是standalone模式中spark的集群地址、yarn或mesos集群的URL,或是一个local地址
--jars可以添加需要用到的jar包,通过逗号分隔来添加多个包。
2.加载text文件
spark创建sc后,可以加载本地文件创建RDD,这里测试是加载spark自带的本地文件README.md,返回一个MapPartitionsRDD文件。
scala> val textFile = sc.textFile("file:///opt/cloud/spark-2.1.1-bin-hadoop2.7/README.md");
textFile: org.apache.spark.rdd.RDD[String] = file:///opt/cloud/spark-2.1.1-bin-hadoop2.7/README.md MapPartitionsRDD[9] at textFile at <console>:24
加载HDFS文件和本地文件都是使用textFile,区别是添加前缀(hdfs://和file://)进行标识,从本地读取文件直接返回MapPartitionsRDD,而从HDFS读取的文件是先转成HadoopRDD,然后隐试转换成MapPartitionsRDD。想了解MapPartitions可以看这篇MapPartition和Map的区别。
3.简单RDD操作
对于RDD可以执行Transformation返回新的RDD,也可以执行Action得到返回结果。first命令返回文件第一行,count命令返回文件所有行数。
scala> textFile.first(); res6: String = # Apache Spark scala> textFile.count(); res7: Long = 104
接下来进行transformation操作,使用filter命令从README.md文件中抽取出一个子集,返回一个新的FilteredRDD。
scala> val textFilter = textFile.filter(line=>line.contains("Spark")); textFilter: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[16] at filter at <console>:26
链接多个Transformation和Action,计算包括"Spark"字符串的行数。
scala> textFile.filter(line=>line.contains("Spark")).count(); res10: Long = 20
4.RDD应用的简单操作
(1)计算文本中单词最多的一行的单词数
scala> textFile.map(line =>line.split(" ").size).reduce((a,b) => if (a > b) a else b); res11: Int = 22
先将每一行的单词使用空格进行拆分,并统计每一行的单词数,创建一个基于单词数的新RDD,然后对该RDD进行Reduce操作返回最大值。
(2)统计单词
词频统计WordCount是大数据处理最流行的入门程序之一,Spark可以很容易实现WordCount操作。
//这个过程返回的是一个(string,int)类型的键值对ShuffledRDD(y执行reduceByKey的时候需要进行Shuffle操作,返回的是一个Shuffle形式的RDD),最后用Collect聚合统计结果
scala> val wordCount = textFile.flatMap(line =>line.split(" ")).map(x => (x,1)).reduceByKey((a,b) => a+b); wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[23] at reduceByKey at <console>:26 scala> wordCount.collect [Stage 7:> (0 + 0)
[Stage 7:> (0 + 2)
res12: Array[(String, Int)] = Array((package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (page](http://spark.apache.org/documentation.html).,1), (cluster.,1), (its,1), ([run,1), (general,3), (have,1), (pre-built,1), (YARN,,1), ([http://spark.apache.org/developer-tools.html](the,1), (changed,1), (locally,2), (sc.parallelize(1,1), (only,1), (locally.,1), (several,1), (This,2), (basic,1), (Configuration,1), (learning,,1), (documentation,3), (first,1), (graph,1), (Hive,2), (info,1), (["Specifying,1), ("yarn",1), ([params]`.,1), ([project,1), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (engine,1), (version,1), (file,1), (documentation,,1), (MASTER,1), (example,3), (["Parallel,1), (ar...
//这里使用了占位符_,使表达式更为简洁,是Scala语音的特色,每个_代表一个参数。
scala> val wordCount2 = textFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_); wordCount2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[26] at reduceByKey at <console>:26 scala> wordCount2.collect res14: Array[(String, Int)] = Array((package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (page](http://spark.apache.org/documentation.html).,1), (cluster.,1), (its,1), ([run,1), (general,3), (have,1), (pre-built,1), (YARN,,1), ([http://spark.apache.org/developer-tools.html](the,1), (changed,1), (locally,2), (sc.parallelize(1,1), (only,1), (locally.,1), (several,1), (This,2), (basic,1), (Configuration,1), (learning,,1), (documentation,3), (first,1), (graph,1), (Hive,2), (info,1), (["Specifying,1), ("yarn",1), ([params]`.,1), ([project,1), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (engine,1), (version,1), (file,1), (documentation,,1), (MASTER,1), (example,3), (["Parallel,1), (ar...
//Spark默认不进行排序,如有需要排序输出,排序的时候将key和value互换,使用sortByKey方法指定升序(true)和降序(false)
scala> val wordCount3 = textFile.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)); wordCount3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[34] at map at <console>:26 scala> wordCount3.collect res15: Array[(String, Int)] = Array(("",71), (the,24), (to,17), (Spark,16), (for,12), (##,9), (and,9), (a,8), (can,7), (run,7), (on,7), (is,6), (in,6), (using,5), (of,5), (build,4), (Please,4), (with,4), (also,4), (if,4), (including,4), (an,4), (You,4), (you,4), (general,3), (documentation,3), (example,3), (how,3), (one,3), (For,3), (use,3), (or,3), (see,3), (Hadoop,3), (Python,2), (locally,2), (This,2), (Hive,2), (SparkPi,2), (refer,2), (Interactive,2), (Scala,2), (detailed,2), (return,2), (Shell,2), (class,2), (Python,,2), (set,2), (building,2), (SQL,2), (guidance,2), (cluster,2), (shell:,2), (supports,2), (particular,2), (following,2), (which,2), (should,2), (To,2), (be,2), (do,2), (./bin/run-example,2), (It,2), (1000:,2), (tests,2), (examples,2), (at,2), (`examples`,2), (that,2), (H...
5.RDD缓存使用RDD的cache()方法