This is a collection of Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, using the R language.
If you are interested in an introduction to basic Data Science Engineering concepts and applications, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R. Additionally, if you are interested in using Python with Spark, you can have a look at our pySpark notebooks.
For this series of notebooks, we have used Jupyter with the IRkernel R kernel. You can find installation instructions for your specific setup here. Also have a look at Andrie de Vries' post Using R with Jupyter Notebooks, which includes instructions for installing Jupyter and IRkernel together.
A good way of using these notebooks is by first cloning the repo, and then starting your Jupyter in pySpark mode. For example, if we have a standalone Spark installation running on our localhost with a maximum of 6GB per node assigned to IPython:
MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark
Notice that the path to the pyspark command will depend on your specific installation. So, as a requirement, you need to have Spark installed on the same machine where you are going to start the IPython notebook server.
For more Spark options see here. In general, the rule is that options described in the form spark.executor.memory are passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.
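The same options can also be passed programmatically when starting SparkR from a plain R session. A minimal sketch, assuming the same standalone cluster address and installation path as above (adjust SPARK_HOME to your own setup):

```r
# Point R at the Spark installation (path is an assumption)
Sys.setenv(SPARK_HOME = path.expand("~/spark-1.5.0-bin-hadoop2.6"))
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# spark.executor.memory is passed through sparkEnvir here, instead of
# the SPARK_EXECUTOR_MEMORY environment variable used on the CLI
sc <- sparkR.init(master = "spark://127.0.0.1:7077",
                  sparkEnvir = list(spark.executor.memory = "6g"))
```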
Every year, the US Census Bureau runs the American Community Survey. In this survey, approximately 3.5 million households are asked detailed questions about who they are and how they live. Many topics are covered, including ancestry, education, work, transportation, internet use, and residency. You can go directly to the source in order to know more about the data and get files for different years, longer periods, individual states, etc.
In any case, the starting-up notebook will download the 2013 data locally for later use with the rest of the notebooks.
The idea of using this dataset came from it being recently announced on Kaggle as part of their Kaggle scripts datasets. There you will be able to analyse the dataset on site, while sharing your results with other Kaggle users. Highly recommended!
Where we download our data locally and start up a SparkR cluster.
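A minimal sketch of what that first notebook does. The download URL below is an assumption based on the Census Bureau's PUMS file layout; the notebook itself pins the exact files it uses:

```r
# Download the 2013 ACS housing data locally (URL is an assumption;
# check census.gov for the exact PUMS file locations)
data_url  <- "http://www2.census.gov/acs2013_1yr/pums/csv_hus.zip"
local_zip <- "csv_hus.zip"
if (!file.exists(local_zip)) {
  download.file(data_url, local_zip)
  unzip(local_zip)
}

# Start the SparkR context used by the rest of the notebooks
library(SparkR)
sc <- sparkR.init(master = "spark://127.0.0.1:7077")
```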
About loading our data into SparkSQL data frames using SparkR.
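With the third-party spark-csv package available to the cluster, reading a CSV file into a SparkSQL DataFrame looks roughly like this (the file name and package version are assumptions):

```r
library(SparkR)

# spark-csv is pulled in as a Spark package; the version is an assumption
sc <- sparkR.init(master = "spark://127.0.0.1:7077",
                  sparkPackages = "com.databricks:spark-csv_2.10:1.2.0")
sqlContext <- sparkRSQL.init(sc)

# Read one of the unzipped ACS housing files into a distributed
# DataFrame, inferring column types from the header row
housing_df <- read.df(sqlContext, "ss13husa.csv",
                      source = "com.databricks.spark.csv",
                      header = "true", inferSchema = "true")
printSchema(housing_df)
```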
Different operations we can use with SparkR and DataFrame objects, such as data selection and filtering, aggregations, and sorting. The basis for exploratory data analysis and machine learning.
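A few representative operations in the SparkR 1.5 API, using the DataFrame from the previous sketch. VALP (property value) and ST (state code) are assumed ACS column names here:

```r
# Filter then project: households with a property value above $100k
expensive <- select(filter(housing_df, housing_df$VALP > 100000),
                    "ST", "VALP")

# Aggregate: count such households per state, sorted descending
per_state <- summarize(groupBy(expensive, expensive$ST),
                       count = n(expensive$ST))
head(arrange(per_state, desc(per_state$count)))
```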
How to explore different types of variables using SparkR and ggplot2 charts.
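The usual pattern is to aggregate on the cluster and collect only the small summary into local R for plotting. A sketch, again with assumed column names:

```r
library(ggplot2)

# Aggregate on the cluster, then collect the summary into a
# local data.frame small enough to plot
state_counts <- collect(summarize(groupBy(housing_df, housing_df$ST),
                                  count = n(housing_df$ST)))

# Chart the collected summary with ggplot2 as usual
ggplot(state_counts, aes(x = factor(ST), y = count)) +
  geom_bar(stat = "identity") +
  labs(x = "State code", y = "Household records")
```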
About linear models using SparkR, their uses, and current limitations in v1.5.
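SparkR 1.5 exposes a glm method over DataFrames, restricted to Gaussian and binomial families, which is part of the limitations discussed. A minimal sketch, with VALP and NP (household size) as assumed columns:

```r
# Fit a Gaussian linear model on the distributed DataFrame
# (VALP and NP are assumed ACS column names)
model <- glm(VALP ~ NP, data = housing_df, family = "gaussian")

# Inspect coefficients, then add predictions as a new column
summary(model)
predictions <- predict(model, housing_df)
head(select(predictions, "VALP", "prediction"))
```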
An Exploratory Data Analysis of the 2013 American Community Survey dataset, focusing on its geographical features.
Contributions are welcome! For bug reports or requests please submit an issue.
Feel free to contact me to discuss any issues, questions, or comments.
This repository contains a variety of content; some developed by Jose A. Dianes, and some from third-parties. The third-party content is distributed under the license provided by those parties.
The content developed by Jose A. Dianes is distributed under the following license:
Copyright 2016 Jose A Dianes
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.