
SparkR Notebooks

This is a collection of Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the R language.

If you are interested in being introduced to some basic Data Science Engineering concepts and applications, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R. Additionally, if you are interested in using Python with Spark, you can have a look at our pySpark notebooks.

Instructions

For this series of notebooks, we have used Jupyter with the IRkernel R kernel. You can find installation instructions for your specific setup here. Also have a look at Andrie de Vries' post Using R with Jupyter Notebooks, which includes instructions for installing Jupyter and IRkernel together.
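
As a rough guide, the following is a minimal sketch of an IRkernel installation as typically done in the Spark 1.5 era; the dependency list and GitHub source are assumptions based on the IRkernel documentation of the time, and Jupyter itself must already be installed:

install.packages(c("repr", "IRdisplay", "evaluate", "crayon", "pbdZMQ", "devtools"))  # IRkernel dependencies
devtools::install_github("IRkernel/IRkernel")  # IRkernel was installed from GitHub at the time
IRkernel::installspec()  # register the R kernel with Jupyter for the current user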

A good way of using these notebooks is by first cloning the repo, and then starting your Jupyter in pySpark mode. For example, if we have a standalone Spark installation running on localhost with a maximum of 6GB per node assigned to IPython:

MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark

Notice that the path to the pyspark command will depend on your specific installation. As a requirement, you need to have Spark installed on the same machine where you are going to start the IPython notebook server.

For more Spark options see here. In general, the rule is to pass options described in the form spark.executor.memory as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.

Datasets

2013 American Community Survey dataset.

Every year, the US Census Bureau runs the American Community Survey. In this survey, approximately 3.5 million households are asked detailed questions about who they are and how they live. Many topics are covered, including ancestry, education, work, transportation, internet use, and residency. You can go directly to the source in order to learn more about the data and get files for different years, longer periods, individual states, etc.

In any case, the starting-up notebook will download the 2013 data locally for later use with the rest of the notebooks.
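
For reference, a hypothetical sketch of the kind of download step that notebook performs is shown below; the URL and destination path are illustrative placeholders, not the notebook's actual links:

housing_url <- "http://www2.census.gov/acs2013_1yr/pums/csv_hus.zip"  # illustrative URL, not the notebook's actual link
download.file(housing_url, destfile = "csv_hus.zip")  # fetch the 2013 housing records
unzip("csv_hus.zip", exdir = "/tmp/acs2013")  # illustrative destination path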

The idea of using this dataset came from it recently being announced on Kaggle as part of their Kaggle scripts datasets. There you will be able to analyse the dataset on site and share your results with other Kaggle users. Highly recommended!

Notebooks

Downloading data and starting with SparkR

Where we download our data locally and start up a SparkR cluster.
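
In case it helps, here is a minimal sketch of the kind of initialization such a notebook performs with the Spark 1.5 SparkR API; the Spark home path and master URL are assumptions matching the example command above:

Sys.setenv(SPARK_HOME = "~/spark-1.5.0-bin-hadoop2.6")  # assumed install path, as in the command above
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))  # make the bundled SparkR package visible
library(SparkR)
sc <- sparkR.init(master = "spark://127.0.0.1:7077",
                  sparkEnvir = list(spark.executor.memory = "6g"))  # Spark 1.5 entry point
sqlContext <- sparkRSQL.init(sc)  # SQL context used by the SparkSQL notebooks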

SparkSQL basics with SparkR

About loading our data into SparkSQL data frames using SparkR.
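
As an illustration of what loading looks like with the Spark 1.5 API (the file path and options are assumptions; CSV reading came from the third-party spark-csv package at the time):

housing_df <- read.df(sqlContext, "/tmp/acs2013/ss13husa.csv",  # illustrative path
                      source = "com.databricks.spark.csv",
                      header = "true", inferSchema = "true")
printSchema(housing_df)  # inspect the inferred schema
registerTempTable(housing_df, "housing")  # expose the data frame to SQL queries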

Data frame operations with SparkSQL and SparkR

Different operations we can use with SparkR and DataFrame objects, such as data selection and filtering, aggregations, and sorting. The basis for exploratory data analysis and machine learning.
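
A minimal sketch of these operations with the Spark 1.5 SparkR API, assuming the housing_df data frame from the previous sketch (column names such as ST and VALP are illustrative ACS fields):

sub_df <- select(housing_df, "ST", "VALP")  # column selection
expensive <- filter(sub_df, sub_df$VALP > 500000)  # row filtering
by_state <- summarize(groupBy(expensive, expensive$ST),
                      n = n(expensive$ST))  # aggregation per state
head(arrange(by_state, desc(by_state$n)))  # sort descending and peek at the top rows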

Exploratory Data Analysis with SparkR and ggplot2

How to explore different types of variables using SparkR and ggplot2 charts.
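
The general pattern, sketched below under the same assumptions as above, is to aggregate on the cluster and collect only the small result locally for plotting:

library(ggplot2)
avg_by_state <- collect(summarize(groupBy(housing_df, housing_df$ST),
                                  avg_value = avg(housing_df$VALP)))  # small aggregate, safe to collect
ggplot(avg_by_state, aes(x = factor(ST), y = avg_value)) +
    geom_bar(stat = "identity")  # one bar per state code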

Linear Models with SparkR

About linear models using SparkR, its uses and current limitations in v1.5.
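
For instance, a minimal sketch of SparkR's glm in v1.5, which supported only the gaussian and binomial families (the predictor columns here are illustrative ACS fields):

model <- glm(VALP ~ RMSP + ACR, data = housing_df, family = "gaussian")  # fit on the cluster
summary(model)  # v1.5 exposes little beyond the coefficients
predictions <- predict(model, newData = housing_df)  # adds a 'prediction' column
head(select(predictions, "VALP", "prediction"))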

Applications

Exploring geographical data with SparkR and ggplot2

An Exploratory Data Analysis of the 2013 American Community Survey dataset, focusing on its geographical features.
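
As a sketch of the approach, one can aggregate geographically on the cluster, collect the result, and draw a map locally with ggplot2 (the state_stats data frame and its columns are hypothetical; map_data requires the maps package):

library(ggplot2)
states_map <- map_data("state")  # US state polygons, from the 'maps' package
# state_stats: hypothetical local data.frame with 'region' (state name) and 'avg_value' columns
ggplot(state_stats, aes(map_id = region)) +
    geom_map(aes(fill = avg_value), map = states_map) +  # choropleth fill per state
    expand_limits(x = states_map$long, y = states_map$lat)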

Contributing

Contributions are welcome! For bug reports or requests please submit an issue.

Contact

Feel free to contact me to discuss any issues, questions, or comments.

License

This repository contains a variety of content; some developed by Jose A. Dianes, and some from third-parties. The third-party content is distributed under the license provided by those parties.

The content developed by Jose A. Dianes is distributed under the following license:

Copyright 2016 Jose A Dianes

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.