This is a collection of Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, using the R language.
If you are interested in an introduction to basic Data Science Engineering concepts and applications, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R. Additionally, if you are interested in using Python with Spark, you can have a look at our pySpark notebooks.
For this series of notebooks, we have used Jupyter with the IRkernel R kernel. You can find installation instructions for your specific setup here. Also have a look at Andrie de Vries' post Using R with Jupyter Notebooks, which includes instructions for installing Jupyter and IRkernel together.
A good way of using these notebooks is by first cloning the repo, and then starting your Jupyter in pySpark mode. For example, if we have a standalone Spark installation running on our localhost with a maximum of 6GB per node assigned to IPython:
MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark
Notice that the path to the pyspark command will depend on your specific installation. So, as a requirement, you need to have Spark installed on the same machine where you are going to start the IPython notebook server.
For more Spark options see here. In general, the rule is that options described in the form spark.executor.memory are passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.
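The same options can also be passed programmatically when starting SparkR from a plain R session. A minimal sketch, assuming the same standalone cluster address and installation path as above (adjust SPARK_HOME to your own setup):

```r
# Point R at the Spark installation (path is an assumption)
Sys.setenv(SPARK_HOME = path.expand("~/spark-1.5.0-bin-hadoop2.6"))
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# spark.executor.memory is passed through sparkEnvir here, instead of
# the SPARK_EXECUTOR_MEMORY environment variable used on the CLI
sc <- sparkR.init(master = "spark://127.0.0.1:7077",
                  sparkEnvir = list(spark.executor.memory = "6g"))
```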
Every year, the US Census Bureau runs the American Community Survey. In this survey, approximately 3.5 million households are asked detailed questions about who they are and how they live. Many topics are covered, including ancestry, education, work, transportation, internet use, and residency. You can go directly to the source in order to know more about the data and get files for different years, longer periods, individual states, etc.
In any case, the starting-up notebook will download the 2013 data locally for later use with the rest of the notebooks.
The idea of using this dataset came from it being recently announced on Kaggle as part of their Kaggle scripts datasets. There you will be able to analyse the dataset on site, while sharing your results with other Kaggle users. Highly recommended!
Where we download our data locally and start up a SparkR cluster.
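A minimal sketch of what that first notebook does. The download URL below is an assumption based on the Census Bureau's PUMS file layout; the notebook itself pins the exact files it uses:

```r
# Download the 2013 ACS housing data locally (URL is an assumption;
# check census.gov for the exact PUMS file locations)
data_url  <- "http://www2.census.gov/acs2013_1yr/pums/csv_hus.zip"
local_zip <- "csv_hus.zip"
if (!file.exists(local_zip)) {
  download.file(data_url, local_zip)
  unzip(local_zip)
}

# Start the SparkR context used by the rest of the notebooks
library(SparkR)
sc <- sparkR.init(master = "spark://127.0.0.1:7077")
```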
About loading our data into SparkSQL data frames using SparkR.
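With the third-party spark-csv package available to the cluster, reading a CSV file into a SparkSQL DataFrame looks roughly like this (the file name and package version are assumptions):

```r
library(SparkR)

# spark-csv is pulled in as a Spark package; the version is an assumption
sc <- sparkR.init(master = "spark://127.0.0.1:7077",
                  sparkPackages = "com.databricks:spark-csv_2.10:1.2.0")
sqlContext <- sparkRSQL.init(sc)

# Read one of the unzipped ACS housing files into a distributed
# DataFrame, inferring column types from the header row
housing_df <- read.df(sqlContext, "ss13husa.csv",
                      source = "com.databricks.spark.csv",
                      header = "true", inferSchema = "true")
printSchema(housing_df)
```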
Different operations we can use with SparkR and DataFrame objects, such as data selection and filtering, aggregations, and sorting. The basis for exploratory data analysis and machine learning.
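A few representative operations in the SparkR 1.5 API, using the DataFrame from the previous sketch. VALP (property value) and ST (state code) are assumed ACS column names here:

```r
# Filter then project: households with a property value above $100k
expensive <- select(filter(housing_df, housing_df$VALP > 100000),
                    "ST", "VALP")

# Aggregate: count such households per state, sorted descending
per_state <- summarize(groupBy(expensive, expensive$ST),
                       count = n(expensive$ST))
head(arrange(per_state, desc(per_state$count)))
```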
How to explore different types of variables using SparkR and ggplot2 charts.
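The usual pattern is to aggregate on the cluster and collect only the small summary into local R for plotting. A sketch, again with assumed column names:

```r
library(ggplot2)

# Aggregate on the cluster, then collect the summary into a
# local data.frame small enough to plot
state_counts <- collect(summarize(groupBy(housing_df, housing_df$ST),
                                  count = n(housing_df$ST)))

# Chart the collected summary with ggplot2 as usual
ggplot(state_counts, aes(x = factor(ST), y = count)) +
  geom_bar(stat = "identity") +
  labs(x = "State code", y = "Household records")
```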
About linear models using SparkR, their uses, and current limitations in v1.5.
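SparkR 1.5 exposes a glm method over DataFrames, restricted to Gaussian and binomial families, which is part of the limitations discussed. A minimal sketch, with VALP and NP (household size) as assumed columns:

```r
# Fit a Gaussian linear model on the distributed DataFrame
# (VALP and NP are assumed ACS column names)
model <- glm(VALP ~ NP, data = housing_df, family = "gaussian")

# Inspect coefficients, then add predictions as a new column
summary(model)
predictions <- predict(model, housing_df)
head(select(predictions, "VALP", "prediction"))
```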
An Exploratory Data Analysis of the 2013 American Community Survey dataset, focusing on its geographical features.
Contributions are welcome! For bug reports or requests please submit an issue.
Feel free to contact me to discuss any issues, questions, or comments.
This repository contains a variety of content; some developed by Jose A. Dianes, and some from third-parties. The third-party content is distributed under the license provided by those parties.
The content developed by Jose A. Dianes is distributed under the following license:
Copyright 2016 Jose A Dianes
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.