Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.

Features include:

- Automatic SparkContext (`sc`) and HiveContext (`sqlContext`) creation
- Easy execution of SparkSQL queries with the `%%sql` magic
- Easy access to Spark application information and logs (`%%info` magic)

There are two ways to use sparkmagic. Head over to the examples section for a demonstration of how to use both modes of execution.
The sparkmagic library provides a `%%spark` magic that you can use to easily run code against a remote Spark cluster from a normal IPython notebook. See the Spark Magics on IPython sample notebook.
The sparkmagic library also provides a set of Scala and Python kernels that allow you to automatically connect to a remote Spark cluster, run code and SQL queries, manage your Livy server and Spark job configuration, and generate automatic visualizations. See the Pyspark and Spark sample notebooks.
See the Sending Local Data to Spark notebook.
1. Install the library:

   ```
   pip install sparkmagic
   ```

2. Make sure that ipywidgets is properly installed by running:

   ```
   jupyter nbextension enable --py --sys-prefix widgetsnbextension
   ```

   If you're using JupyterLab, you'll need to run another command:

   ```
   jupyter labextension install "@jupyter-widgets/jupyterlab-manager"
   ```

3. (Optional) Install the wrapper kernels. Run `pip show sparkmagic` to see the path where sparkmagic is installed, `cd` to that location, and run:

   ```
   jupyter-kernelspec install sparkmagic/kernels/sparkkernel
   jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
   jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
   ```

4. (Optional) Modify the configuration file at `~/.sparkmagic/config.json`. Look at the example_config.json.

5. (Optional) Enable the server extension so that clusters can be programmatically changed:

   ```
   jupyter serverextension enable --py sparkmagic
   ```
Sparkmagic supports:

- No auth
- Basic authentication
- Kerberos

The Authenticator is the mechanism for authenticating to Livy. The base Authenticator, used by itself, supports no auth, but it can be subclassed to enable authentication via other methods. Two such examples are the Basic and Kerberos Authenticators.
Kerberos support is implemented via the `requests-kerberos` package. Sparkmagic expects a Kerberos ticket to be available on the system; `requests-kerberos` will pick up the ticket from a cache file. For the ticket to be available, the user needs to have run `kinit` to create it.
By default, the `HTTPKerberosAuth` constructor provided by the `requests-kerberos` package will use the following configuration:

```
HTTPKerberosAuth(mutual_authentication=REQUIRED)
```

However, this will not be the right configuration for every context, so custom arguments for this constructor can be passed using the following configuration in `~/.sparkmagic/config.json`:
```json
{
  "kerberos_auth_configuration": {
    "mutual_authentication": 1,
    "service": "HTTP",
    "delegate": false,
    "force_preemptive": false,
    "principal": "principal",
    "hostname_override": "hostname_override",
    "sanitize_mutual_error_response": true,
    "send_cbt": true
  }
}
```
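As a minimal sketch of how such a configuration maps onto constructor keyword arguments, the snippet below loads the JSON and splats the `kerberos_auth_configuration` dictionary into a constructor. A stand-in class is used here so the example is self-contained; the real target is `requests_kerberos.HTTPKerberosAuth`.

```python
import json

class FakeKerberosAuth:
    """Stand-in for requests_kerberos.HTTPKerberosAuth, used only to
    illustrate how config keys become constructor keyword arguments."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs

config_text = """
{
  "kerberos_auth_configuration": {
    "mutual_authentication": 1,
    "service": "HTTP",
    "delegate": false
  }
}
"""

config = json.loads(config_text)
kerberos_kwargs = config.get("kerberos_auth_configuration", {})
# Each JSON key becomes a keyword argument to the auth constructor.
auth = FakeKerberosAuth(**kerberos_kwargs)
print(auth.kwargs["service"])  # -> HTTP
```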
You can write custom Authenticator subclasses to enable authentication via other mechanisms. All Authenticator subclasses should override the `Authenticator.__call__(request)` method, which attaches HTTP authentication to the given Request object.

Authenticator subclasses that add additional class attributes to be used for authentication, such as the [Basic](sparkmagic/sparkmagic/auth/basic.py) authenticator, which adds `username` and `password` attributes, should override the `__hash__`, `__eq__`, `update_with_widget_values`, and `get_widgets` methods to work with these new attributes. This is necessary in order for the Authenticator to use these attributes in the authentication process.
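The shape of such a subclass can be sketched as follows. This is an illustrative example only: a stand-in base class and a hypothetical token-based authenticator are defined here so the snippet is self-contained, and the widget-related overrides (`update_with_widget_values`, `get_widgets`) are omitted for brevity; in real code you would subclass sparkmagic's own `Authenticator`.

```python
class Authenticator:
    """Stand-in for sparkmagic's base Authenticator class."""
    def __call__(self, request):
        return request

class TokenAuthenticator(Authenticator):
    """Hypothetical authenticator that attaches a bearer-token header."""

    def __init__(self, token=""):
        self.token = token

    def __call__(self, request):
        # Attach HTTP authentication to the outgoing Request object.
        request.headers["Authorization"] = "Bearer " + self.token
        return request

    # Extra attributes require overriding equality and hashing so that
    # endpoints using this authenticator can be compared correctly.
    def __eq__(self, other):
        return isinstance(other, TokenAuthenticator) and self.token == other.token

    def __hash__(self):
        return hash((type(self).__name__, self.token))

class FakeRequest:
    """Minimal request object with a headers dict, for demonstration."""
    def __init__(self):
        self.headers = {}

req = TokenAuthenticator("s3cret")(FakeRequest())
print(req.headers["Authorization"])  # -> Bearer s3cret
```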
If your repository layout is:

```
.
├── LICENSE
├── README.md
├── customauthenticator
│   ├── __init__.py
│   ├── customauthenticator.py
└── setup.py
```
Then to pip install from this repository, run:

```
pip install git+https://git_repo_url/#egg=customauthenticator
```
After installing, you need to register the custom authenticator with Sparkmagic so it can be dynamically imported. This can be done in two different ways:
Edit the configuration file at `~/.sparkmagic/config.json` with the following settings:

```json
{
  "authenticators": {
    "Kerberos": "sparkmagic.auth.kerberos.Kerberos",
    "None": "sparkmagic.auth.customauth.Authenticator",
    "Basic_Access": "sparkmagic.auth.basic.Basic",
    "Custom_Auth": "customauthenticator.customauthenticator.CustomAuthenticator"
  }
}
```
This adds your `CustomAuthenticator` class in `customauthenticator.py` to Sparkmagic. `Custom_Auth` is the authentication type that will be displayed in the `%manage_spark` widget's Auth type dropdown, as well as the Auth type passed as an argument to the `-t` flag in the `%spark add session` magic.
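For example, a session using the custom authenticator could then be added with something like the following (the session name and Livy endpoint URL here are placeholders, and the exact set of flags your setup needs may vary):

```
%spark add -s my_session -l python -u https://my-livy-endpoint:8998 -t Custom_Auth
```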
Alternatively, modify the `authenticators` method in `sparkmagic/utils/configuration.py` to return your custom authenticator:

```python
def authenticators():
    return {
        u"Kerberos": u"sparkmagic.auth.kerberos.Kerberos",
        u"None": u"sparkmagic.auth.customauth.Authenticator",
        u"Basic_Access": u"sparkmagic.auth.basic.Basic",
        u"Custom_Auth": u"customauthenticator.customauthenticator.CustomAuthenticator"
    }
```
If you want Papermill rendering to stop on a Spark error, edit `~/.sparkmagic/config.json` with the following settings:

```json
{
  "shutdown_session_on_spark_statement_errors": true,
  "all_errors_are_fatal": true
}
```
If you want any registered Livy sessions to be cleaned up on exit, regardless of whether the process exits gracefully or not, you can set:

```json
{
  "cleanup_all_sessions_on_exit": true,
  "all_errors_are_fatal": true
}
```
In addition to the conf at `~/.sparkmagic/config.json`, sparkmagic conf can be overridden programmatically in a notebook. For example:

```python
import sparkmagic.utils.configuration as conf
conf.override('cleanup_all_sessions_on_exit', True)
```

Same thing, but referencing the conf member:

```python
conf.override(conf.cleanup_all_sessions_on_exit.__name__, True)
```
NOTE: the override for `cleanup_all_sessions_on_exit` must be set before initializing sparkmagic, i.e. before this:

```
%load_ext sparkmagic.magics
```
The included `docker-compose.yml` file will let you spin up a full sparkmagic stack that includes a Jupyter notebook with the appropriate extensions installed, and a Livy server backed by a local-mode Spark instance. (This is just for testing and developing sparkmagic itself; in reality, sparkmagic is not very useful if your Spark instance is on the same machine!)

In order to use it, make sure you have Docker and Docker Compose both installed, and then simply run:

```
docker-compose build
docker-compose up
```

You will then be able to access the Jupyter notebook in your browser at http://localhost:8888. Inside this notebook, you can configure a sparkmagic endpoint at http://spark:8998. This endpoint is able to launch both Scala and Python sessions. You can also choose to start a wrapper kernel for Scala, Python, or R from the list of kernels.
To shut down the containers, you can interrupt `docker-compose` with `Ctrl-C`, and optionally remove the containers with `docker-compose down`.
If you are developing sparkmagic and want to test out your changes in the Docker container without needing to push a version to PyPI, you can set the `dev_mode` build arg in `docker-compose.yml` to `true`, and then re-build the container. This will cause the container to install your local version of autovizwidget, hdijupyterutils, and sparkmagic. Make sure to re-run `docker-compose build` before each test run.
`/reconnectsparkmagic`:

- `POST`: Allows you to specify Spark cluster connection information for a notebook by passing in the notebook path and cluster information. The kernel will be started/restarted and connected to the specified cluster.

Request body example:

```json
{
  "path": "path.ipynb",
  "username": "username",
  "password": "password",
  "endpoint": "url",
  "auth": "Kerberos",
  "kernelname": "pysparkkernel"
}
```

Note that the auth can be either None, Basic_Access, or Kerberos, based on the authentication enabled in Livy. The kernelname parameter is optional; it defaults to the one specified in the config file, or to `pysparkkernel` if not in the config file.

Returns `200` if successful; `400` if the body is not a JSON string or a key is not found; `500` if an error is encountered changing clusters.

Reply body example:

```json
{
  "success": true,
  "error": null
}
```
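A call to this endpoint can be sketched with the standard library as follows. This builds (but does not send) the POST request; the Jupyter host/port and the credential values are placeholders.

```python
import json
import urllib.request

# Placeholder connection details; substitute your own notebook server
# address and Livy credentials.
payload = {
    "path": "path.ipynb",
    "username": "username",
    "password": "password",
    "endpoint": "url",
    "auth": "Kerberos",
    "kernelname": "pysparkkernel",
}

req = urllib.request.Request(
    "http://localhost:8888/reconnectsparkmagic",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would return a JSON reply body such as
# {"success": true, "error": null} on a 200 response.
print(req.get_method())  # -> POST
```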
Sparkmagic uses Livy, a REST server for Spark, to remotely execute all user code. The library then automatically collects the output of your code as plain text or a JSON document, displaying the results to you as formatted text or as a Pandas dataframe, as appropriate.
This architecture offers us some important advantages:
- Run Spark code completely remotely; no Spark components need to be installed on the Jupyter server
- Multi-language support; the Python, Python3, Scala, and R kernels are equally feature-rich, and adding support for more languages will be easy
- Support for multiple endpoints; you can use a single notebook to start multiple Spark jobs in different languages and against different remote clusters
- Easy integration with any Python library for data science or visualization, like Pandas or Plotly
However, there are some important limitations to note:
- Some overhead is added by sending all code and output through Livy
- Since all code is run on a remote driver through Livy, all structured data must be serialized to JSON and parsed by the Sparkmagic library so that it can be manipulated and visualized on the client side. In practice, this means that you must use Python for client-side data manipulation in `%%local` mode.
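As an illustration of this workflow (the table and variable names here are hypothetical), the `-o` flag of the `%%sql` magic captures a query result as a local Pandas dataframe, which can then be manipulated in Python under `%%local`:

```
%%sql -o top_rows
SELECT * FROM my_table LIMIT 10
```

```
%%local
# top_rows is now a Pandas dataframe in the local Python session
top_rows.describe()
```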
We welcome contributions from everyone. If you've made an improvement to our code, please send us a pull request.
To dev install, execute the following:
```
git clone https://github.com/jupyter-incubator/sparkmagic
pip install -e hdijupyterutils
pip install -e autovizwidget
pip install -e sparkmagic
```
and optionally follow steps 3 and 4 above.
To run unit tests, run:
```
nosetests hdijupyterutils autovizwidget sparkmagic
```
If you want to see an enhancement made but don't have time to work on it yourself, feel free to submit an issue for us to deal with.