The difference between SQLContext and HiveContext in Spark SQL

家志学

2023-12-01

I was confused about the difference between these two, so I googled it.
One of Spark's modules is SparkSQL. SparkSQL can be used to process structured data, so with SparkSQL your data must have a defined schema. In Spark 1.3.1, SparkSQL implements DataFrames and a SQL query engine. SparkSQL has a SQLContext and a HiveContext. HiveContext is a superset of the SQLContext. Hortonworks and the Spark community suggest using the HiveContext. When you run spark-shell, which is your interactive driver application, it automatically creates a SparkContext defined as sc and a HiveContext defined as sqlContext. The HiveContext allows you to execute SQL queries as well as Hive commands. The same behavior occurs for pyspark. For details, see the Spark 1.3.1 documentation for SQLContext and HiveContext.
Source: https://blogs.msdn.microsoft.com/bigdatasupport/2015/09/14/understanding-sparks-sparkconf-sparkcontext-sqlcontext-and-hivecontext/
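
The two entry points described in the quoted passage can also be created by hand in a standalone PySpark script, instead of relying on what spark-shell or pyspark set up. The following is only a minimal sketch, assuming the Spark 1.x Python API (`pyspark.sql.SQLContext` and `pyspark.sql.HiveContext`). The helper name `make_contexts` and the application name are made up for illustration, and the imports sit inside the function so that the file can be loaded even where Spark is not installed.

```python
def make_contexts(app_name="context-demo"):
    """Create a SparkContext plus both SQL entry points (Spark 1.x API).

    `make_contexts` and `app_name` are illustrative names, not part of Spark.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, HiveContext

    sc = SparkContext(appName=app_name)
    # Plain Spark SQL entry point; works without Hive support in the build.
    sql_context = SQLContext(sc)
    # Superset of SQLContext: also accepts HiveQL and can use Hive UDFs and
    # Hive tables. Requires a Spark build that includes Hive support.
    hive_context = HiveContext(sc)
    return sc, sql_context, hive_context
```

With a Spark 1.x installation, `sc, sql_context, hive_context = make_contexts()` followed by `hive_context.sql("SHOW TABLES").show()` should list the tables visible to the HiveContext; call `sc.stop()` when finished.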

Spark 2.0 provides native window functions (SPARK-8641) and features some additional improvements in parsing and much better SQL 2003 compliance, so it is significantly less dependent on Hive to achieve core functionality. Because of that, HiveContext seems to be slightly less important.
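
To make the Spark 2.0 situation concrete: from 2.0 on, the usual entry point is `SparkSession`, and Hive integration becomes an opt-in switch on its builder rather than a separate context class. The sketch below assumes PySpark 2.x or later; the helper name `make_session` and its parameters are hypothetical and exist only for this example.

```python
def make_session(app_name="session-demo", with_hive=False):
    """Build a SparkSession, optionally with Hive support (Spark >= 2.0).

    `make_session`, `app_name` and `with_hive` are illustrative names.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    if with_hive:
        # Only needed for Hive tables, the Hive metastore, or Hive UDFs;
        # window functions work without it on Spark 2.0 and later.
        builder = builder.enableHiveSupport()
    return builder.getOrCreate()
```

Leaving `with_hive` at its default gives a session without Hive integration, which matches the point above that core functionality no longer depends on Hive.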

Spark < 2.0

Obviously, if you want to work with Hive you have to use HiveContext. Beyond that, the biggest difference as of now (Spark 1.5) is support for window functions and the ability to access Hive UDFs.

Generally speaking, window functions are a pretty cool feature and can be used to solve quite complex problems in a concise way without going back and forth between RDDs and DataFrames. Performance is still far from optimal, especially without a PARTITION BY clause, but that is really nothing Spark-specific.
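
As an illustration of the kind of query meant here, the sketch below ranks rows within each group using a SQL window function with a PARTITION BY clause. It assumes the Spark 1.x Python API and a HiveContext passed in by the caller (needed for window functions before Spark 2.0). The table name `employees`, the column names, and the helper `rank_salaries` are all invented for the example.

```python
# Hypothetical table and column names, used only for illustration.
RANK_QUERY = """
SELECT dept, name, salary,
       rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
FROM employees
"""


def rank_salaries(hive_context, rows):
    """Rank rows within each department using a SQL window function.

    `hive_context` is assumed to be a HiveContext; `rows` is a list of
    (dept, name, salary) tuples.
    """
    df = hive_context.createDataFrame(rows, ["dept", "name", "salary"])
    # Spark 1.x name for registering a DataFrame as a temporary table.
    df.registerTempTable("employees")
    return hive_context.sql(RANK_QUERY)
```

For example, `rank_salaries(hive_context, [("eng", "ann", 120), ("eng", "bob", 100), ("ops", "cy", 90)]).show()` would print each row together with its rank inside its department, with no round trip through RDDs.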

Regarding Hive UDFs, this is not a serious issue now, but before Spark 1.5 many SQL functions were expressed using Hive UDFs and required HiveContext to work.

HiveContext also provides a more robust SQL parser. See for example: py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statement

Finally, HiveContext is required to start the Thrift server.

The biggest problem with HiveContext is that it comes with large dependencies.
Source:
http://stackoverflow.com/questions/33666545/what-is-the-difference-between-apache-spark-sqlcontext-vs-hivecontext

I also heard from someone that it is because companies currently all use Hive.
