Question:

Broadcast hash join (BHJ) in Spark for a full outer join (outer, full, fullouter)

咸正平

2023-03-14

How can I force a full outer join on DataFrames in Spark to use a broadcast hash join? Here is the code snippet:

sparkConfiguration.set("spark.sql.autoBroadcastJoinThreshold", "1000000000")
val Result = BigTable.join(
  org.apache.spark.sql.functions.broadcast(SmallTable),
  Seq("X", "Y", "Z", "W", "V"),
  "outer"
)

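However, when I use "outer" as the join type, Spark decides to use SortMergeJoin for some unknown reason. Does anyone know how to solve this? Based on the performance I see with a left outer join, BroadcastHashJoin would help speed up my application.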

伍捷

2023-03-14

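"Spark decides to use SortMergeJoin for some unknown reason. Does anyone know how to solve this?"

Reason: FullOuter (that is, any of the keywords outer, full, fullouter) does not support broadcast hash join (also known as a map-side join).

How can this be demonstrated?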

package com.examples

import org.apache.log4j.{Level, Logger}
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

/**
  * Join Example and some basics demonstration using sample data.
  *
  * @author : Ram Ghadiyaram
  */
object JoinExamples extends Logging {
  // switch off unnecessary logs
  Logger.getLogger("org").setLevel(Level.OFF)
  val spark: SparkSession = SparkSession.builder.config("spark.master", "local").getOrCreate()
  case class Person(name: String, age: Int, personid: Int)

  case class Profile(name: String, personId: Int, profileDescription: String)

  /**
    * main
    *
    * @param args Array[String]
    */
  def main(args: Array[String]): Unit = {
    spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
    import spark.implicits._

    spark.sparkContext.getConf.getAllWithPrefix("spark.sql").foreach(x => logInfo(x.toString()))
    /**
      * create 2 dataframes here using case classes one is Person df1 and another one is profile df2
      */
    val df1 = spark.sqlContext.createDataFrame(
      spark.sparkContext.parallelize(
        Person("Sarath", 33, 2)
          :: Person("KangarooWest", 30, 2)
          :: Person("Ravikumar Ramasamy", 34, 5)
          :: Person("Ram Ghadiyaram", 42, 9)
          :: Person("Ravi chandra Kancharla", 43, 9)
          :: Nil))


    val df2 = spark.sqlContext.createDataFrame(
      Profile("Spark", 2, "SparkSQLMaster")
        :: Profile("Spark", 5, "SparkGuru")
        :: Profile("Spark", 9, "DevHunter")
        :: Nil
    )

    // alias the DataFrames so columns can be referenced by alias, which improves readability

    val df_asPerson = df1.as("dfperson")
    val df_asProfile = df2.as("dfprofile")
    /** *
      * Example displays how to join them in the dataframe level
      * next example demonstrates using sql with createOrReplaceTempView
      */
    val joined_df = df_asPerson.join(
      broadcast(df_asProfile)
      , col("dfperson.personid") === col("dfprofile.personid")
      , "outer")
    val joined = joined_df.select(
      col("dfperson.name")
      , col("dfperson.age")
      , col("dfprofile.name")
      , col("dfprofile.profileDescription"))
    joined.explain(false) // it will show which join was used
    joined.show

  }

}

== Physical Plan ==
*Project [name#4, age#5, name#11, profileDescription#13]
+- SortMergeJoin [personid#6], [personid#12], FullOuter
   :- *Sort [personid#6 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(personid#6, 200)
   :     +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, com.examples.JoinExamples$Person, true]).name, true) AS name#4, assertnotnull(input[0, com.examples.JoinExamples$Person, true]).age AS age#5, assertnotnull(input[0, com.examples.JoinExamples$Person, true]).personid AS personid#6]
   :        +- Scan ExternalRDDScan[obj#3]
   +- *Sort [personid#12 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(personid#12, 200)
         +- LocalTableScan [name#11, personId#12, profileDescription#13]
+--------------------+---+-----+------------------+
|                name|age| name|profileDescription|
+--------------------+---+-----+------------------+
|  Ravikumar Ramasamy| 34|Spark|         SparkGuru|
|      Ram Ghadiyaram| 42|Spark|         DevHunter|
|Ravi chandra Kanc...| 43|Spark|         DevHunter|
|              Sarath| 33|Spark|    SparkSQLMaster|
|        KangarooWest| 30|Spark|    SparkSQLMaster|
+--------------------+---+-----+------------------+
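
For comparison, here is a minimal sketch that prints the plan for the same kind of join with "left_outer" and with "outer". It assumes spark-sql is on the classpath; the object name JoinTypeComparison and the tiny sample tables are made up for illustration. Given the explanation in this answer and the left outer behaviour reported in the question, the left outer plan would be expected to list BroadcastHashJoin and the outer plan a different join operator, but the exact plan text depends on the Spark version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical example object; not part of the original answer.
object JoinTypeComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("JoinTypeComparison")
      .getOrCreate()
    import spark.implicits._

    // Small made-up tables that share a "personid" join key.
    val big = Seq(("Sarath", 2), ("Ravikumar Ramasamy", 5)).toDF("name", "personid")
    val small = Seq((2, "SparkSQLMaster"), (5, "SparkGuru")).toDF("personid", "profileDescription")

    // Print the physical plan for each join type, keeping the broadcast hint on the right side.
    for (joinType <- Seq("left_outer", "outer")) {
      println(s"--- join type: $joinType ---")
      big.join(broadcast(small), Seq("personid"), joinType).explain()
    }

    spark.stop()
  }
}
```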

sparkSession.conf.set("spark.sql.join.preferSortMergeJoin", "false")

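This setting tells SparkStrategies.scala (which is responsible for converting a logical plan into zero or more SparkPlans) that you do not want sort merge join to be preferred.

The property spark.sql.join.preferSortMergeJoin (PREFER_SORTMERGEJOIN), when true, makes Spark prefer sort merge join over shuffle hash join.

Setting it to false does not mean that Spark will choose only broadcast hash join; it may choose something else as well (for example, shuffle hash join).

Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold, or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcast and the other side will be streamed, with no shuffling performed. If both sides of the join are eligible to be broadcast, then

Shuffle hash join: if the average size of a single partition is small enough to build a hash table.

Sort merge: if the matching join keys are sortable.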
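
The broadcast rule only helps when the join type allows the hinted side to be the build (broadcast) side, which is the reason given at the start of this answer. Below is a minimal sketch of that eligibility check in plain Scala. It is a simplified model, not Spark's source code, and every name in it (JoinKind, BuildSide, canBroadcastHashJoin) is hypothetical. The underlying assumption is that a broadcast hash join can only build its hash table from a side whose unmatched rows do not need to be preserved, and in a full outer join both sides must be preserved.

```scala
// Simplified model of broadcast hash join eligibility; NOT Spark's actual code.
// All names below are hypothetical and exist only for illustration.
object BroadcastEligibility {
  sealed trait JoinKind
  case object Inner extends JoinKind
  case object LeftOuter extends JoinKind
  case object RightOuter extends JoinKind
  case object FullOuter extends JoinKind

  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  // The build side is broadcast and hashed; the other side is streamed.
  // Assumption: only a side whose unmatched rows are NOT preserved may be the build side.
  def canBroadcastHashJoin(kind: JoinKind, side: BuildSide): Boolean =
    (kind, side) match {
      case (Inner, _)               => true  // neither side is preserved
      case (LeftOuter, BuildRight)  => true  // left is preserved, so build from right
      case (RightOuter, BuildLeft)  => true  // right is preserved, so build from left
      case _                        => false // includes FullOuter with either side
    }

  def main(args: Array[String]): Unit = {
    val kinds: Seq[JoinKind] = Seq(Inner, LeftOuter, RightOuter, FullOuter)
    val sides: Seq[BuildSide] = Seq(BuildLeft, BuildRight)
    for (kind <- kinds; side <- sides)
      println(s"$kind with $side -> ${canBroadcastHashJoin(kind, side)}")
  }
}
```

In this model a broadcast hint on either side of a full outer join is never eligible, which is consistent with the SortMergeJoin shown in the physical plan of this answer.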
