Question:

Submitting a Spark job from Eclipse to yarn-client using Scala

冯旭
2023-03-14

I'm new to Spark and Scala, and I'm having trouble submitting a Spark job in yarn-client mode. Doing this through the shell (spark-submit) works fine: I first create the Spark job in Eclipse, compile it into a JAR, and then submit it from the shell with spark-submit, like this:

 spark-submit --class ebicus.WordCount /u01/stage/mvn_test-0.0.1.jar

However, compiling in Eclipse and submitting directly to YARN from there seems to be difficult.

My project setup is as follows: my cluster runs Cloudera CDH 5.6, and I have a Maven project using Scala whose classpath is in sync with my pom.xml:

package test

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {

  def main(args: Array[String]): Unit = {
    // Needed on Windows so Hadoop can locate winutils.exe
    System.setProperty("hadoop.home.dir", "c:\\winutil\\")
    // Tell Spark it is running against YARN outside of spark-submit
    System.setProperty("SPARK_YARN_MODE", "true")

    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("yarn-client")
      .set("spark.hadoop.fs.defaultFS", "hdfs://namecluster.com:8020/user/username")
      .set("spark.hadoop.dfs.nameservices", "namecluster.com:8020")
      .set("spark.hadoop.yarn.resourcemanager.hostname", "namecluster.com")
      .set("spark.hadoop.yarn.resourcemanager.address", "namecluster.com:8032")
      // Classpath the YARN containers should use (default CDH locations)
      .set("spark.hadoop.yarn.application.classpath",
              "/etc/hadoop/conf,"
          + "/usr/lib/hadoop/*,"
          + "/usr/lib/hadoop/lib/*,"
          + "/usr/lib/hadoop-hdfs/*,"
          + "/usr/lib/hadoop-hdfs/lib/*,"
          + "/usr/lib/hadoop-mapreduce/*,"
          + "/usr/lib/hadoop-mapreduce/lib/*,"
          + "/usr/lib/hadoop-yarn/*,"
          + "/usr/lib/hadoop-yarn/lib/*,"
          + "/usr/lib/spark/*,"
          + "/usr/lib/spark/lib/*"
      )
      .set("spark.driver.host", "localhost")

    val sc = new SparkContext(conf)

    val file = sc.textFile("hdfs://namecluster.com:8020/user/root/testdir/test.csv")
    // Count words; the split is on char 001 (Hive's default field delimiter)
    val counts = file.flatMap(line => line.split(1.toChar)).map(word => (word, 1)).reduceByKey(_ + _)

    // Swap key and value so the count comes first, then sort from high to low
    val test = counts.map(_.swap).sortBy(_._1, ascending = false, numPartitions = 5)

    test.saveAsTextFile("hdfs://namecluster.com:8020/user/root/test1")

  }

}
When I run this from Eclipse, the application is accepted by YARN, but the executor containers keep failing with exit status 1. The YARN executor launch context and diagnostics are:

YARN executor launch context:
  env:
    CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>/etc/hadoop/conf<CPS>/usr/lib/hadoop/*<CPS>/usr/lib/hadoop/lib/*<CPS>/usr/lib/hadoop-hdfs/*<CPS>/usr/lib/hadoop-hdfs/lib/*<CPS>/usr/lib/hadoop-mapreduce/*<CPS>/usr/lib/hadoop-mapreduce/lib/*<CPS>/usr/lib/hadoop-yarn/*<CPS>/usr/lib/hadoop-yarn/lib/*<CPS>/usr/lib/spark/*<CPS>/usr/lib/spark/lib/*<CPS>/usr/lib/spark/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$MR2_CLASSPATH
    SPARK_LOG_URL_STDERR -> http://cloudera-002.fusion.ebicus.com:8042/node/containerlogs/container_1461679867178_0026_01_000005/hadriaans/stderr?start=-4096
    SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1461679867178_0026
    SPARK_YARN_CACHE_FILES_FILE_SIZES -> 520473
    SPARK_USER -> hadriaans
    SPARK_YARN_CACHE_FILES_VISIBILITIES -> PRIVATE
    SPARK_YARN_MODE -> true
    SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1462288779267
    SPARK_LOG_URL_STDOUT -> http://cloudera-002.fusion.ebicus.com:8042/node/containerlogs/container_1461679867178_0026_01_000005/hadriaans/stdout?start=-4096
    SPARK_YARN_CACHE_FILES -> hdfs://cloudera-003.fusion.ebicus.com:8020/user/hadriaans/.sparkStaging/application_1461679867178_0026/spark-yarn_2.10-1.5.0.jar#__spark__.jar

  command:
    {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms1024m -Xmx1024m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.driver.port=49961' -Dspark.yarn.app.container.log.dir=<LOG_DIR> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@10.29.51.113:49961/user/CoarseGrainedScheduler --executor-id 4 --hostname cloudera-002.fusion.ebicus.com --cores 1 --app-id application_1461679867178_0026 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
===============================================================================

16/05/03 17:19:58 INFO impl.ContainerManagementProtocolProxy: Opening proxy : cloudera-002.fusion.ebicus.com:8041
16/05/03 17:20:01 INFO yarn.YarnAllocator: Completed container container_1461679867178_0026_01_000005 (state: COMPLETE, exit status: 1)
16/05/03 17:20:01 INFO yarn.YarnAllocator: Container marked as failed: container_1461679867178_0026_01_000005. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1461679867178_0026_01_000005
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
    at org.apache.hadoop.util.Shell.run(Shell.java:478)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

1 Answer

宗政小林
2023-03-14

Based on my previous experience, there are two possible situations that can cause this not-very-descriptive error (I submit jobs from Eclipse as well, but using Java):

  • I noticed that you never pass your JAR to the SparkContext's configuration. When I remove the line that passes the JAR from my own Eclipse submission, my code fails with exactly the same error. So essentially: in your code you set the path of a JAR that does not exist yet, then export the project as a runnable JAR (which packages all its transitive dependencies) to the path you previously set in the code. This is how it looks in Java:

    SparkConf sparkConfiguration = new SparkConf();
    sparkConfiguration.setJars(new String[]{"path to your jar"});
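
    Since the question uses Scala, here is a minimal sketch of the same fix in the Scala API, reusing the JAR path from the spark-submit command above (adjust it to wherever you export the runnable JAR):

        val conf = new SparkConf()
          .setAppName("WordCount")
          .setMaster("yarn-client")
          // Ship the exported runnable JAR to the YARN containers
          .setJars(Seq("/u01/stage/mvn_test-0.0.1.jar"))

    Calling sc.addJar("/u01/stage/mvn_test-0.0.1.jar") after creating the SparkContext should have a similar effect.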
