我是spark和scala的新手,我很难提交一份作为YARN客户的spark工作。通过spark shell(spark submit)执行此操作没有问题:首先在eclipse中创建一个spark作业,然后将其编译到jar中,并通过内核shell使用spark submit,如下所示:
spark-submit --class ebicus.WordCount /u01/stage/mvn_test-0.0.1.jar
然而,使用Eclipse直接编译并将其提交给YARN似乎很困难。
我的项目设置如下:我的集群运行CDH cloudera 5.6。我有一个使用scala的maven项目,我的类路径/和我的pom.xml一起在sinc中
package test
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.TaskContext;
import akka.actor
import org.apache.spark.deploy.yarn.ClientArguments
import org.apache.spark.deploy.ClientArguments
object WordCount {
def main(args: Array[String]): Unit = {
// val workaround = new File(".");
System.getProperties().put("hadoop.home.dir", "c:\\winutil\\");
System.setProperty("SPARK_YARN_MODE", "true");
val conf = new SparkConf()
.setAppName("WordCount")
.setMaster("yarn-client")
.set("spark.hadoop.fs.defaultFS", "hdfs://namecluster.com:8020/user/username")
.set("spark.hadoop.dfs.nameservices", "namecluster.com:8020")
.set("spark.hadoop.yarn.resourcemanager.hostname", "namecluster.com")
.set("spark.hadoop.yarn.resourcemanager.address", "namecluster:8032")
.set("spark.hadoop.yarn.application.classpath",
"/etc/hadoop/conf,"
+"/usr/lib/hadoop/*,"
+"/usr/lib/hadoop/lib/*,"
+"/usr/lib/hadoop-hdfs/*,"
+"/usr/lib/hadoop-hdfs/lib/*,"
+"/usr/lib/hadoop-mapreduce/*,"
+"/usr/lib/hadoop-mapreduce/lib/*,"
+"/usr/lib/hadoop-yarn/*,"
+"/usr/lib/hadoop-yarn/lib/*,"
+"/usr/lib/spark/*,"
+"/usr/lib/spark/lib/*,"
+"/usr/lib/spark/lib/*"
)
.set("spark.driver.host","localhost");
val sc = new SparkContext(conf);
val file = sc.textFile("hdfs://namecluster.com:8020/user/root/testdir/test.csv")
//Count number of words from a hive table (split is based on char 001)
val counts = file.flatMap(line => line.split(1.toChar)).map(word => (word, 1)).reduceByKey(_ + _)
//swap key and value with count value and sort from high to low
val test = counts.map(_.swap).sortBy(word =>(word,1), false, 5)
test.saveAsTextFile("hdfs://namecluster.com:8020/user/root/test1")
}
}
YARN executor launch context:
env:
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>/etc/hadoop/conf<CPS>/usr/lib/hadoop/*<CPS>/usr/lib/hadoop/lib/*<CPS>/usr/lib/hadoop-hdfs/*<CPS>/usr/lib/hadoop-hdfs/lib/*<CPS>/usr/lib/hadoop-mapreduce/*<CPS>/usr/lib/hadoop-mapreduce/lib/*<CPS>/usr/lib/hadoop-yarn/*<CPS>/usr/lib/hadoop-yarn/lib/*<CPS>/usr/lib/spark/*<CPS>/usr/lib/spark/lib/*<CPS>/usr/lib/spark/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$MR2_CLASSPATH
SPARK_LOG_URL_STDERR -> http://cloudera-002.fusion.ebicus.com:8042/node/containerlogs/container_1461679867178_0026_01_000005/hadriaans/stderr?start=-4096
SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1461679867178_0026
SPARK_YARN_CACHE_FILES_FILE_SIZES -> 520473
SPARK_USER -> hadriaans
SPARK_YARN_CACHE_FILES_VISIBILITIES -> PRIVATE
SPARK_YARN_MODE -> true
SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1462288779267
SPARK_LOG_URL_STDOUT -> http://cloudera-002.fusion.ebicus.com:8042/node/containerlogs/container_1461679867178_0026_01_000005/hadriaans/stdout?start=-4096
SPARK_YARN_CACHE_FILES -> hdfs://cloudera-003.fusion.ebicus.com:8020/user/hadriaans/.sparkStaging/application_1461679867178_0026/spark-yarn_2.10-1.5.0.jar#__spark__.jar
command:
{{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms1024m -Xmx1024m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.driver.port=49961' -Dspark.yarn.app.container.log.dir=<LOG_DIR> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@10.29.51.113:49961/user/CoarseGrainedScheduler --executor-id 4 --hostname cloudera-002.fusion.ebicus.com --cores 1 --app-id application_1461679867178_0026 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
===============================================================================
16/05/03 17:19:58 INFO impl.ContainerManagementProtocolProxy: Opening proxy : cloudera-002.fusion.ebicus.com:8041
16/05/03 17:20:01 INFO yarn.YarnAllocator: Completed container container_1461679867178_0026_01_000005 (state: COMPLETE, exit status: 1)
16/05/03 17:20:01 INFO yarn.YarnAllocator: Container marked as failed: container_1461679867178_0026_01_000005. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1461679867178_0026_01_000005
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
根据我以前的经验,有两种可能的情况可能会导致这个不是很描述性的错误(我从Eclipse提交作业,但使用Java)
>
我注意到您没有将JAR传递给SparkContext的配置。如果我删除从Eclipse提交时传递JAR的行,我的代码将以完全相同的错误失败。因此,基本上您将尚未存在的JAR的路径设置到代码中,然后将项目导出为可运行的JAR,这将把所有的传递依赖项打包到其中,并将其转换到您先前在代码中设置的路径中。这是它在Java中的外观:
SparkConf sparkConfiguration=new SparkConf();
sparkConfiguration.setjars(new string[]{“path to your jar”});
问题内容: 但是有很多歧义和提供的一些答案…包括在jars / executor / driver配置或选项中复制jar引用。 How ClassPath is affected Driver Executor (for tasks running) Both not at all Separation character: comma, colon, semicolon If provided
真的...已经讨论了很多。 然而,有很多模棱两可之处,提供的一些答案。。。包括在JAR/执行器/驱动程序配置或选项中复制JAR引用。 应为每个选项澄清以下歧义、不清楚和/或省略的细节: 类路径如何受到影响 驾驶员 执行器(用于正在运行的任务) 两者都有 一点也不 对于任务(给每个执行者) 方法 方法 或 ,或者 别忘了,spack-提交的最后一个参数也是一个. jar文件。 我知道在哪里可以找到主
我被困在: 在我得到这个之前: 当我签出应用程序跟踪页面时,我在stderr上得到以下信息: 我对这一切都很陌生,也许我的推理有缺陷,任何投入或建议都会有所帮助。
也许一定有一个更合适的方式来提交火花工作。有人知道如何将Apache Spark作业远程提交到hDinsight吗? 多谢!
im关注亚马逊文档,向emr集群提交spark作业https://aws.amazon.com/premiumsupport/knowledge-center/emr-submit-spark-job-remote-cluster/ 在按照说明进行操作后,使用frecuent进行故障排除,它由于未解析的地址与消息类似而失败。 错误火花。SparkContext:初始化SparkContext时出错
我对Spark很陌生,目前正在通过玩pyspark和Spark-Shell来探索它。 现在的情况是,我用pyspark和Spark-Shell运行相同的spark作业。 这是来自Pyspark: 使用spark-shell,工作在25分钟内完成,使用pyspark大约55分钟。如何让Spark独立地用pyspark分配任务,就像它用Spark-shell分配任务一样?