
Common Spark Problems: A Quick Summary


Spark is an excellent tool to use with Apache Cassandra and thanks to the DataStax OSS Spark Cassandra Connector it couldn’t be easier. But, as with any new system, there are some gotchas that can hold up new users. Here is a quick list of common problems and how to solve them!

(Note: While most of this is geared towards DataStax Enterprise Spark Standalone users, this advice should be applicable to most Apache Spark users as well!)

Update 2015-06-10: As a follow-on to this blog post, see Zen and the Art of Spark Maintenance which dives deeper into the inner workings of Spark and how you can shape your application to take advantage of interactions between Spark and Cassandra.

Initial job has not accepted any resources: Investigating the cluster state

This is by far the most common first error that a new Spark user will see when attempting to run a new application. Our new and excited Spark user will attempt to start the shell or run their own application and be met with the following message:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster ui
to ensure that workers are registered and have sufficient memory

This message will pop up any time an application is requesting more resources from the cluster than the cluster can currently provide. What resources, you might ask? Well, Spark is only looking for two things: cores and RAM. Cores represent the number of open executor slots that your cluster provides for execution. RAM refers to the amount of free memory required on any worker running your application. Note that for both of these resources the maximum value is not your system’s physical max; it is the max as set by your Spark configuration. To see the current state of your cluster (and its free resources) check out the UI at SparkMasterIP:7080 (DSE users can find their Spark Master URI using dsetool sparkmaster).

[Figure: Spark Master UI showing no cores available, with a second Spark shell application stuck waiting]

The image above shows my Spark UI running on my MacBook (localhost:7080). You can see one of the Spark Shell applications is currently waiting. I caused this situation by starting the Spark Shell in two different terminals. The first Spark Shell consumed all the available cores in the system, leaving the second shell waiting for resources. Until the first Spark Shell is terminated and its resources are released, all other apps will display the above warning. For more details on how to read the Spark UI, check the section below.

The short-term solution to this problem is to make sure you aren’t requesting more resources from your cluster than exist, or to shut down any apps that are unnecessarily using resources. If you need to run multiple Spark apps simultaneously then you’ll need to adjust the number of cores being used by each app.

Application isn’t using all of the Cores: How to set the Cores used by a Spark App

There are two important system variables that control how many cores a particular application will use in Apache Spark. The first is set on the Spark Master: spark.deploy.defaultCores. You can set this as -Dspark.deploy.defaultCores in the command line of your Spark Master startup script (in DSE, look in resources/spark/conf/spark-env.sh). This option sets the default number of cores to be used by all applications started through this master. This means that if I had set spark.deploy.defaultCores=3 in my example above, everything would have been fine: each Spark Shell would have reserved 3 cores for execution and the two jobs could have run simultaneously.

The second variable for controlling the number of cores allows us to override the master default and set the number of cores on a per-app basis. This variable, spark.cores.max, can be set programmatically or as a command-line JVM option. Programmatically, it is set by adding a key-value pair to the SparkConf object used for creating your Spark context. On the command line it is set using -Dspark.cores.max=N.

With spark-submit it can also be passed on the command line: --conf spark.cores.max=N
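As a minimal sketch (the app name and master URL below are illustrative placeholders), setting it programmatically looks like this:

    import org.apache.spark.{SparkConf, SparkContext}

    // a minimal sketch; "my-app" and the master URL are placeholders
    val conf = new SparkConf()
      .setAppName("my-app")
      .setMaster("spark://sparkmaster:7077")
      .set("spark.cores.max", "3")   // this app will never claim more than 3 cores cluster-wide
    val sc = new SparkContext(conf)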

Spark Executor OOM: How to set Memory Parameters on Spark

Once an app is running, the next most likely error you will see is an OOM on a Spark executor. Spark is an extremely powerful tool for doing in-memory computation, but its power comes with some sharp edges. The most common cause for an executor OOMing is that the application is trying to cache or load too much information into memory. Depending on your use case there are several solutions to this:

  • Increase the parallelism of your job. Try increasing the number of partitions in your job. By splitting the work into smaller sets of data, less information will have to be resident in memory at a given time. For a Spark Cassandra Connector job this would mean decreasing the split size variable. The variable, spark.cassandra.input.split.size, can be set either on the command line as above or in the SparkConf object (see the sketch after this list). For other RDD types, look into their APIs to determine exactly how they determine partition size.

  • Increase the storage fraction variable, spark.storage.memoryFraction. This can be set as above on either the command line or in the SparkConf object. This variable sets exactly how much of the JVM will be dedicated to the caching and storage of RDDs. You can set it as a value between 0 and 1, describing what portion of executor JVM memory will be dedicated to caching RDDs. If you have a job that will require very little shuffle memory but will utilize a lot of cached RDDs, increase this variable (example: caching an RDD and then performing aggregates on it).

  • If all else fails you may just need additional RAM on each worker. For DSE users, adjust your spark-env.sh file (or dse.yaml in DSE 4.6) to increase the SPARK_MEM reserved for Spark jobs. You will need to restart your workers for these new memory limits to take effect (dse sparkworker restart). Then increase the amount of RAM the application requests by setting the spark.executor.memory variable either on the command line or in the SparkConf object.
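Here is a minimal sketch that sets the properties above in a SparkConf; every value (and the app name) is an illustrative placeholder to be tuned for your own workload:

    import org.apache.spark.{SparkConf, SparkContext}

    // a minimal sketch; all values below are illustrative, not recommendations
    val conf = new SparkConf()
      .setAppName("oom-tuning-example")                   // hypothetical app name
      .set("spark.cassandra.input.split.size", "10000")   // smaller splits -> more partitions, less data per task
      .set("spark.storage.memoryFraction", "0.7")         // give more of the executor heap to cached RDDs
      .set("spark.executor.memory", "4g")                 // request more RAM per executor (workers must have it free)
    val sc = new SparkContext(conf)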

Shark Server / Long-Running Application Metadata Cleanup

As Spark applications run they create metadata objects which are stored in memory indefinitely by default. For Spark Streaming jobs you are forced to set the variable spark.cleaner.ttl to clean out these objects and prevent an OOM. On other long-lived projects you must set this yourself. For users of Shark Server this is especially important. DSE users (4.5.x) can set this property in their shark-env.sh file for Shark Server deployments. This will let you have a long-running Shark process without worrying about a sudden OOM. To check whether this issue is causing your OOMs, look in your heap dumps for a large number of scheduler.ShuffleMapTask and scheduler.ResultTask objects.

To set this you would end up modifying your SPARK_JAVA_OPTS variable like this:
    export SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 -Dspark.cleaner.ttl=43200"
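The same property can also be set from application code; a minimal sketch (the 43200-second value simply mirrors the example above):

    // a minimal sketch: the same cleaner TTL (43200 s = 12 h) set programmatically
    val conf = new SparkConf().set("spark.cleaner.ttl", "43200")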

Class Not Found: Classpath Issues

Another common issue is seeing class not found errors when running Spark programs. This is a slightly confusing topic because Spark actually runs several JVMs when it executes your process, and the classpath must be correct for each of them. Usually this comes down to correctly shipping dependencies to the executors. Make sure that when running you include a fat jar containing all of your dependencies (I recommend using sbt assembly) in the SparkConf object used to make your SparkContext. You should end up writing a line like this in your Spark application:


    val conf = new SparkConf().setAppName(appName).setJars(Seq(System.getProperty("user.dir") + "/target/scala-2.10/sparktest.jar"))

This should fix the vast majority of class not found problems. Another option is to place your dependencies on the default classpath on all of the worker nodes in the cluster. This way you won’t have to pass around a large jar.

The only other major source of class not found issues stems from different versions of the libraries in use. For example, if you don’t use identical versions of the common libraries in your application and in the Spark server, you will end up with classpath issues. This can occur when you compile against one version of a library (like Spark 1.1.0) and then attempt to run against a cluster with a different or out-of-date version (like Spark 0.9.2). Make sure that you are matching your library versions to whatever is being loaded onto executor classpaths. A common example of this would be compiling against an alpha build of the Spark Cassandra Connector and then attempting to run using classpath references to an older version.

What is happening: A Brief Tour of the Spark UI

Once your job has started and it’s not throwing any exceptions, you may want to get a picture of what’s going on. All of the information about the current state of the application is available on the Spark UI. Here is a brief walkthrough, starting with the initial screen.

If you are running on AWS or GCE you may find it useful to set SPARK_PUBLIC_DNS=PUBLIC_IP for each of the nodes in your cluster. This will make the links work correctly instead of just connecting to the internal provider IP addresses.

[Figure: Spark Master UI overview page, with (1) cluster-wide statistics, (2) Workers, (3) Running Applications and (4) Completed Applications]

You should see something very similar to this when accessing the UI page for your Spark cluster. In the upper left (1) you’ll see the cluster-wide statistics showing exactly what resources are available. These numbers are aggregates for all of the workers and running jobs in the cluster. Starting at the Workers line (2) we’ll see what action is actually taking place on our cluster at the moment.

First listed is exactly which workers are currently running and how utilized they are. You can see here, on a node-by-node basis, how much memory is free and how many cores are available. This is significant because the upper bound on how much memory an application can use is set at this level. For example, a job which requests 8 GB of RAM can only run on workers which have at least 8 GB of free RAM. If any particular node has hung or is not appearing on the worker list, try running dsetool sparkworker restart on that node to restart the worker process. (Note: you may see dead workers on this page if you have recently restarted nodes or Spark worker processes; this is not an issue, the master just hasn’t fully confirmed that the old worker is gone.)

Below that we can see the currently running Spark applications (3). Since I’m currently running an instance of the Spark Shell, you can see an entry for it listed as running. A line will appear here for every SparkContext object that is created and communicating with this master. Since the Spark context for the Spark Shell is created on startup and doesn’t close until the shell is closed, we should see this listed as long as we are running Spark Shell commands. In the Completed Applications section (4), we can see I’ve shut down the Spark Shell from earlier that was waiting for resources.

In Spark 1.0+ you can enable event logging, which lets you see the application detail for past applications, but I haven’t enabled it for this example. This means we can only look into the state of currently running applications. To peek into the app we can click on the application’s name (“Spark shell”, directly to the right of the 3) and we’ll be taken to the App Detail page.

[Figure: Application detail page listing the completed stages for the Spark shell]

Here we can see the various Stages that this application has completed. For this particular example I’ve run the following commands:


    sc.parallelize(1 until 10000).count
    sc.parallelize(1 until 10000).map( x => (x%30,x)).groupByKey().count

You can see each Spark transformation (map) and action (count) is reported as a separate stage. We can see that in this case, each of the stages has already completed successfully. We can also see, for each stage, exactly how many tasks it was broken into and how many of them are currently complete. The number of tasks here is the maximum level of parallelism that can be achieved for that stage. The tasks will be handed out to available executors, so if there are only 2 tasks but 4 executors (cores), then that stage can only ever run on 2 cores at the same time. How many tasks are created depends on the underlying RDD and the nature of the transformation you are running; be sure to check your RDD’s API to determine how many partitions/tasks will be created. To actually see the details of how a particular stage was accomplished, click on the “Description” field for that stage to go to the Stage Detail page.
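Before moving on to the Stage Detail page, here is a minimal sketch of how that task count can be influenced when an RDD is created (the partition counts below are arbitrary examples):

    // a minimal sketch; the partition counts are arbitrary examples
    val rdd = sc.parallelize(1 until 10000, 8)   // ask for 8 partitions, so this stage has 8 tasks
    val wider = rdd.repartition(32)              // reshuffle into 32 partitions for more parallelism
    wider.count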

[Figure: Stage Detail page with per-task timing and location information]

This page gives us the nitty-gritty of how our stage was actually accomplished. At the bottom we see a list of every task, on what machine (and core) it was run, and how long it took. This is a key place to look to find tasks that failed, see on what node they failed, and get information about where the bottlenecks are. The Summary at the top of the page aggregates all of the entries at the bottom of the page, making it easy to see if there are outliers that may have been running slow for some reason. Use this page to debug performance issues with your tasks.

Let’s take a quick detour to the “Storage Tab” at the top of this screen. This will take us to a screen which looks like this.

[Figure: Storage tab listing the RDDs currently cached in the cluster]

Here we can see all the RDDs currently stored, and by clicking on the “RDD Name” link we can see exactly on which nodes the data is being stored. This is helpful when trying to understand whether your RDDs are in memory or on disk. For this example I ran two quick RDD operations from the Spark Shell:


    val rdd1 = sc.cassandraTable("ascii_ks","ascii_cs").map( row => row.getString("pkey") ).cache()
    // Cache tells Spark to save this rdd in memory after loading it
    rdd1.count
    val rdd2 = sc.cassandraTable("bigint_ks","bigint_cs").map( row => row.getString("pkey") ).cache()
    rdd2.count

Cache commands indicate that Spark should keep these RDDs in memory. This will not cause the RDD to be cached instantly; instead, it will be cached the next time it is loaded into memory.
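A minimal sketch of that laziness (the numbers are arbitrary):

    // a minimal sketch: cache() is lazy, nothing is stored until the first action runs
    val cached = sc.parallelize(1 to 100000).cache()   // marked for caching, not yet materialized
    cached.count     // the first action computes the RDD and stores its partitions in memory
    cached.count     // later actions read from the cached copy
    cached.unpersist()   // drop it from storage when you no longer need it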

That’s all for my most frequent troubleshooting tips; hopefully we’ll be able to enhance this guide and provide even more formal documentation for Spark. Check back soon!

Reposted from: http://www.datastax.com/dev/blog/common-spark-troubleshooting
