Understanding Spark Caching

我是彩笔
Posted on 2015/11/25 10:35

Spark excels at processing in-memory data.  We are going to look at various caching options and their effects, and (hopefully) provide some tips for optimizing Spark memory caching.

When caching an RDD in memory, Spark offers two main options:

1. Raw storage

2. Serialized

Here are some differences between the two options:

| Raw caching | Serialized caching |
|---|---|
| Pretty fast to process | Slower to process than raw caching |
| Can take up 2x-4x more space; for example, 100MB of cached data could consume 350MB of memory | Memory overhead is minimal |
| Can put pressure on the JVM and its garbage collection | Less pressure on the JVM |
| Usage: rdd.persist(StorageLevel.MEMORY_ONLY) or rdd.cache() | Usage: rdd.persist(StorageLevel.MEMORY_ONLY_SER) |
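As a minimal sketch of the two usage lines above, the following Scala program caches the same data once at each storage level. The local master, the app name, and the synthetic records are assumptions for illustration; they are not part of the original experiment.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingLevels {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode and a small synthetic data set, for illustration only.
    val conf = new SparkConf().setAppName("caching-levels").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(1 to 100000).map(i => s"record-$i")

      // Raw caching: partitions are kept as deserialized Java objects.
      // rdd.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
      val raw = lines.map(identity).persist(StorageLevel.MEMORY_ONLY)

      // Serialized caching: partitions are kept as serialized bytes.
      val serialized = lines.map(identity).persist(StorageLevel.MEMORY_ONLY_SER)

      // Caching is lazy; an action such as count() materializes the cache.
      println(s"raw: count=${raw.count()}, level=${raw.getStorageLevel.description}")
      println(s"serialized: count=${serialized.count()}, level=${serialized.getStorageLevel.description}")

      // Release the cached blocks once they are no longer needed.
      raw.unpersist()
      serialized.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Two separate RDDs are used here because an RDD that already has a storage level cannot be given a different one without calling unpersist() first.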

So what are the trade-offs?

Here is a quick experiment.  I cache a bunch of RDDs using both options and measure memory footprint and processing time.  My RDDs range in size from 100MB to 1GB.

Testing environment:

3-node Spark cluster running on Amazon EC2 (m1.large instances with 8G memory per node)

Data files are read from an S3 bucket

Testing method:

$   ./bin/spark-shell  --driver-memory 8g
> import org.apache.spark.storage.StorageLevel
> val f = sc.textFile("s3n://bucket_path/1G.data")
> f.persist(StorageLevel.MEMORY_ONLY)  // cache option under test: MEMORY_ONLY or MEMORY_ONLY_SER
> f.count()  // run this a few times and record how long each run takes
// also check the RDD's memory size in the Spark application UI, under the 'Storage' tab
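The measurement steps above can also be scripted instead of timed by hand. The sketch below is one possible helper, not the harness used for the original numbers: the object and method names are hypothetical, and getRDDStorageInfo is a Spark developer API whose details may differ between versions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper names; only the Spark calls inside are real APIs.
object CacheBenchmark {
  /** Runs `body` once and returns its result with the elapsed wall-clock time in ms. */
  def timeMs[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  /** Calls count() `runs` times and returns the elapsed time of each run in ms. */
  def timeCounts(rdd: RDD[_], runs: Int): Seq[Long] =
    (1 to runs).map(_ => timeMs(rdd.count())._2)

  /** Prints the in-memory size of every persisted RDD known to the driver. */
  def printCachedSizes(sc: SparkContext): Unit =
    sc.getRDDStorageInfo.foreach { info =>
      println(f"RDD ${info.id}%d: ${info.memSize / (1024.0 * 1024.0)}%.1f MB in memory, " +
        s"${info.numCachedPartitions}/${info.numPartitions} partitions cached")
    }
}
```

In spark-shell, after persisting `f`, something like `CacheBenchmark.timeCounts(f, 5)` followed by `CacheBenchmark.printCachedSizes(sc)` would give the timings and the cached size. The first run includes reading the file and filling the cache, so only the later runs reflect cached performance.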

On to the results:

| Data size | 100M | 500M | 1000M (1G) |
|---|---|---|---|
| Memory footprint (MB), raw | 373.8 | 1,869.2 | 3,788.8 |
| Memory footprint (MB), serialized | 107.5 | 537.6 | 1,075.1 |
| count() time (ms), cached raw | 90 | 130 | 178 |
| count() time (ms), cached serialized | 610 | 1,802 | 3,448 |
| count() time (ms), before caching | 3,220 | 27,063 | 105,618 |


Conclusions

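Raw caching has a bigger memory footprint, about 2x to 4x the size of the data (around 3.7x in this test, e.g. a 100MB RDD becomes roughly 370MB)

Serialized caching consumes almost the same amount of memory as the data itself (plus some overhead, under 10% in this test)

Raw cache is very fast to process, and it scales pretty well

Processing serialized cached data takes longer (roughly 7x to 19x longer than raw in this test), but it is still much faster than processing uncached data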
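To make the arithmetic behind these conclusions explicit, the short Scala sketch below recomputes the ratios from the measurements in the results table. The object and value names are made up for this example; the numbers are copied from the table.

```scala
object CachingRatios {
  // Measurements copied from the results table (100M, 500M, 1000M data sizes).
  val dataMb: Seq[Double]       = Seq(100.0, 500.0, 1000.0)
  val rawMb: Seq[Double]        = Seq(373.8, 1869.2, 3788.8)
  val serializedMb: Seq[Double] = Seq(107.5, 537.6, 1075.1)
  val rawMs: Seq[Double]        = Seq(90.0, 130.0, 178.0)
  val serializedMs: Seq[Double] = Seq(610.0, 1802.0, 3448.0)
  val uncachedMs: Seq[Double]   = Seq(3220.0, 27063.0, 105618.0)

  /** Element-wise ratio a(i) / b(i). */
  def ratios(a: Seq[Double], b: Seq[Double]): Seq[Double] =
    a.zip(b).map { case (x, y) => x / y }

  def main(args: Array[String]): Unit = {
    def show(label: String, rs: Seq[Double]): Unit =
      println(label + rs.map(r => f"$r%.2fx").mkString(", "))

    show("raw footprint / data size:        ", ratios(rawMb, dataMb))
    show("serialized footprint / data size: ", ratios(serializedMb, dataMb))
    show("serialized time / raw time:       ", ratios(serializedMs, rawMs))
    show("uncached time / serialized time:  ", ratios(uncachedMs, serializedMs))
  }
}
```

The footprint ratios stay nearly constant across data sizes, while the gap in count() time between serialized and raw caching widens as the data grows.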

So what does all this mean?

For small data sets (a few hundred megabytes), raw caching works well.  Even though it consumes more memory, the small size won't put too much pressure on Java garbage collection.

Raw caching is also good for iterative workloads (say, a bunch of iterations over the same data), because processing the cached data is very fast.

For medium to large data sets (tens or hundreds of gigabytes), serialized caching is helpful because it does not consume much more memory than the data itself, and garbage collecting gigabytes of deserialized objects can be taxing.
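The rule of thumb above could be encoded as a small helper, sketched below. The article only calls a few hundred megabytes small and tens of gigabytes large, so the 1 GB cutoff is purely an assumption to be tuned per cluster, and the object and method names are hypothetical.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical helper; only StorageLevel comes from Spark.
object CacheAdvisor {
  // Assumed cutoff: the article gives no exact boundary between "small" and "large".
  val DefaultRawLimitBytes: Long = 1024L * 1024L * 1024L

  /** Picks raw caching for small data sets and serialized caching for larger ones. */
  def chooseLevel(estimatedBytes: Long,
                  rawLimitBytes: Long = DefaultRawLimitBytes): StorageLevel =
    if (estimatedBytes <= rawLimitBytes) StorageLevel.MEMORY_ONLY
    else StorageLevel.MEMORY_ONLY_SER
}
```

A call such as `rdd.persist(CacheAdvisor.chooseLevel(estimatedBytes))` would then apply the choice, where `estimatedBytes` is whatever size estimate is available for the data.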


Reposted from: http://sujee.net/2015/01/22/understanding-spark-caching/#.VlUcx3YrLRa
