
Hadoop Spark

manonline
Published 2017/07/26 00:08

Resilient Distributed Dataset

RDDs -> Transformation -> ... -> Transformation -> RDDs -> Action -> Result/Persistent Storage

  • Resilient means that Spark can automatically reconstruct a lost partition by RECOMPUTING IT FROM THE RDDs it was computed from.
  • Dataset means a read-only collection of objects.
  • Distributed means partitioned across the cluster.

Loading an RDD or performing a TRANSFORMATION on one does not trigger any data processing; it merely creates a plan for performing the computation. The computation is triggered only when an ACTION is called.

  • If an operation's return type is an RDD, it is a TRANSFORMATION.
  • Otherwise, it is an ACTION (see the sketch below).
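
A minimal sketch of this (assuming a SparkContext sc; the computation itself is made up):

val nums = sc.parallelize(1 to 1000)
val squares = nums.map(x => x * x)   // TRANSFORMATION: returns an RDD, so nothing runs yet
val total = squares.reduce(_ + _)    // ACTION: returns an Int, so the job is triggered here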

Java RDD API

  • JavaRDDLike Interface
    • JavaRDD
    • JavaPairRDD (key-value pairs)

RDD Creation

  • From an in-memory collection of objects (Parallelizing a Collection)
// RDD : 10 input values, i.e. 1 to 10, with a parallelism level of 5
val params = sc.parallelize(1 to 10, 5)

// Computation : each value is passed to the function and the computation runs in parallel
val result = params.map(performExtensiveComputation)
  • Using a dataset from external storage
    • In the following example, Spark uses TextInputFormat (the same as in the old MapReduce API) to split and read the file. So by default, in the case of HDFS, there is one Spark partition per HDFS block.
// TextInputFormat
val text: RDD[String] = sc.textFile(inputPath)

// Sequence file
sc.sequenceFile[IntWritable, Text](inputPath)
// For common Writable types, Spark can map them to their Java equivalents
sc.sequenceFile[Int, String](inputPath)

// Use newAPIHadoopFile() and newAPIHadoopRDD() to create RDDs from an
// arbitrary Hadoop InputFormat, such as one for HBase
  • Transforming an existing RDD (see the sketch after this list)
    • Transformations: mapping, grouping, aggregating, repartitioning, sampling, and joining RDDs.
    • Actions: materializing an RDD as a collection, computing statistics on an RDD, sampling a fixed number of elements from an RDD, saving an RDD to external storage.
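
A brief sketch of these categories (the data and output path are illustrative only):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Transformations: aggregating and sampling
val summed  = pairs.reduceByKey(_ + _)
val sampled = pairs.sample(withReplacement = false, fraction = 0.5)

// Actions: materialize as a collection, compute a statistic, save to external storage
summed.collect()
summed.count()
summed.saveAsTextFile(outputPath)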

Cache

Spark caches datasets in a cross-cluster in-memory cache, which means any further computation on those datasets is faster. MapReduce, by contrast, has to load the dataset from disk again to perform another calculation on the same input; even when an intermediate dataset could be used as input, there is no getting away from the fact that it has to be read from disk.

This turns out to be tremendously helpful for interactive exploration of data, for example, getting the max, min and average on the same dataset.
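
For example (a sketch; the input file of numeric readings is hypothetical):

val readings = sc.textFile(inputPath).map(_.toDouble).cache()

val max = readings.max()                       // first action loads the data from disk, then caches it
val min = readings.min()                       // answered from the in-memory cache
val avg = readings.sum() / readings.count()    // also served from the cache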

Storage level

  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
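
cache() is shorthand for persist() with the default MEMORY_ONLY level; the other levels trade extra CPU (serialization) or disk for less memory, and can be requested explicitly. A sketch, assuming data is any RDD:

import org.apache.spark.storage.StorageLevel

data.persist(StorageLevel.MEMORY_AND_DISK_SER)   // keep serialized partitions in memory, spilling to disk when needed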

Spark Job

  • The application (SparkContext) serves to group RDDs and shared variables.
  • A job always runs in the context of an application.
    • An application can run more than one job, in series or in parallel.
    • An application provides the mechanism for a job to access an RDD that was cached by a previous job in the same application (see the sketch below).
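
A sketch of one application running two jobs that share a cached RDD and a broadcast (shared) variable; the input path and stop-word set are made up:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("example"))

val stopWords = sc.broadcast(Set("the", "a", "an"))   // shared variable
val words = sc.textFile(inputPath)
  .flatMap(_.split("\\s+"))
  .filter(word => !stopWords.value(word))
  .cache()

val totalWords    = words.count()              // job 1: computes and caches the RDD
val distinctWords = words.distinct().count()   // job 2: reuses the RDD cached by job 1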

Job Run

  • Driver: hosts the application (SparkContext) and schedules tasks for a job.
  • Executor: executes the application's tasks.
  • Job Submission (Application -> Job -> Stages -> Tasks), sketched after this list:
    • Calling any ACTION on an RDD submits a job automatically.
    • runJob() is called on the SparkContext.
    • The schedulers are invoked:
      • The DAG Scheduler breaks the job into a DAG of stages.
      • The Task Scheduler submits the tasks from each stage to the cluster.
    • Task Execution
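
A sketch of this flow (the input path is hypothetical); toDebugString prints the planned lineage, and the shuffle introduced by reduceByKey marks a stage boundary:

val counts = sc.textFile(inputPath)
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)            // shuffle => new stage

println(counts.toDebugString)    // shows the plan; no job has run yet
counts.collect()                 // ACTION => SparkContext.runJob() => DAG of stages => tasks on executors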

Cluster Resource Manager

  • Local
  • Standalone
  • Mesos
  • YARN
    • YARN Client Mode:
      • Client -> driver -> SparkContext
      • SparkContext -> YARN application -> YARN Resource Manager
      • YARN Node -> Application Master running the Spark ExecutorLauncher
    • YARN Cluster Mode: the driver runs in a YARN Application Master process (see the sketch below).
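
For illustration only, the deploy mode is usually chosen when submitting the application with spark-submit (the class and jar names here are hypothetical):

# YARN client mode: the driver runs in the client process that launched spark-submit
spark-submit --class com.example.MyApp --master yarn --deploy-mode client myapp.jar

# YARN cluster mode: the driver runs inside the YARN Application Master
spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster myapp.jar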

 

 
