
Hadoop Spark

manonline
Published on 2017/07/26 00:08

Resilient Distributed Dataset

RDDs -> Transformation -> ... -> Transformation -> RDDs -> Action -> Result/Persistent Storage

  • Resilient means that Spark can automatically reconstruct a lost partition by RECOMPUTING IT FROM THE RDDS that it was computed from.
  • Dataset means a read-only collection of objects.
  • Distributed means partitioned across the cluster.

Loading an RDD or performing a TRANSFORMATION on one does not trigger any data processing; it merely creates a plan for performing the computation. The computation is triggered only when an ACTION is called.

  • If a function's return type is an RDD, the function is a TRANSFORMATION
  • Otherwise, it is an ACTION (see the sketch below)
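
A minimal sketch of this rule (names such as inputPath and lineLengths are illustrative): map returns an RDD, so nothing runs until an action such as reduce or count is called.

// Transformations return RDDs and are lazy: these lines only build a plan
val lines = sc.textFile(inputPath)          // RDD[String]
val lineLengths = lines.map(_.length)       // RDD[Int] -- still no processing

// Actions return non-RDD values and trigger the computation
val totalLength = lineLengths.reduce(_ + _) // Int  -- a job runs here
val numLines = lineLengths.count()          // Long -- another job runs here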

Java RDD API

  • JavaRDDLike Interface
    • JavaRDD
    • JavaPairRDD (key-value pairs)

RDD Creation

  • From an in-memory collection of objects (Parallelizing a Collection)
// RDD : 10 input values, i.e. 1 to 10, with a parallelization level of 5
val params = sc.parallelize(1 to 10, 5)

// Computation : the values are passed to the function and the computation runs in parallel
val result = params.map(performExtensiveComputation)
  • Using a dataset from external storage
    • In the following example, Spark uses TextInputFormat (the same as in the old MapReduce API) to split and read the file. So by default, in the case of HDFS, there is one Spark partition per HDFS block.
import org.apache.spark.rdd.RDD
import org.apache.hadoop.io.{IntWritable, Text}

// TextInputFormat
val text: RDD[String] = sc.textFile(inputPath)

// Sequence file
sc.sequenceFile[IntWritable, Text](inputPath)
// For common Writable types, Spark can map them to their Java equivalents
sc.sequenceFile[Int, String](inputPath)

// Use newAPIHadoopFile() and newAPIHadoopRDD()
// to create RDDs from an arbitrary Hadoop InputFormat, such as one for HBase
  • Transforming an existing RDD
    • Transformations: mapping, grouping, aggregating, repartitioning, sampling, and joining RDDs.
    • Actions: materializing an RDD as a collection, computing statistics on an RDD, sampling a fixed number of elements from an RDD, saving an RDD to external storage (a few of these are sketched below).
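
A small sketch of a few of these operations (the datasets and outputPath are made up for illustration): aggregating and joining pair RDDs are transformations; collect() and saveAsTextFile() are actions.

// Transformations on pair RDDs: aggregating and joining (still lazy)
val sales = sc.parallelize(Seq(("a", 10), ("b", 20), ("a", 5)))
val names = sc.parallelize(Seq(("a", "Apples"), ("b", "Bananas")))

val totals = sales.reduceByKey(_ + _)       // RDD[(String, Int)]
val joined = totals.join(names)             // RDD[(String, (Int, String))]

// Action: materialize the RDD as a local collection
joined.collect().foreach(println)
// Action: save the RDD to external storage
joined.saveAsTextFile(outputPath)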

Cache

Spark can cache a dataset in a cluster-wide in-memory cache, which means that any further computation on that dataset will be faster. MapReduce, however, has to load the input dataset from disk again in order to perform another calculation on it. Even if an intermediate dataset can be used as the input, there is no getting away from the fact that it has to be loaded from disk again.

This turns out to be tremendously helpful for interactive exploration of data, for example computing the max, min, and average of the same dataset.
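
A sketch of that interactive pattern (the parsing is simplified and illustrative): cache the parsed dataset once, then run several actions over it without re-reading it from disk.

// Parse once and mark the RDD to be kept in the cluster's memory
val values = sc.textFile(inputPath).map(_.toDouble).cache()

// Each action below is a separate job, but only the first one reads from disk
val max  = values.max()
val min  = values.min()
val mean = values.mean()    // mean() is available on RDD[Double]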

Storage level

  • MEMORY_ONLY (the default used by cache(): deserialized objects in memory)
  • MEMORY_ONLY_SER (serialized objects in memory, saving space at the cost of CPU)
  • MEMORY_AND_DISK (spill partitions to disk if they do not fit in memory)
  • MEMORY_AND_DISK_SER (serialized in memory, spilling to disk when necessary)
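
cache() is shorthand for persisting with the default MEMORY_ONLY level; the other levels are chosen explicitly through persist(), as in this sketch:

import org.apache.spark.storage.StorageLevel

// Keep the data serialized in memory and spill partitions to disk if they do not fit
val parsed = sc.textFile(inputPath).map(_.toDouble)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

parsed.count()    // the first action populates the cache
parsed.sum()      // later actions reuse the persisted partitions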

Spark Job

  • The application (SparkContext) serves to group RDDs and shared variables (see the sketch after this list).
  • A job always runs in the context of an application.
    • An application can run more than one job, in series or in parallel.
    • An application provides the mechanism for a job to access an RDD that was cached by a previous job in the same application.
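
The shared variables grouped by the SparkContext are broadcast variables and accumulators; a minimal sketch (assuming the Spark 2.x accumulator API, with made-up data):

// Broadcast variable: a read-only value shipped once to every executor
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: executors add to it, and the driver reads it after an action
val missing = sc.longAccumulator("missing keys")

val keys = sc.parallelize(Seq("a", "b", "c"))
keys.foreach { k =>
  if (!lookup.value.contains(k)) missing.add(1)
}

println(missing.value)    // 1 -- readable on the driver once the job has run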

Job Run

  • Driver: hosts the application (SparkContext) and schedules the tasks for a job.
  • Executors: execute the application's tasks.
  • Job submission (Application -> Job -> Stages -> Tasks)
    • Calling any ACTION on an RDD submits a job automatically.
    • runJob() is called on the SparkContext.
    • The schedulers are invoked:
      • The DAG scheduler breaks the job into a DAG of stages.
      • The task scheduler submits the tasks from each stage to the cluster.
    • Task execution (the lineage that the DAG scheduler works from can be inspected as shown below).
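
A small sketch of inspecting that lineage before the job is submitted (the word-count pipeline is illustrative; the shuffle introduced by reduceByKey is what creates a stage boundary):

val counts = sc.textFile(inputPath)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)     // shuffle: the DAG scheduler splits stages here

// Print the RDD lineage that the stages are derived from
println(counts.toDebugString)

// Calling an action submits the job: runJob() is invoked on the SparkContext,
// the DAG scheduler builds the stages, and the task scheduler launches the tasks
counts.count()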

Cluster Resource Manager

  • Local
  • Standalone
  • Mesos
  • YARN
    • YARN client mode:
      • Client -> driver -> SparkContext
      • SparkContext -> YARN application -> YARN Resource Manager
      • A YARN node runs the application master, Spark's ExecutorLauncher
    • YARN cluster mode: the driver runs inside an application master process (a sketch of selecting the mode follows).
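
A minimal sketch of selecting client mode from application code (assuming a Spark 2.x build with YARN support and a made-up application name); cluster mode cannot be chosen from an already-running driver, so it is normally requested at submit time with spark-submit --master yarn --deploy-mode cluster.

import org.apache.spark.{SparkConf, SparkContext}

// Client mode: the driver (and its SparkContext) runs in this client process,
// while YARN starts an ExecutorLauncher-based application master to request executors
val conf = new SparkConf()
  .setAppName("yarn-client-demo")             // hypothetical application name
  .setMaster("yarn")
  .set("spark.submit.deployMode", "client")

val sc = new SparkContext(conf)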

 

 

© Copyright belongs to the author
