
Hadoop Spark

manonline · Published 2017/07/26

Resilient Distributed Dataset

RDDs -> Transformation -> ... -> Transformation -> RDDs -> Action -> Result/Persistent Storage

  • Resilient means that Spark can automatically reconstruct a lost partition by RECOMPUTING IT FROM THE RDDs it was derived from.
  • Dataset means a READ-ONLY collection of objects.
  • Distributed means PARTITIONED across the cluster.

Loading an RDD or performing a TRANSFORMATION on one does not trigger any data processing; it merely creates a plan for performing the computation. The computation is triggered only when an ACTION is called.

  • If the return type is an RDD, the function is a TRANSFORMATION.
  • Otherwise, it is an ACTION (see the sketch below).
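
A minimal sketch of this laziness, assuming a hypothetical log file path: the two transformations only record lineage, and nothing is computed until count() is called.

// Transformations: each returns an RDD and merely extends the plan
val lines = sc.textFile("hdfs:///logs/app.log")
val errors = lines.filter(_.contains("ERROR"))

// Action: returns a Long, triggering the actual computation
val numErrors = errors.count()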

Java RDD API

  • JavaRDDLike Interface
    • JavaRDD
    • JavaPairRDD (key-value pairs)

RDD Creation

  • From an in-memory collection of objects (parallelizing a collection)
// RDD: 10 input values, i.e. 1 to 10, with a parallelization level of 5
val params = sc.parallelize(1 to 10, 5)

// Computation: the values are passed to the function and the computation runs in parallel
val result = params.map(performExtensiveComputation)
  • Using a dataset from external storage
    • In the following example, Spark uses TextInputFormat (the same as in the old MapReduce API) to split and read the file. So by default, in the case of HDFS, there is one Spark partition per HDFS block.
// TextInputFormat
val text: RDD[String] = sc.textFile(inputPath)

// Sequence file
sc.sequenceFile[IntWritable, Text](inputPath)
// For common Writable types, Spark can map them to their Scala equivalents
sc.sequenceFile[Int, String](inputPath)

// Use newAPIHadoopFile() and newAPIHadoopRDD()
// to create RDDs from an arbitrary Hadoop InputFormat, such as HBase
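
For illustration, a hedged sketch of newAPIHadoopRDD() against an HBase table; the table name is made up, and the input format and key/value classes come from the standard org.apache.hadoop.hbase packages.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // hypothetical table name

// Elements are (row key, row contents) pairs
val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])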
  • Transforming an existing RDD
    • Transformations: mapping, grouping, aggregating, repartitioning, sampling, and joining RDDs (a pair-RDD example is sketched below).
    • Actions: materializing an RDD as a collection, computing statistics on an RDD, sampling a fixed number of elements from an RDD, saving an RDD to external storage.
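
A minimal sketch of a key-value transformation chain ending in an action; the data is made up.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val sums = pairs.reduceByKey(_ + _)   // transformation: aggregates values by key
val result = sums.collect()           // action: materializes the RDD as a local Array[(String, Int)]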

Cache

Spark caches datasets in a cross-cluster in-memory cache, so any further computation on those datasets is faster. MapReduce, by contrast, has to load the input dataset from disk again to perform another calculation on it; even when an intermediate dataset could serve as input, there is no getting away from the fact that it must be read from disk.

This turns out to be tremendously helpful for interactive exploration of data, for example computing the max, min, and mean of the same dataset (see the sketch after the storage levels below).

Storage levels

  • MEMORY_ONLY
  • MEMORY_ONLY_SER (serialized: more compact, but more CPU to read)
  • MEMORY_AND_DISK (spills to disk when memory runs out)
  • MEMORY_AND_DISK_SER
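
A minimal caching sketch under an assumed file path; after persist(), later actions reuse the in-memory copy instead of re-reading the file.

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///data/measurements.txt").map(_.toDouble)
data.persist(StorageLevel.MEMORY_ONLY)   // equivalent to data.cache()

val max = data.max()    // first action: reads the file, computes, and populates the cache
val min = data.min()    // subsequent actions hit the cache
val avg = data.mean()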

Spark Job

  • The application (one SparkContext) serves to group RDDs and shared variables.
  • A job always runs in the context of an application.
    • An application can run more than one job, in series or in parallel.
    • An application provides the mechanism for a job to access an RDD that was cached by a previous job in the same application (as in the caching sketch above).

Job Run

  • Driver: hosts the application (SparkContext) and schedules the tasks for a job.
  • Executor: executes the application's tasks.
  • Job submission (Application -> Job -> Stages -> Tasks):
    • Calling any RDD action submits a job automatically.
    • runJob() is called on the SparkContext.
    • The schedulers are invoked:
      • The DAG scheduler breaks the job into a DAG of stages.
      • The task scheduler submits the tasks from each stage to the cluster.
    • Task execution (the stage structure can be inspected beforehand, as sketched below).
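
A sketch of inspecting the lineage that the DAG scheduler will break into stages, assuming hypothetical input and output paths.

val counts = sc.textFile("hdfs:///data/words.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // introduces a shuffle, hence a stage boundary

println(counts.toDebugString)                       // prints the lineage, indented by stage
counts.saveAsTextFile("hdfs:///data/word-counts")   // action: submits the job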

Cluster Resource Manager

  • Local
  • Standalone
  • Mesos
  • YARN
    • YARN client mode:
      • The client starts the driver, which creates the SparkContext.
      • The SparkContext submits a YARN application to the YARN resource manager.
      • A YARN node manager launches the application master (Spark's ExecutorLauncher), which requests containers for the executors.
    • YARN cluster mode: the driver itself runs in the application master process.

 

 
