
100 open source Big Data architecture papers

naughty · Published 2016/04/05 09:36


Big Data technology has been extremely disruptive, with open source playing a dominant role in shaping its evolution. While on one hand it has been disruptive, on the other it has led to a complex ecosystem where new frameworks, libraries and tools are released pretty much every day, creating confusion as technologists grapple with the deluge.

If you are a Big Data enthusiast or a technologist ramping up (or scratching your head), it is important to spend some serious time deeply understanding the architecture of key systems to appreciate their evolution. Understanding the architectural components and subtleties will also help you choose and apply the appropriate technology for your use case. In my journey over the last few years, some literature has helped me become a better educated data professional. My goal here is not only to share that literature but also to use the opportunity to put some sanity into the labyrinth of open source systems.

One caution: most of the references included are heavily skewed towards deep architectural overviews (in most cases original research papers) rather than basic introductions. I firmly believe the deep dives will fundamentally help you understand the nuances, though they will not provide you with any shortcuts if you just want a quick overview.

Jumping right in…

Key architecture layers

  • File Systems - Distributed file systems which provide storage, fault tolerance, scalability, reliability, and availability.

  • Data Stores – Evolution of application databases into Polyglot storage, with application-specific databases instead of one size fits all. Common ones are Key-Value, Document, Column and Graph.

  • Resource Managers – provide resource management capabilities and support schedulers for high utilization and throughput.

  • Coordination – systems that manage state, distributed coordination, consensus and lock management.

  • Computational Frameworks – a lot of work is happening at this layer with highly specialized compute frameworks for Streaming, Interactive, Real Time, Batch and Iterative Graph (BSP) processing. Powering these are complete computation runtimes like BDAS (Spark) & Flink.

  • Data Analytics – Analytical (consumption) tools and libraries, which support exploratory, descriptive, predictive, statistical analysis and machine learning.

  • Data Integration – these include not only the orchestration tools for managing pipelines but also metadata management.

  • Operational Frameworks – these provide scalable frameworks for monitoring & benchmarking.

Architecture Evolution

The modern data architecture is evolving with the goal of reducing latency between data producers and consumers. This is leading to real time, low latency processing that bridges the traditional batch and interactive layers into hybrid architectures like Lambda and Kappa.

  • Lambda - Established architecture for a typical data pipeline; a minimal sketch of its serving path follows this list. More details.

  • Kappa – An alternative architecture which moves the processing upstream to the Stream layer.

  • Summingbird – a reference model on bridging the online and traditional processing models.
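To make the Lambda pattern concrete, here is a minimal sketch of its serving path, assuming a simple page-count use case (all names here are illustrative, not from any specific framework): a precomputed batch view is merged with a fresh real-time view at query time.

```python
from collections import defaultdict

# Lambda architecture, serving-layer sketch: a query merges a stale
# but complete batch view with a fresh but partial real-time view
# maintained by the speed layer.

batch_view = {"page_a": 1000, "page_b": 400}   # recomputed periodically from the master dataset
realtime_view = defaultdict(int)               # updated as events stream in

def speed_layer_ingest(event):
    """Speed layer: count events not yet reflected in the batch view."""
    realtime_view[event["page"]] += 1

def query(page):
    """Serving layer: merge both views to answer with low latency."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

for e in [{"page": "page_a"}, {"page": "page_c"}]:
    speed_layer_ingest(e)

print(query("page_a"))  # 1001 = 1000 (batch) + 1 (real-time)
print(query("page_c"))  # 1, seen only by the speed layer so far
```

Kappa, by contrast, drops the batch path entirely and rebuilds state by replaying the stream from the log.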

Before you deep dive into the actual layers, here are some general documents which can provide you with a great background on NoSQL, Warehouse Scale Computing and Distributed Systems.

  • Data center as a computer – provides a great background on warehouse scale computing.

  • NoSQL Data Stores – background on a diverse set of key-value, document and column oriented stores.

  • NoSQL Thesis – great background on distributed systems, first generation NoSQL systems.

  • Large Scale Data Management - covers the data model, the system architecture and the consistency model, ranging from traditional database vendors to new emerging internet-based enterprises. 

  • Eventual Consistency – background on the different consistency models for distributed systems; a toy convergence sketch follows this list.

  • CAP Theorem – a nice background on CAP and its evolution.
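As a concrete illustration of the weaker end of the spectrum, here is a toy last-write-wins (LWW) reconciliation between two replicas that accepted writes independently. LWW is just one of the reconciliation strategies the references above discuss, and as the sketch shows, it silently discards concurrent updates; this is one reason Dynamo uses vector clocks instead.

```python
# Eventual consistency, toy sketch: two replicas accept writes
# independently and later converge via last-write-wins (LWW),
# keeping a (timestamp, value) pair per key.

def write(replica, key, value, ts):
    cur = replica.get(key)
    if cur is None or ts > cur[0]:
        replica[key] = (ts, value)

def anti_entropy(a, b):
    """Background reconciliation: the higher timestamp wins on both sides."""
    for key in set(a) | set(b):
        ta, tb = a.get(key, (-1, None)), b.get(key, (-1, None))
        winner = ta if ta[0] >= tb[0] else tb
        a[key], b[key] = winner, winner

r1, r2 = {}, {}
write(r1, "cart", ["book"], ts=1)         # client writes to replica 1
write(r2, "cart", ["book", "pen"], ts=2)  # concurrent write lands on replica 2

anti_entropy(r1, r2)                      # replicas reconcile in the background
assert r1 == r2                           # converged on the ts=2 value
print(r1["cart"])                         # (2, ['book', 'pen']); the ts=1 write is lost
```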

There has also been a fierce debate in the past between the traditional parallel DBMS camp and the MapReduce paradigm of processing. The pro parallel DBMS paper (and another) was rebutted by the pro MapReduce one. Ironically, the Hadoop community has since come full circle with the introduction of MPP style shared nothing processing on Hadoop - SQL on Hadoop.

File Systems 

As the focus shifts to low latency processing, there is a move from traditional disk based file systems to in-memory file systems, which drastically reduce I/O and disk serialization cost. Tachyon and Spark RDD are examples of that evolution.

File Systems have also seen an evolution in file formats and compression techniques. The following references give you a great background on the merits of row and column formats and the shift towards newer nested column oriented formats which are highly efficient for Big Data processing. Erasure codes use innovative techniques to reduce the cost of triplication (3 replicas) without compromising data recoverability and availability; a toy parity example follows this list.

  • Column Oriented vs Row-Stores – good overview of data layout, compression and materialization.

  • RCFile – Hybrid PAX structure which takes the best of both the column and row oriented stores.

  • Parquet – column oriented format first covered in Google’s Dremel paper.

  • ORCFile – an improved column oriented format used by Hive.

  • Compression – compression techniques and their comparison on the Hadoop ecosystem.

  • Erasure Codes – background on erasure codes and techniques; improvement on the default triplication on Hadoop to reduce storage cost.
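To see why erasure codes cut storage cost relative to triplication, here is a toy single-parity sketch in the spirit of RAID. Production systems (e.g., HDFS erasure coding) use Reed-Solomon codes that tolerate multiple failures, but the storage-versus-recovery trade-off is the same idea.

```python
# Erasure coding, toy sketch: k data blocks plus one XOR parity block
# survive the loss of any single block at 1/k storage overhead,
# versus the 200% overhead of keeping 3 replicas.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 data blocks
parity = xor_blocks(data)            # 1 parity block -> ~33% overhead

# Lose any one data block, then rebuild it from the survivors + parity.
lost = 1
survivors = [b for i, b in enumerate(data) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[lost]
print(rebuilt)  # b'BBBB'
```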

Data Stores

Broadly, distributed data stores are classified into ACID and BASE stores, along the continuum from strong to weak consistency respectively. BASE is further classified into Key-Value, Document, Column and Graph stores, depending on the underlying schema and supported data structures. While there is a multitude of systems and offerings in this space, I have covered a few of the more prominent ones. I apologize if I have missed a significant one...

BASE

Key Value Stores

Dynamo – Amazon’s key-value distributed storage system; a toy sketch of its consistent hashing follows this list.

Cassandra – Inspired by Dynamo; a multi-dimensional key-value/column oriented data store.

Voldemort – another one inspired by Dynamo, developed at LinkedIn.
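A core technique shared by Dynamo and its descendants is consistent hashing for partitioning. Below is a minimal sketch with virtual nodes; real systems layer replication, preference lists and vector clocks on top of this ring.

```python
import bisect
import hashlib

# Consistent hashing, toy sketch: nodes are placed at many points on a
# hash ring ("virtual nodes"); a key is owned by the first node at or
# after its hash, wrapping around the ring.

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        self.ring = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def owner(self, key):
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:42"))  # deterministic owner for this key

# Adding or removing a node only remaps the keys adjacent to its ring
# positions; with modulo hashing, nearly every key would move.
```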

Column Oriented Stores

BigTable – seminal paper from Google on distributed column oriented data stores.

HBase – while there is no definitive paper, this provides a good overview of the technology.

Hypertable – provides a good overview of the architecture.

Document Oriented Stores

CouchDB – a popular document oriented data store.

MongoDB – a good introduction to MongoDB architecture.

Graph

Neo4j – most popular Graph database.

Titan – open source Graph database under the Apache license.

ACID

I see a lot of evolution happening in the open source community, which will try to catch up with what Google has done – 3 of the prominent papers below are from Google, which has solved the problem of a globally distributed consistent data store.

Megastore – a highly available distributed consistent database. Uses Bigtable as its storage subsystem.

Spanner – Globally distributed synchronously replicated linearizable database which supports SQL access.

Mesa – provides consistency, high availability, reliability, fault tolerance and scalability for large data and query volumes.

CockroachDB – An open source version of Spanner (led by former engineers) in active development.

Resource Managers

While the first generation of the Hadoop ecosystem started with monolithic schedulers like YARN, the evolution is now towards hierarchical schedulers (Mesos) that can manage distinct workloads across different kinds of compute frameworks, achieving higher utilization and efficiency.

YARN – The next generation Hadoop compute framework.

Mesos – scheduling between multiple diverse cluster computing frameworks.

These are loosely coupled with schedulers whose primary function is to schedule jobs based on scheduling policies/configuration; a toy fair-share simulation follows the list below.

Schedulers

Capacity Scheduler - introduction to different features of capacity scheduler. 

FairShare Scheduler - introduction to different features of fair scheduler.

Delayed Scheduling - introduction to Delayed Scheduling for FairShare scheduler.

Fair & Capacity schedulers – a survey of Hadoop schedulers.

Coordination

These are systems that are used for coordination and state management across distributed data systems.

Paxos – a simple version of the classical paper; used for distributed systems consensus and coordination.

Chubby – Google’s distributed locking service that implements Paxos.

ZooKeeper – open source system inspired by Chubby, though it is a general coordination service rather than simply a lock service; a toy simulation of its lock recipe follows below.
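To give a feel for what such a service provides, here is a toy in-memory simulation of the well-known ZooKeeper lock recipe (ephemeral sequential znodes, with the lowest sequence number holding the lock). This sketches the idea only; it is not the real ZooKeeper API, and all class and client names are invented for illustration.

```python
import itertools

# Coordination, toy sketch of the ZooKeeper lock recipe: each client
# creates a sequential "znode" under the lock path; whoever holds the
# lowest sequence number holds the lock, and deleting that node (on
# release, or when an ephemeral owner's session dies) passes it on.

class FakeLockPath:
    def __init__(self):
        self.seq = itertools.count()
        self.nodes = {}                # sequence number -> client name

    def create_sequential(self, client):
        n = next(self.seq)
        self.nodes[n] = client
        return n

    def holder(self):
        return self.nodes[min(self.nodes)] if self.nodes else None

    def delete(self, n):
        self.nodes.pop(n, None)

lock = FakeLockPath()
a = lock.create_sequential("client-a")
b = lock.create_sequential("client-b")   # queues behind client-a
print(lock.holder())   # client-a (lowest sequence number)
lock.delete(a)         # client-a releases, or its session expires
print(lock.holder())   # client-b now owns the lock
```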

Computational Frameworks

The execution runtimes provide an environment for running distinct kinds of compute. The most common runtimes are:

Spark – its popularity and adoption are challenging the traditional Hadoop ecosystem.
Flink – very similar to the Spark ecosystem; its strength over Spark is in iterative processing.

The frameworks can broadly be classified based on the model and latency of processing:

Batch

MapReduce – the seminal paper from Google on MapReduce; a toy in-process sketch of the model follows below.

MapReduce Survey – a dated, yet good, survey of MapReduce frameworks.
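For readers new to the model, here is a minimal in-process sketch of the map-shuffle-reduce phases on the canonical word count; the real framework runs each phase across many machines and handles failures and data movement.

```python
from collections import defaultdict

# MapReduce, toy in-process sketch: map emits (key, value) pairs, the
# shuffle groups values by key, and reduce folds each group.

def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                   # map phase
        for k, v in map_fn(line):
            groups[k].append(v)          # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

print(mapreduce(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```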

Iterative (BSP)

Pregel – Google’s paper on large scale graph processing.

Giraph – large-scale distributed graph processing system modelled around Pregel.

GraphX – graph computation framework that unifies graph-parallel and data-parallel computation.

Hama – general BSP computing engine on top of Hadoop.

Open source graph processing – a survey of open source systems modelled around Pregel’s BSP model; a toy superstep sketch follows this list.
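To illustrate the vertex-centric BSP model these systems share, here is a toy superstep loop computing connected components: in each superstep a vertex adopts the smallest label among its own and its neighbors’ messages, and the run halts when no vertex changes. Real Pregel-style engines partition vertices across workers and exchange the messages over the network between supersteps.

```python
# Pregel-style BSP, toy sketch: label propagation for connected
# components. Each superstep: read messages, keep the minimum label,
# and message neighbors only if the label changed (vote-to-halt style).

edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
label = {v: v for v in edges}        # initial label = vertex id

changed = set(edges)                 # vertices active this superstep
while changed:
    inbox = {v: [] for v in edges}
    for v in changed:                # "send" current label to neighbors
        for u in edges[v]:
            inbox[u].append(label[v])
    changed = set()
    for v, msgs in inbox.items():    # barrier, then compute next superstep
        best = min([label[v]] + msgs)
        if best < label[v]:
            label[v] = best
            changed.add(v)

print(label)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4} - two components
```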

Streaming

Stream Processing – a great overview of the distinct real time processing systems.

Storm – real time big data processing system.

Samza – stream processing framework from LinkedIn.

Spark Streaming – introduced the micro batch architecture, bridging the traditional batch and interactive processing; a toy sketch of micro batching follows below.
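A minimal sketch of the micro batch idea, with all names illustrative: events are buffered into short fixed-width intervals and each interval is processed as a small batch job, trading a little latency for batch-style throughput and recovery semantics.

```python
from collections import Counter

# Micro-batch streaming, toy sketch: chop an unbounded, time-ordered
# event stream into fixed-width buckets and run a tiny batch job
# (here, a count) over each bucket.

def micro_batches(events, interval=1.0):
    """Group (timestamp, value) events into consecutive intervals."""
    batch, batch_end = [], None
    for ts, value in events:          # events assumed in time order
        if batch_end is None:
            batch_end = ts + interval
        while ts >= batch_end:        # close out any finished intervals
            yield batch
            batch, batch_end = [], batch_end + interval
        batch.append(value)
    if batch:
        yield batch

stream = [(0.1, "a"), (0.4, "b"), (0.9, "a"), (1.2, "a"), (2.6, "c")]
for i, b in enumerate(micro_batches(stream)):
    print(i, Counter(b))              # each bucket processed as a batch
# 0 Counter({'a': 2, 'b': 1})
# 1 Counter({'a': 1})
# 2 Counter({'c': 1})
```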

Interactive

Dremel – Google’s paper on how it processes interactive big data workloads, which laid the groundwork for multiple open source SQL systems on Hadoop.

Impala – MPP style processing to make Hadoop performant for interactive workloads.

Drill – an open source implementation of Dremel.

Shark – provides a good introduction to the data analysis capabilities of the Spark ecosystem.

Shark – another great paper which goes deeper into SQL access.

Dryad – configuring & executing parallel data pipelines using a DAG.

Tez – open source implementation of Dryad using YARN.

BlinkDB - enabling interactive queries over data samples, presenting results annotated with meaningful error bars; a toy sampling sketch follows below.
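As a flavor of the approximate-query idea, here is a toy estimate of a mean over a uniform sample with a normal-approximation confidence interval. BlinkDB itself maintains stratified samples and uses more careful error estimation, so treat this purely as a sketch of the trade-off.

```python
import random
import statistics

# Approximate querying, toy sketch: answer "average order value" from
# a 1% uniform sample, reporting mean +/- 1.96 * stderr as a ~95%
# confidence interval under the normal approximation.

random.seed(7)
table = [random.expovariate(1 / 50.0) for _ in range(1_000_000)]  # full column

sample = random.sample(table, 10_000)     # scan 1% of the rows
mean = statistics.fmean(sample)
stderr = statistics.stdev(sample) / len(sample) ** 0.5

print(f"estimate: {mean:.2f} +/- {1.96 * stderr:.2f}")
print(f"exact:    {statistics.fmean(table):.2f}")  # inside the bar w.h.p.
```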

Real Time

Druid – a real time OLAP data store, operationalized for time series analytics.

Pinot – LinkedIn’s OLAP data store, very similar to Druid.

Data Analysis

The analysis tools range from declarative languages like SQL to procedural languages like Pig. Libraries, on the other hand, provide out-of-the-box implementations of the most common data mining and machine learning algorithms.

Tools

Pig – provides a good overview of Pig Latin.

Pig – provides an introduction to how to build data pipelines using Pig.

Hive – provides an introduction to Hive.

Hive – another good paper to understand the motivations behind Hive at Facebook.

Phoenix – SQL on HBase.

Join Algorithms for Map Reduce – provides a great introduction to different join algorithms on Hadoop; a toy repartition join follows below.

Join Algorithms for Map Reduce – another great paper on the different join techniques.
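To make one of these algorithms concrete, here is a toy in-process repartition (reduce-side) join: records from both inputs are tagged with their source, grouped on the join key (the role the shuffle plays), and paired within each group. A broadcast (map-side) join instead ships the small table to every mapper and avoids the shuffle entirely.

```python
from collections import defaultdict

# Repartition (reduce-side) join, toy sketch: tag each record with its
# source side, group by join key, then cross the two sides per key.

users = [(1, "ann"), (2, "bob")]                # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "mug")]  # (user_id, item)

groups = defaultdict(lambda: ([], []))
for uid, name in users:
    groups[uid][0].append(name)                 # side 0: users
for uid, item in orders:
    groups[uid][1].append(item)                 # side 1: orders

joined = [(uid, name, item)                     # "reduce": pair the sides
          for uid, (names, items) in groups.items()
          for name in names for item in items]

print(sorted(joined))
# [(1, 'ann', 'book'), (1, 'ann', 'pen'), (2, 'bob', 'mug')]
```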

Libraries

MLlib – machine learning framework on Spark.

SparkR – Distributed R on Spark framework.

Mahout – Machine learning framework on traditional Map Reduce.

Data Integration

Data integration frameworks provide good mechanisms to ingest and export data between Big Data systems. They range from orchestration pipelines to metadata frameworks with support for lifecycle management and governance.

Ingest/Messaging

Flume – a framework for collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

Sqoop – a tool to move data between Hadoop and Relational data stores.

Kafka – distributed messaging system for data processing.

ETL/Workflow

Crunch – library for writing, testing, and running MapReduce pipelines.

Falcon – data management framework that helps automate movement and processing of Big Data.

Cascading – data manipulation through scripting.

Oozie – a workflow scheduler system to manage Hadoop jobs.

Metadata

HCatalog - a table and storage management layer for Hadoop.

Serialization

ProtocolBuffers – language neutral serialization format popularized by Google.

Avro – modeled around Protocol Buffers for the Hadoop ecosystem.
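As a taste of what these formats do on the wire, here is a sketch of the base-128 varint encoding Protocol Buffers uses for integer fields: seven payload bits per byte, least-significant group first, with the high bit of each byte marking "more bytes follow".

```python
# Base-128 varint, as used by Protocol Buffers for integer fields.

def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F            # low 7 payload bits
        n >>= 7
        out.append(byte | (0x80 if n else 0))  # MSB = continuation flag
        if not n:
            return bytes(out)

def decode_varint(data: bytes) -> int:
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:        # continuation flag clear: last byte
            return n

assert encode_varint(300) == b"\xac\x02"  # the example from the protobuf docs
assert decode_varint(b"\xac\x02") == 300
print(encode_varint(1), encode_varint(300))  # b'\x01' b'\xac\x02'
```

Small values cost one byte instead of a fixed four or eight, which is one reason numeric-heavy records stay compact under these formats.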

Operational Frameworks

Finally, the operational frameworks provide capabilities for metrics, benchmarking and performance optimization to manage workloads.

Monitoring Frameworks

OpenTSDB – a time series metrics system built on top of HBase.

Ambari – system for collecting, aggregating and serving Hadoop and system metrics.

Benchmarking

YCSB – performance evaluation of NoSQL systems.

GridMix – provides a benchmark for Hadoop workloads by running a mix of synthetic jobs.

Background on Big Data benchmarking, with the key challenges associated.

Summary

I hope the papers prove useful as you embark on, or strengthen, your journey. I am sure there are a few hundred more papers that I might have inadvertently missed, and a whole bunch of systems I might be unfamiliar with - apologies in advance, as I don't mean to offend anyone, though I am happy to be educated...

