『Big data technologies』A brief introduction to big data technology concepts (translated from Quora)

Published 2016/07/18 10:35


I'll try to give a very crude overview of how the pieces fit together, because the details span multiple books. Please forgive me for some oversimplifications.

  • MapReduce is the Google paper that started it all (Page on googleusercontent.com). It's a paradigm for writing distributed code, inspired by some elements of functional programming. You don't have to do things this way, but it neatly fits a lot of the problems we try to solve in a distributed way. The Google internal implementation is called MapReduce, and Hadoop is its open-source implementation. Amazon's hosted Hadoop service is called Elastic MapReduce (EMR) and has plugins for multiple languages.
  • HDFS (Hadoop Distributed File System) is an implementation inspired by the Google File System (GFS). It stores files across a bunch of machines when the data is too big for one. Hadoop consumes data stored in HDFS.
  • Apache Spark is an emerging platform that has more flexibility than MapReduce but more structure than a basic message passing interface. It relies on the concept of distributed data structures (what it calls RDDs) and operators on them. For more, see the Apache Software Foundation's site.
  • Because Spark is a fairly low-level engine, it has higher-level libraries to make it more accessible to data scientists. The machine learning library built on top of it is called MLlib, and there's a distributed graph library called GraphX.
  • Pregel and its open-source twin Giraph are a way to run graph algorithms on billions of nodes and trillions of edges over a cluster of machines. Notably, the MapReduce model is not well suited to graph processing, so Hadoop/MapReduce are avoided in this model, but HDFS/GFS is still used as the data store.
  • Zookeeper is a coordination and synchronization service that helps a distributed set of computers make decisions by consensus, handle failures, etc.
  • Flume and Scribe are logging services. Flume is an Apache project and Scribe is an open-source Facebook project. Both aim to make it easy to collect tons of logged data, analyze it, tail it, move it around and store it in a distributed store.
  • Google BigTable and its open-source twin HBase were meant to be read-write distributed databases. BigTable was originally built for the Google Crawler. They sit on top of GFS/HDFS and MapReduce/Hadoop. Google Research Publication: BigTable
  • Hive and Pig are abstractions on top of Hadoop designed to help with the analysis of tabular data stored in a distributed file system (think of Excel sheets too big to store on one machine). They operate on top of a data warehouse, so the high-level idea is to dump data once and then analyze it by reading and processing it, rather than by updating individual cells, rows and columns. Hive has a language similar to SQL, while Pig is inspired by Google's Sawzall (Google Research Publication: Sawzall). You generally don't update a single cell in a table when processing it with Hive or Pig.
  • Hive and Pig turned out to be slow because they were built on Hadoop, which optimizes for the volume of data moved around, not latency. To get around this, engineers bypassed Hadoop and went straight to HDFS. They also threw in some memory and caching, and this resulted in Google's Dremel (Dremel: Interactive Analysis of Web-Scale Datasets), F1 (F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business), Facebook's Presto (Presto | Distributed SQL Query Engine for Big Data), Apache Spark SQL (Page on apache.org), Cloudera Impala (Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real), Amazon's Redshift, etc. They all have slightly different semantics but are essentially meant to be programmer- or analyst-friendly abstractions for analyzing tabular data stored in distributed data warehouses.
  • Mahout (Scalable machine learning and data mining) is a collection of machine learning libraries written in the MapReduce paradigm, specifically for Hadoop. Google has its own internal version, but they haven't published a paper on it as far as I know.
  • Oozie is a workflow scheduler. The oversimplified description is that it puts together a pipeline of the tools described above. For example, you can write an Oozie script that copies your production HBase data to a Hive warehouse nightly, after which a Mahout script trains on this data. At the same time, you might use Pig to pull the test set into another file, and when Mahout is done creating a model you can pass the test data through the model and get results. You specify the dependency graph of these tasks through Oozie (I may be messing up the terminology, since I've never used Oozie but have used the Facebook equivalent).
  • Lucene is a bunch of search-related and NLP tools, but its core feature is being a search index and retrieval system. It takes data from a store like HBase and indexes it for fast retrieval from a search query. Solr uses Lucene under the hood to provide a convenient REST API for indexing and searching data. Elasticsearch is similar to Solr.
  • Sqoop is a command-line interface for backing up SQL data to a distributed warehouse. It's what you might use to snapshot and copy your database tables to a Hive warehouse every night.
  • Hue is a web-based GUI to a subset of the above tools - http://gethue.com/
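
To make the MapReduce bullet above a bit more concrete, here is a minimal single-machine sketch of the paradigm in plain Python. It is not Hadoop's API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real framework would run the map and reduce calls on many machines and do the grouping step for you.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the framework's job)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over an iterable of strings."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["a b a", "b c"])` counts words across both inputs. The point of the split is that `map_phase` only ever sees one record and `reduce_phase` only ever sees one key, which is what lets a framework spread the work across a cluster.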


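The Pregel/Giraph bullet above describes a vertex-centric model: in each "superstep" every vertex reads the messages sent to it, updates its own value, and sends messages to its neighbours. Below is a rough single-process sketch of that idea in plain Python, using the textbook task of spreading the maximum value through a graph. The function name `pregel_max` and the dictionary-based graph format are assumptions for illustration, not Giraph's API.

```python
def pregel_max(graph, initial_values):
    """Propagate the largest value to every reachable vertex.

    graph: dict mapping each vertex to a list of its out-neighbours
           (every neighbour must also appear as a key).
    initial_values: dict mapping each vertex to a number.
    """
    values = dict(initial_values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # no incoming messages, so the vertex stays idle
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(best)
            # Otherwise the vertex sends nothing, i.e. it votes to halt.
        inbox = outbox

    return values
```

For example, with `graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}` and starting values `{"a": 3, "b": 6, "c": 2}`, every vertex ends up holding the maximum. In a real system the vertices are partitioned across machines and the messages travel over the network between supersteps.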

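  • MapReduce: a Google paper. It is a paradigm for writing distributed code, inspired by some elements of functional programming. Google's internal implementation is called MapReduce, and Hadoop is an open-source implementation of it. Amazon's hosted Hadoop service is called EMR (Elastic MapReduce).
  • HDFS: an implementation inspired by GFS (Google File System), used to store files across a cluster. Hadoop uses HDFS for storage.
  • Apache Spark: a platform that is more flexible than MapReduce but more structured than basic message passing. It relies on the concepts of distributed data structures (RDDs) and operators.
  • Because Spark is fairly low level, it has higher-level libraries that make it more accessible to data scientists. The machine learning library built on top of it is called MLlib, and there is a distributed graph library called GraphX.
  • Pregel and its open-source implementation Giraph: a way to run graph algorithms over billions of nodes and trillions of edges on a cluster of machines. Notably, the MapReduce model is a poor fit for graph processing, so Hadoop/MapReduce is not used here, but HDFS/GFS still serves as the data store.
  • Zookeeper: a coordination and synchronization service for a distributed set of computers. It lets them make decisions by consensus, handle failures, and so on.
  • Flume and Scribe: logging services. Flume is an Apache project and Scribe is an open-source Facebook project. Both aim to make it easy to collect large amounts of log data and then analyze it, tail it, move it around and store it in a distributed store.
  • Google BigTable and its open-source implementation HBase: read-write distributed databases, originally built for the Google Crawler, sitting on top of GFS/HDFS and MapReduce/Hadoop.
  • Hive and Pig: abstractions on top of Hadoop, designed for analyzing tabular data stored in a distributed file system. They operate on top of a data warehouse, so the high-level idea is to load the data once and then analyze it by reading and processing it, instead of modifying individual cells, rows and columns. Hive has a SQL-like language, and Pig is inspired by Google's Sawzall. When processing data with Hive or Pig, you generally don't update a single cell in a table.
  • Because they are built on Hadoop, which optimizes for the volume of data processed rather than latency, Hive and Pig turned out to be slow. To get around this, engineers bypassed Hadoop and went straight to HDFS.
  • Mahout: a collection of machine learning libraries written in the MapReduce paradigm.
  • Oozie: a workflow scheduler. Put simply, it assembles the tools described above into a pipeline. For example, you can write an Oozie script that extracts data from HBase into a Hive data warehouse, and then have a Mahout script train on that data. At the same time, you can use Pig to pull the test set into another file, and once Mahout has created a model you can pass the test data through the model and get results. You specify the dependency graph of these tasks through Oozie.
  • Lucene: a set of search-related and NLP tools whose core feature is a search index and retrieval system. It takes data from a store (such as HBase) and indexes it so that results for a search query can be retrieved quickly. Solr uses Lucene under the hood to provide a convenient REST API for indexing and searching data. Elasticsearch is similar to Solr.
  • Sqoop: a command-line interface for backing up SQL data to a distributed warehouse. It is used to snapshot and copy your database tables into a Hive warehouse.
  • Hue: a web-based GUI for a subset of the tools above.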
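
The Lucene bullet above talks about a "search index". The core data structure behind that idea is an inverted index: a map from each term to the set of documents containing it. Here is a toy sketch in plain Python. It is not Lucene's API, the names `build_index` and `search` are made up, and it skips everything a real engine does around tokenization, ranking and on-disk storage.

```python
from collections import defaultdict


def build_index(documents):
    """Build an inverted index: term -> set of ids of documents containing it.

    documents: dict mapping a document id to its text.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

A query is answered by looking up each term and intersecting the resulting sets, so the cost depends on how many documents match the terms rather than on how many documents exist in total. That is why indexing up front makes retrieval fast.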

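The Oozie bullet above boils down to "declare which task depends on which, and let the scheduler work out the order". As a rough illustration, here is that idea using Python's standard `graphlib` module (Python 3.9+). This is not Oozie's workflow format, which is defined in XML. The task names and the exact dependency edges are my own assumptions based on the nightly pipeline described in the bullet.

```python
from graphlib import TopologicalSorter

# Hypothetical nightly pipeline: each task maps to the set of tasks it needs.
workflow = {
    "export_hbase_to_hive": set(),
    "train_model": {"export_hbase_to_hive"},
    "extract_test_set": {"export_hbase_to_hive"},
    "evaluate_model": {"train_model", "extract_test_set"},
}


def execution_order(dependencies):
    """Return one valid order in which the tasks can run.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

`execution_order(workflow)` always puts the export first and the evaluation last. Training and test-set extraction have no dependency on each other, so a real scheduler would be free to run them at the same time.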