The Design of HDFS

杨尚川
Published 2015/04/08 06:05
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let’s examine this statement in more detail.

Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
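
Below is a minimal sketch of this read-many, streaming access pattern using Hadoop's Java FileSystem API. The namenode URI, file path, and class name are placeholders of my own, not taken from the text.

```java
// Minimal sketch: open an HDFS file and stream it sequentially from start to end.
// The namenode URI and file path below are placeholders.
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamDataset {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:8020/data/logs/2015-04-08.log"; // placeholder
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        InputStream in = null;
        try {
            // The whole file is read sequentially; throughput over the full
            // dataset matters more than the latency of the first record.
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```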

Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure. 

It is also worth examining the applications for which using HDFS does not work so well. Although this may change in the future, these are areas where HDFS is not a good fit today:

Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.

Lots of small files
Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. Although storing millions of files is feasible, billions is beyond the capability of current hardware.
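
The 300 MB figure follows directly from the rule of thumb above; a quick, illustrative calculation (class and variable names are mine, not from the text):

```java
// Back-of-the-envelope namenode memory estimate: one million files,
// one block each, ~150 bytes per namespace object (file or block).
public class NamenodeMemoryEstimate {
    public static void main(String[] args) {
        long bytesPerObject = 150L;      // rule of thumb from the text
        long files = 1_000_000L;         // one million files
        long blocks = files;             // each file occupies a single block
        long totalBytes = (files + blocks) * bytesPerObject;
        System.out.println(totalBytes / 1_000_000 + " MB"); // prints 300 MB
    }
}
```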

Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers or for modifications at arbitrary offsets in the file. (These might be supported in the future, but they are likely to be relatively inefficient.) 
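
A short sketch of the single-writer, append-only model using the same FileSystem API; the path is a placeholder, and the append() call assumes the cluster permits appends:

```java
// Sketch of the single-writer, append-only model: the file is created once,
// and later writes can only go at its end. The path is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWriterAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/events.log"); // placeholder

        // A single writer creates the file and writes sequentially.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("first record\n");
        }

        // Further writes can only be appended at the end of the file;
        // there is no way to modify bytes at an arbitrary offset.
        try (FSDataOutputStream out = fs.append(path)) {
            out.writeBytes("second record\n");
        }
    }
}
```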