
[Translation] Selecting the Right Hardware for Your New Hadoop Cluster (Part 3)

jeff-qq
Published on 2017/03/10 10:07

Continued from the previous post: https://my.oschina.net/u/234661/blog/855913

Other Considerations

It is important to remember that the Hadoop ecosystem is designed with a parallel environment in mind. When purchasing processors, we do not recommend getting the highest-GHz chips, which draw high wattage (130W+). This causes two problems: higher power consumption and greater heat output. The mid-range models tend to offer the best bang for the buck in terms of GHz, price, and core count.

When we encounter applications that produce large amounts of intermediate data (outputting data on the same order as the amount read in), we recommend two ports on a single Ethernet card, or two channel-bonded Ethernet cards, to provide 2Gbps per machine. Bonded 2Gbps is tolerable for up to about 12TB of data per node. Once you move above 12TB per node, you will want to move to bonded 4Gbps (4x1Gbps). Alternatively, for customers that have already moved to 10 Gigabit Ethernet or InfiniBand, these solutions can be used to address network-bound workloads. Confirm that your operating system and BIOS are compatible if you're considering switching to 10 Gigabit Ethernet.
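
As a rough illustration of the rule of thumb above, the sketch below turns the 12TB-per-node threshold into a small sizing helper. The thresholds come from the paragraph itself; the function name and structure are hypothetical, not part of any Hadoop tool.

```python
def recommended_network(tb_per_node: float) -> str:
    """Suggest a network setup for a shuffle-heavy Hadoop node.

    Rule of thumb from the article: bonded 2Gbps is tolerable up to
    about 12TB per node; above that, move to bonded 4Gbps (4x1Gbps)
    or to 10GbE / InfiniBand for network-bound workloads.
    """
    if tb_per_node <= 12:
        return "2x1Gbps bonded (2 Gbps per machine)"
    return "4x1Gbps bonded (4 Gbps), or 10GbE / InfiniBand if available"

for tb in (6, 12, 24):
    print(f"{tb} TB/node -> {recommended_network(tb)}")
```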

When computing memory requirements, remember that Java uses up to 10 percent of it for managing the virtual machine. We recommend configuring Hadoop with strict heap-size restrictions in order to avoid memory swapping to disk. Swapping greatly impacts MapReduce job performance; it can be avoided by configuring machines with more RAM, as well as by setting appropriate kernel parameters on most Linux distributions.
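
The 10-percent JVM overhead turns into simple arithmetic when budgeting heap sizes. Here is a minimal, hypothetical back-of-the-envelope helper; the 10% figure is from the text, while the OS reserve is an assumption of this sketch. On Linux, the kernel setting usually tuned to discourage swapping is vm.swappiness, though the article does not name it.

```python
def usable_heap_gb(ram_gb: float, os_reserve_gb: float = 4.0) -> float:
    """Estimate total heap that can be handed to Hadoop daemons and tasks
    without risking swap. Java uses up to ~10% of its memory to manage
    the JVM itself (per the article); the 4GB OS reserve is an assumption.
    """
    available = ram_gb - os_reserve_gb
    return available * 0.90  # leave ~10% for JVM management overhead

print(f"64GB node -> ~{usable_heap_gb(64):.0f}GB of heap to distribute")
```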

It is also important to optimize RAM for the memory channel width. For example, when using dual-channel memory, each machine should be configured with pairs of DIMMs. With triple-channel memory, each machine should have triplets of DIMMs. Similarly, quad-channel machines should have their DIMMs in groups of four.
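
A quick, hypothetical check that a planned DIMM population matches the channel width described above (pairs for dual-channel, triplets for triple-channel, groups of four for quad-channel):

```python
def dimm_plan(total_dimms: int, channels: int) -> str:
    """Verify DIMMs are installed in full per-channel groups."""
    if total_dimms % channels == 0:
        return f"OK: {total_dimms // channels} DIMM(s) per channel"
    return (f"Unbalanced: {total_dimms} DIMMs cannot fill "
            f"{channels}-channel groups evenly")

print(dimm_plan(8, 4))  # quad-channel: two full groups of four -> OK
print(dimm_plan(6, 4))  # leaves a group incomplete -> unbalanced
```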

More Than MapReduce

Hadoop is far bigger than HDFS and MapReduce; it is an all-encompassing data platform. For that reason, CDH includes many different ecosystem products (and, in fact, is rarely used solely for MapReduce). Additional software components to consider when sizing your cluster include Apache HBase, Cloudera Impala, and Cloudera Search. They should all be run alongside the DataNode process to maintain data locality.

HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access. Cloudera Search solves the need for full-text search on content stored in CDH, which simplifies access for new types of users but also opens the door to new types of data storage inside Hadoop. Cloudera Search is based on Apache Lucene/Solr Cloud and Apache Tika, and it extends valuable functionality and flexibility for search through its wider integration with CDH. The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and HBase without requiring data movement or transformation.

HBase users should be aware of heap-size limits due to garbage collector (GC) timeouts; other JVM column stores also face this issue. Thus, we recommend a maximum of ~16GB heap per Region Server. HBase does not require many other resources to run on top of Hadoop, but to maintain real-time SLAs you should use schedulers, such as the fair and capacity schedulers, along with Linux cgroups.
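
A trivial sketch of that heap ceiling as a sanity check; the ~16GB figure is the article's rule of thumb, everything else is hypothetical. In practice the ceiling is typically applied via the HBASE_HEAPSIZE setting in hbase-env.sh, though the article does not cover configuration details.

```python
MAX_REGION_SERVER_HEAP_GB = 16  # article's rule of thumb (GC timeouts)

def check_region_server_heap(heap_gb: float) -> str:
    if heap_gb > MAX_REGION_SERVER_HEAP_GB:
        return (f"WARNING: {heap_gb}GB heap risks long GC pauses; keep it "
                f"at or below ~{MAX_REGION_SERVER_HEAP_GB}GB")
    return f"{heap_gb}GB heap is within the recommended ceiling"

print(check_region_server_heap(24))
print(check_region_server_heap(12))
```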

Impala uses memory for most of its functionality, consuming up to 80 percent of available RAM resources under default configurations, so we recommend at least 96GB of RAM per node. Users that run Impala alongside MapReduce should consult our recommendations in "Configuring Impala and MapReduce for Multi-tenant Performance." It is also possible to specify a per-process or per-query memory limit for Impala.
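
The 80-percent default and the 96GB floor combine into simple arithmetic, sketched below. The fraction mirrors the kind of per-process limit the article mentions (for example, the value passed to impalad's --mem_limit flag); the helper function itself is hypothetical.

```python
def impala_mem_budget_gb(ram_gb: float, fraction: float = 0.80) -> float:
    """Memory Impala may consume: up to ~80% of available RAM under
    default configurations, per the article."""
    return ram_gb * fraction

ram = 96  # article's minimum recommended RAM per Impala node
print(f"{ram}GB node -> Impala may use up to ~{impala_mem_budget_gb(ram):.0f}GB")
```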

Search is the most interesting component to size. The recommended sizing exercise is to purchase one node, install Solr and Lucene, and load your documents. Once the documents are indexed and searched in the desired manner, scalability comes into play. Keep loading documents until the indexing and query latency exceed the values required by the project; this gives you a baseline for the maximum documents per node based on available resources, and a baseline node count not including the desired replication factor.
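
The single-node exercise above yields two numbers: the document count at which latency exceeded your targets, and from it a node-count baseline. Here is a hypothetical sketch of that extrapolation (the replication handling is an assumption layered on top of the article's baseline, which excludes it):

```python
import math

def search_node_baseline(total_docs: int, max_docs_per_node: int,
                         replication_factor: int = 1) -> int:
    """Extrapolate the single-node load test to a cluster baseline.
    max_docs_per_node is where indexing/query latency exceeded targets
    on the test node; replication multiplies the node count."""
    return math.ceil(total_docs / max_docs_per_node) * replication_factor

# Hypothetical numbers: a 50M-document corpus, 10M docs/node measured,
# and 2x replication for availability.
print(search_node_baseline(50_000_000, 10_000_000, replication_factor=2))
```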

Conclusion

Purchasing appropriate hardware for a Hadoop cluster requires benchmarking and careful planning to fully understand the workload. However, Hadoop clusters are commonly heterogeneous, and Cloudera recommends deploying initial hardware with balanced specifications when getting started. It is important to remember that when using multiple ecosystem components, resource usage will vary, and focusing on resource management will be your key to success.

We encourage you to chime in about your experience configuring production Hadoop clusters in the comments!

Kevin O’Dell is a Systems Engineer at Cloudera.

© Copyright belongs to the author.
