It is important to remember that the Hadoop ecosystem is designed with a parallel environment in mind. When purchasing processors, we do not recommended getting the highest GHz chips, which draw high watts (130+). This will cause two problems: higher consumption of power and greater heat expulsion. The mid-range models tend to offer the best bang for the buck in terms of GHz, price, and core count.
When we encounter applications that produce large amounts of intermediate data — outputting data on the same order as the amount read in — we recommend two ports on a single Ethernet card or two channel-bonded Ethernet cards to provide 2 Gbps per machine. Bonded 2Gbps is tolerable for up to about 12TB of data per nodes. Once you move above 12TB, you will want to move to bonded 4Gbps(4x1Gbps). Alternatively, for customers that have already moved to 10 Gigabit Ethernet or Infiniband, these solutions can be used to address network-bound workloads. Confirm that your operating system and BIOS are compatible if you’re considering switching to 10 Gigabit Ethernet.
When computing memory requirements, remember that Java uses up to 10 percent of it for managing the virtual machine. We recommend configuring Hadoop to use strict heap size restrictions in order to avoid memory swapping to disk. Swapping greatly impacts MapReduce job performance and can be avoided by configuring machines with more RAM, as well as setting appropriate kernel settings on most Linux distributions.
It is also important to optimize RAM for the memory channel width. For example, when using dual-channel memory, each machine should be configured with pairs of DIMMs. With triple-channel memory each machine should have triplets of DIMMs. Similarly, quad-channel DIMM should be in groups of four.
Hadoop is far bigger than HDFS and MapReduce; it’s an all-encompassing data platform. For that reason, CDH includes many different ecosystem products (and, in fact, is rarely used solely for MapReduce). Additional software components to consider when sizing your cluster include Apache HBase, Cloudera Impala, and Cloudera Search. They should all be run on the DataNode process to maintain data locality.
Hadoop不只有HDFS和MapRecude。他是一个全面的数据平台。CDH包含了许多不同的程序（事实上，很少单独的用MapReduce。其他软件也需要被考虑在集群内，包括Hbase，Cloudera Impala，Cloudera Search。他们应该运行在DataNode进程上就地维护数据。
HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access. Cloudera Search solves the need for full text search on content stored in CDH to simplify access for new types of users, but also open the door for new types of data storage inside Hadoop. Cloudera Search is based on Apache Lucene/Solr Cloud and Apache Tika and extends valuable functionality and flexibility for search through its wider integration with CDH. The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and HBase without requiring data movement or transformation.
HBase users should be aware of heap-size limits due to garbage collector (GC) timeouts. Other JVM column stores also face this issue. Thus, we recommend a maximum of ~16GB heap per Region Server. HBase does not require too many other resources to run on top of Hadoop, but to maintain real-time SLAs you should use schedulers such as fair and capacity along with Linux Cgroups.
Impala uses memory for most of its functionality, consuming up to 80 percent of available RAM resources under default configurations, so we recommend at least 96GB of RAM per node. Users that run Impala alongside MapReduce should consult our recommendations in “Configuring Impala and MapReduce for Multi-tenant Performance.” It is also possible to specify a per-process or per-query memory limit for Impala.
Search is the most interesting component to size. The recommended sizing exercise is to purchase one node, install Solr and Lucene, and load your documents. Once the documents are indexed and searched in the desired manner, scalability comes into play. Keep loading documents until the indexing and query latency exceed necessary values to the project — this will give you a baseline for max documents per node based on available resources and a baseline count of nodes not including and desired replication factor.
Purchasing appropriate hardware for a Hadoop cluster requires benchmarking and careful planning to fully understand the workload. However, Hadoop clusters are commonly heterogeneous and Cloudera recommends deploying initial hardware with balanced specifications when getting started. It is important to remember when using multiple ecosystem components resource usage will vary and focusing on resource management will be your key to success.
We encourage you to chime in about your experience configuring production Hadoop clusters in comments!
Kevin O’Dell is a Systems Engineer at Cloudera.