Hadoop Hardware Infrastructure Considerations
©2013 OpalSoft Big Data
Hardware Considerations
Original design idea: Hadoop is designed and developed to run on ordinary commodity hardware.
Actual production environment: large production Hadoop clusters require careful hardware infrastructure planning to keep latency minimal when storing and processing large volumes of data.
The hardware architecture should be designed around the nature of the data, the jobs that will run, and the agreed SLA.
Hardware Considerations
Key aspects of Hadoop hardware infrastructure:
– Servers (NameNode, JobTracker, DataNode)
– Racks
– Network switches
– Storage
– Backup
– Number of copies of data
NameNode & JobTracker Servers
RAM should be sized based on the following (see the sizing sketch below):
– Number of DataNodes in the cluster
– Approximate number of blocks that will be stored in the cluster
– Number of different Hadoop processes that run on the machine
I/O adapters – not a critical element, as the NameNode does not participate in data transfers
Processor – at minimum, multi-core processors
Standby node server – should have the same capacity as the primary NameNode
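As a rough illustration of the RAM sizing above: a commonly cited rule of thumb is about 150 bytes of NameNode heap per namespace object (file, directory, or block). The Python sketch below applies that rule; the workload numbers are made-up examples, and actual usage varies with Hadoop version and path lengths.

```python
# Back-of-the-envelope NameNode heap estimate.
# Assumption: ~150 bytes of heap per namespace object (file,
# directory, or block) -- a rule of thumb, not a guarantee.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files, num_dirs, avg_blocks_per_file):
    """Estimate NameNode heap (GB) for a given namespace size."""
    objects = num_files + num_dirs + num_files * avg_blocks_per_file
    return objects * BYTES_PER_OBJECT / 1024**3

# Hypothetical cluster: 50M files, 1M directories, ~1.5 blocks/file
print(f"{namenode_heap_gb(50_000_000, 1_000_000, 1.5):.1f} GB")  # ~17.6 GB
```

Provision extra headroom on top of the estimate for garbage collection and namespace growth.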
DataNode Servers
RAM should be sized based on the following:
– Approximate number of blocks that will be stored on the node
– Number of different Hadoop processes that run on the machine
I/O adapters – a high-throughput I/O adapter is needed
Processor – multi-core, multi-processor servers are needed to execute more than one MapReduce task in parallel
Virtualization is not recommended
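To complement the RAM guidance, here is a capacity sketch for the DataNode tier. It assumes a replication factor of 3 and roughly 25% of raw space reserved for MapReduce intermediate output; both are common planning defaults, not fixed requirements, and the disk layout is a made-up example.

```python
import math

def raw_capacity_tb(logical_tb, replication=3, temp_overhead=0.25):
    """Raw disk (TB) needed cluster-wide for a logical dataset size."""
    return logical_tb * replication * (1 + temp_overhead)

def data_nodes_needed(logical_tb, disks_per_node=12, tb_per_disk=2):
    """Node count, assuming a hypothetical 12 x 2 TB disk layout."""
    return math.ceil(raw_capacity_tb(logical_tb) / (disks_per_node * tb_per_disk))

print(raw_capacity_tb(200))    # 200 TB of data -> 750 TB raw
print(data_nodes_needed(200))  # -> 32 nodes at 24 TB raw each
```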
Racks
Hadoop is rack aware: configure Hadoop with each node's rack information (see the topology-script sketch below)
Servers should be distributed across at least two racks to prevent data loss due to a rack failure
Hadoop automatically replicates blocks across servers in two different racks
Servers in the same rack have lower data-transfer latency, because all transfers go through the rack's network switch
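Rack awareness is typically enabled by pointing net.topology.script.file.name in core-site.xml (topology.script.file.name on older 1.x releases) at a script that maps each host to a rack path. A minimal sketch follows; the subnet-to-rack mapping is a made-up example.

```python
#!/usr/bin/env python3
# Minimal rack-topology script. Hadoop invokes it with one or more
# hostnames/IPs as arguments and expects one rack path per line.
import sys

RACKS = {
    "10.1.1.": "/dc1/rack1",  # hypothetical subnet -> rack mapping
    "10.1.2.": "/dc1/rack2",
}
DEFAULT = "/default-rack"     # Hadoop's fallback rack name

for host in sys.argv[1:]:
    rack = next((r for prefix, r in RACKS.items() if host.startswith(prefix)),
                DEFAULT)
    print(rack)
```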
Network Switches
A separate private network for the Hadoop cluster is recommended
Both core and rack switches should support high-bandwidth, full-duplex data transfer
Higher-capacity core and rack switches will be required if the number of data copies exceeds the standard 3 replicas
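To see why extra replicas raise switch requirements, here is a hedged back-of-the-envelope estimate. It assumes the default replica placement, where the write pipeline sends the second copy to a remote rack and keeps the third copy in that same rack, so each block crosses the core switch roughly once; replicas beyond three are assumed to add one crossing each, and client-to-cluster traffic is ignored.

```python
def core_switch_gbps(ingest_gbps, replication=3):
    """Rough sustained core-switch load generated by HDFS writes."""
    # Replica 2 crosses the core; replica 3 stays in the remote rack.
    # Replicas beyond 3 are assumed to land on further racks.
    cross_rack_copies = 1 + max(0, replication - 3)
    return ingest_gbps * cross_rack_copies

print(core_switch_gbps(4))     # 3x replication -> ~4 Gb/s at the core
print(core_switch_gbps(4, 5))  # 5x replication -> ~12 Gb/s at the core
```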
Storage
Locally attached storage provides better performance than NFS or SAN storage
Hard disks with higher RPM provide better read/write throughput
Using a larger number of smaller-capacity disks instead of a single large disk allows concurrent reads/writes and reduces disk-level bottlenecks (see the sketch below)
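A quick illustration of the many-small-disks point: aggregate sequential throughput, and the number of concurrent readers a node can serve, scales roughly with spindle count. The per-disk rate below is an assumed figure for a typical SATA disk.

```python
PER_DISK_MBPS = 120  # assumed sustained sequential rate per SATA disk

one_large_disk = 1 * PER_DISK_MBPS     # 1 x 12 TB disk
many_small_disks = 12 * PER_DISK_MBPS  # 12 x 1 TB disks, same raw capacity

print(one_large_disk, many_small_disks)  # 120 vs 1440 MB/s per node
```

In HDFS, each disk would be listed as a separate directory under dfs.datanode.data.dir so block I/O is spread across all spindles.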
NameNode and JobTracker servers should use a highly fault-tolerant RAID configuration
The DataNode RAID configuration is not as critical, since data is already replicated across multiple servers
Using SSDs improves performance drastically at the expense of higher setup cost
Backup
NameNode data is the most critical information and needs the most frequent backup
NameNode data is regularly streamed to a standby node so that cluster operation can be restored if the primary NameNode fails
An additional backup server is recommended for periodic backup and checksum verification of NameNode data (a sketch follows)
Whether DataNodes need backup depends on the criticality and availability requirements of the data; frequent backups are not required, though a regular backup is needed to recover from data-center-level failures
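A sketch of such a periodic backup, assuming Hadoop 2 or later, where hdfs dfsadmin -fetchImage downloads the most recent fsimage from the NameNode; the backup location and the use of SHA-256 are illustrative choices, not prescribed by Hadoop.

```python
import hashlib
import subprocess
from datetime import datetime
from pathlib import Path

BACKUP_ROOT = Path("/backup/namenode")  # hypothetical backup location

def backup_fsimage():
    """Fetch the latest fsimage and record a checksum for verification."""
    dest = BACKUP_ROOT / datetime.now().strftime("%Y%m%d-%H%M%S")
    dest.mkdir(parents=True)
    subprocess.run(["hdfs", "dfsadmin", "-fetchImage", str(dest)], check=True)
    for image in dest.glob("fsimage_*"):
        digest = hashlib.sha256(image.read_bytes()).hexdigest()
        image.with_suffix(".sha256").write_text(f"{digest}  {image.name}\n")

if __name__ == "__main__":
    backup_fsimage()  # run from cron or a scheduler on the backup server
```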
Number of Copies of Data
The following determine the number of copies of a block:
– Criticality of the data
– Number of concurrent MapReduce jobs that will be executed on the data set
More replicas allow more jobs to run concurrently on the same data set
More replicas reduce job execution time, since the required data is more often available locally or at least in the same rack
Be aware that a higher number of replicas hurts the write performance of the cluster
Raising the replication factor only for the most frequently used data provides the maximum benefit, rather than raising the general replication factor across the cluster (see the sketch below)
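A sketch of that selective approach, using the standard hdfs dfs -setrep command to raise replication only on hot paths; the paths and factors below are made-up examples.

```python
import subprocess

HOT_PATHS = {"/warehouse/clickstream/current": 5}  # hypothetical hot data
# dfs.replication stays at the cluster-wide default (3) for everything else.

for path, factor in HOT_PATHS.items():
    # -w blocks until re-replication completes
    subprocess.run(["hdfs", "dfs", "-setrep", "-w", str(factor), path],
                   check=True)
```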