1
How to monitor the $H!T out of Hadoop
Developing a comprehensive, open approach to monitoring Hadoop clusters
2
Relevant Hadoop Information
From 3 to 3000 nodes
Hardware/software failures are "common"
Redundant components: DataNode, TaskTracker
Non-redundant components: NameNode, JobTracker, SecondaryNameNode
Fast-evolving technology (best practices?)
3
Monitoring Software: Nagios
Red/yellow/green alerts, escalations
De facto standard, widely deployed
Text-based configuration
Web interface
Pluggable with shell scripts/external apps
  Return 0 = OK
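To make the "return 0 = OK" plugin contract concrete, here is a minimal sketch of a check written in Python. Nagios plugins report status through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of text. The function names and the 80/90 thresholds are assumptions chosen for illustration, not part of any shipped plugin.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios status code (thresholds are illustrative)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
        percent = 100.0 * usage.used / usage.total
    except (OSError, ZeroDivisionError) as exc:
        # Anything we cannot measure is reported as UNKNOWN, not OK.
        return UNKNOWN, f"DISK UNKNOWN - {path}: {exc}"
    code = classify(percent, warn, crit)
    return code, f"DISK {LABELS[code]} - {path} is {percent:.1f}% full"
```

A wrapper script would print the status line and call `sys.exit()` with the returned code, which is all Nagios needs to colour the service red, yellow, or green.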
4
Cacti
Performance graphing system
RRD/RRA front end
Slick web interface
Template system for graph types
Pluggable
  SNMP input
  Shell script/external program
6
hadoop-cacti-jtg
JMX fetching code with (kick-off) scripts
Cacti templates for Hadoop
Premade Nagios check scripts
Helper/batch/automation scripts
Apache License
7
Hadoop JMX
8
A Sample Cluster (Part 1)
NameNode & SecNameNode, JobTracker
Hardware RAID
8 GB RAM
1x quad core
DerbyDB (Hive) on SecNameNode
JobTracker: 8 GB RAM
9
A Sample Cluster (Part 2)
Slave (hadoopdata1-XXXX)
  JBOD, 8x 1 TB SATA disks
  16 GB RAM
  2x quad core
10
Prerequisites
Nagios (install)
  DAG RPMs
Cacti (install)
  Several RPMs
Liberal network access to the cluster
11
Alerts & Escalations
X nodes * Y services = less sleep
Define a policy
  Wake Me Up's (SMS)
  Don't Wake Me Up's ( )
  Review (daily, weekly, monthly)
12
Wake Me Up's
NameNode
  Disk full (big, big headache)
  RAID array issues (failed disk)
JobTracker
SecNameNode
  You do not realize it is not working until it is too late
13
Don't Wake Me Up's
Or "wake someone else up"
DataNode
  Warning: currently a failed disk will take down the DataNode (see Jira)
TaskTracker
Hardware
  Bad disk (start RMA)
Slaves are expendable (up to a point)
14
Monitoring Battle Plan
Start with the basics
  Ping, disk
Add Hadoop-specific alarms
  check_data_node
Add JMX graphing
  NameNodeOperations
Add JMX-based alarms
  FilesTotal > 1,000,000 or LiveNodes < 50%
15
The Basics: Nagios
Nagios (all nodes)
  Host up (ping check)
  Disk % full
  Swap > 85%
* Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
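The swap rule above can be sketched as a small check that reads the Linux `/proc/meminfo` format. This is an illustration only: the function names are made up, and the 85% threshold is simply the figure from the slide.

```python
def swap_percent_used(meminfo_text):
    """Compute swap usage in percent from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # sizes are reported in kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return 0.0  # no swap configured, so nothing can be "used"
    return 100.0 * (total - free) / total


def swap_status(percent, crit=85.0):
    """Return a Nagios-style (exit_code, status_line) pair for a swap percentage."""
    if percent > crit:
        return 2, f"SWAP CRITICAL - {percent:.1f}% used"
    return 0, f"SWAP OK - {percent:.1f}% used"
```

On a real node the input would come from `open("/proc/meminfo").read()`; keeping the parsing separate from the file read makes the rule easy to try against captured samples.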
16
The Basics: Cacti
Cacti (all nodes)
  CPU (full CPU)
  RAM/swap
  Network
  Disk usage
17
Disk Utilization
18
RAID Tools
hpacucli: not a Street Fighter move
Alerts on RAID events (NameNode)
  Disk failed
  Rebuilding
JBOD (DataNode)
  Failed drive
  Drive errors
Dell, Sun, vendor-specific tools
19
Before You Jump In
X nodes * Y checks = lots of work
About 3 nodes into the process: "Wait!!! I need some interns!!!"
Solution: S.I.C.C.T.
  Semi-Intelligent Configuration Cloning Tools
  (I made that up for this presentation)
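A configuration-cloning tool can be as small as a template expanded once per node. The sketch below generates Nagios host and service definitions for a list of slaves. The `generic-host` template, the `check_remote_datanode` and `check_remote_tasktracker` command names, and the addresses are assumptions for illustration; substitute whatever your Nagios install actually defines.

```python
HOST_TEMPLATE = """define host {{
    use        generic-host
    host_name  {name}
    address    {address}
}}
"""

SERVICE_TEMPLATE = """define service {{
    use                  generic-service
    host_name            {name}
    service_description  {description}
    check_command        {command}
}}
"""


def clone_config(hosts, services):
    """Expand the templates once per host and once per (host, service) pair."""
    blocks = []
    for name, address in hosts:
        blocks.append(HOST_TEMPLATE.format(name=name, address=address))
        for description, command in services:
            blocks.append(
                SERVICE_TEMPLATE.format(
                    name=name, description=description, command=command
                )
            )
    return "\n".join(blocks)


# Hypothetical slave list and checks; X nodes * Y checks with no interns.
slaves = [(f"hadoopdata{i}", f"10.0.0.{i}") for i in range(1, 4)]
checks = [
    ("datanode_http", "check_remote_datanode!50075"),
    ("tasktracker_http", "check_remote_tasktracker!50060"),
]
config_text = clone_config(slaves, checks)
```

Writing `config_text` to a file under the Nagios configuration directory and reloading Nagios would pick up all three hosts and six services at once.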
20
Nagios
Answers "IS IT RUNNING?"
Text-based configuration
21
Cacti
Answers "HOW WELL IS IT RUNNING?"
Web-based configuration
  php-cli tools
22
Monitoring Battle Plan Thus Far
Start with the basics (Done)
  Ping, disk
Add Hadoop-specific alarms
  check_data_node
Add JMX graphing
  NameNodeOperations
Add JMX-based alarms
  FilesTotal > 1,000,000 or LiveNodes < 50%
23
Add Hadoop Specific Alarms
Hadoop components with a web interface
  NameNode     50070
  JobTracker   50030
  TaskTracker  50060
  DataNode     50075
check_http + regex = simple + effective
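What "check_http + regex" amounts to is: fetch the component's status page and confirm that an expected string appears in it. The following is a rough Python equivalent of that idea, not a replacement for the real `check_http` plugin. The function names and the default path of `/` are assumptions.

```python
import http.client
import re
import urllib.request

OK, CRITICAL = 0, 2


def body_matches(body, pattern):
    """True if the regular expression occurs anywhere in the page body."""
    return re.search(pattern, body) is not None


def check_component(host, port, pattern, path="/", timeout=10):
    """Fetch http://host:port/path and return a Nagios-style (code, line) pair."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException) as exc:
        # Connection refused, timeout, HTTP error: the daemon is not answering.
        return CRITICAL, f"HTTP CRITICAL - {url}: {exc}"
    if body_matches(body, pattern):
        return OK, f"HTTP OK - pattern {pattern!r} found at {url}"
    return CRITICAL, f"HTTP CRITICAL - pattern {pattern!r} not found at {url}"
```

For example, `check_component("hadoopname1", 50070, "NameNode")` would report CRITICAL both when the port is closed and when something other than a NameNode page answers on it.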
24
nagios_check_commands.cfg
Component Failure

define command {
    command_name  check_remote_namenode
    command_line  $USER1$/check_http -H $HOSTADDRESS$ -u -p $ARG1$ -r NameNode
}

define service {
    service_description  check_remote_namenode
    use                  generic-service
    host_name            hadoopname1
    check_command        check_remote_namenode!50070
}

Component Failure (Future)
Newer Hadoop will have XML status
25
Monitoring Battle Plan
Start with the basics (Done)
  Ping, disk
Add Hadoop-specific alarms (Done)
  check_data_node
Add JMX graphing
  NameNodeOperations
Add JMX-based alarms
  FilesTotal > 1,000,000 or LiveNodes < 50%
26
JMX Graphing
Enable JMX
Import templates
27
JMX Graphing
28
JMX Graphing
29
JMX Graphing
31
Standard Java JMX
32
Monitoring Battle Plan Thus Far
Start with the basics (Done)
  Ping, disk
Add Hadoop-specific alarms (Done)
  check_data_node
Add JMX graphing (Done)
  NameNodeOperations
Add JMX-based alarms
  FilesTotal > 1,000,000 or LiveNodes < 50%
33
Add JMX-Based Alarms
hadoop-cacti-jtg is flexible
  Extend the fetch classes
  Don't call output()
  Write your own check logic
34
Quick JMX Base Walkthrough
url, user, pass, object: specified from the CLI
wantedVariables, wantedOperations: supplied by inheritance
fetch(), output(): provided
35
Extend for NameNode
36
Extend for Nagios
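The extension pattern from the walkthrough (a base class that provides fetch() and output(), with subclasses that name the wanted variables and add their own check logic) might look roughly like this. hadoop-cacti-jtg itself is Java and talks to a live JMX endpoint; what follows is a Python analogy with a stubbed reader. The class names, the `reader` callable, and the `DeadNodes` attribute are assumptions for illustration, not the project's API.

```python
class JMXBase:
    """Hypothetical stand-in for a JMX fetch base class."""

    wanted_variables = ()

    def __init__(self, reader):
        # `reader` is any callable that maps an attribute name to its value.
        # A real implementation would query the JMX endpoint here instead.
        self.reader = reader
        self.values = {}

    def fetch(self):
        """Read every wanted variable and remember the results."""
        self.values = {name: self.reader(name) for name in self.wanted_variables}
        return self.values

    def output(self):
        """Graphing path: space-separated name:value pairs on one line."""
        return " ".join(f"{k}:{v}" for k, v in self.values.items())


class NameNodeAlarm(JMXBase):
    """Alarm path: call fetch(), skip output(), apply our own thresholds."""

    wanted_variables = ("FilesTotal", "LiveNodes", "DeadNodes")

    def check(self, max_files=1_000_000, min_live_fraction=0.5):
        v = self.fetch()
        total_nodes = v["LiveNodes"] + v["DeadNodes"]
        live_fraction = v["LiveNodes"] / total_nodes if total_nodes else 0.0
        if v["FilesTotal"] > max_files:
            return 2, f"CRITICAL - FilesTotal {v['FilesTotal']} > {max_files}"
        if live_fraction < min_live_fraction:
            return 2, f"CRITICAL - only {live_fraction:.0%} of DataNodes are live"
        return 0, f"OK - FilesTotal {v['FilesTotal']}, {live_fraction:.0%} of DataNodes live"
```

The same fetched values serve both tools: output() feeds the graphs, while check() turns them into an exit code and status line for Nagios.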
37
Monitoring Battle Plan
Start with the basics (Done)
  Ping, disk
Add Hadoop-specific alarms (Done)
  check_data_node
Add JMX graphing (Done)
  NameNodeOperations
Add JMX-based alarms (Done)
  FilesTotal > 1,000,000 or LiveNodes < 50%
38
Review
File system growth
  Size
  Number of files
  Number of blocks
  Ratios
Utilization
  CPU/memory
  Disk
Email (nightly)
  FSCK
  DFSADMIN
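One way to turn the growth figures into something reviewable is to compute a few ratios from two nightly snapshots. This is a sketch under stated assumptions: the snapshot keys (`files`, `blocks`, `bytes`) and the choice of ratios are hypothetical, and the numbers would have to be collected from the cluster separately.

```python
def nightly_summary(today, yesterday):
    """Derive review ratios from two snapshots of file system totals."""
    files = today["files"]
    return {
        # Many blocks per file suggests large files; near 1.0 suggests small ones.
        "blocks_per_file": today["blocks"] / files if files else 0.0,
        "avg_file_bytes": today["bytes"] / files if files else 0.0,
        # Day-over-day growth in stored bytes, as a percentage.
        "growth_pct": (
            100.0 * (today["bytes"] - yesterday["bytes"]) / yesterday["bytes"]
            if yesterday["bytes"]
            else 0.0
        ),
    }
```

Tracking these values daily, weekly, and monthly gives early warning of a file count heading toward the FilesTotal alarm threshold.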
39
The Future
JMX coming to JobTracker and TaskTracker (0.21)
  Collect and graph jobs running
  Collect and graph map/reduce per node
  Profile specific jobs in Cacti?