
1 How to monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters

2 Relevant Hadoop Information
–From 3 to 3,000 nodes
–Hardware/software failures common
–Redundant components: DataNode, TaskTracker
–Non-redundant components: NameNode, JobTracker, SecondaryNameNode
–Fast-evolving technology (best practices?)

3 Monitoring Software: Nagios
–Red/Yellow/Green alerts, escalations
–De facto standard, widely deployed
–Text-based configuration
–Web interface
–Pluggable with shell scripts/external apps (return 0 = OK)
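Nagios treats any external program as a plugin: it looks only at the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and the first line of output. A minimal sketch in Java of such a check, using a hypothetical CheckTcpPort class that tests whether a daemon is listening on a TCP port (not part of Nagios or of the tools in this talk):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Hypothetical stand-in for a Nagios plugin. Nagios only looks at the
    // exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and the
    // first line of output, so any external program works.
    public class CheckTcpPort {
        public static void main(String[] args) {
            String host = args[0];
            int port = Integer.parseInt(args[1]);
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 5000);
                System.out.println("TCP OK - " + host + ":" + port + " is accepting connections");
                System.exit(0);   // OK
            } catch (Exception e) {
                System.out.println("TCP CRITICAL - " + host + ":" + port + " - " + e.getMessage());
                System.exit(2);   // CRITICAL
            }
        }
    }

Nagios invokes it like any other check command and turns exit code 2 into a red alert.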

4 Cacti
–Performance graphing system
–RRD/RRA front end
–Slick web interface
–Template system for graph types
–Pluggable: SNMP input, shell script/external program

5

6 hadoop-cacti-jtg
–JMX fetching code with kick-off scripts
–Cacti templates for Hadoop
–Premade Nagios check scripts
–Helper/batch/automation scripts
–Apache License

7 Hadoop JMX

8 A Sample Cluster (part 1)
NameNode & SecondaryNameNode
–Hardware RAID
–8 GB RAM
–1x quad core
–DerbyDB (Hive) on SecondaryNameNode
JobTracker
–8 GB RAM
–1x quad core

9 A Sample Cluster (part 2)
Slaves (hadoopdata1-XXXX)
–JBOD: 8x 1 TB SATA disks
–16 GB RAM
–2x quad core

10 Prerequisites
–Nagios install (DAG RPMs)
–Cacti install (several RPMs)
–Liberal network access to the cluster

11 Alerts & Escalations
X nodes * Y services = less sleep
Define a policy:
–Wake-me-ups (SMS)
–Don't-wake-me-ups (email)
–Review (daily, weekly, monthly)

12 Wake Me Ups
NameNode
–Disk full (big, big headache)
–RAID array issues (failed disk)
JobTracker & SecondaryNameNode
–Don't find out too late that it is not working

13 Don't Wake Me Ups (or wake someone else up)
DataNode
–Warning: currently a failed disk will down the DataNode (see JIRA)
TaskTracker
Hardware
–Bad disk (start RMA)
Slaves are expendable (up to a point)

14 Monitoring Battle Plan
Start with the basics
–Ping, disk
Add Hadoop-specific alarms
–check_data_node
Add JMX graphing
–NameNodeOperations
Add JMX-based alarms
–FilesTotal > 1,000,000 or LiveNodes < 50%

15 The Basics: Nagios (all nodes)
–Host up (ping check)
–Disk % full
–SWAP > 85%
* Load-based alarms are somewhat useless; 389% CPU load is not necessarily a bad thing in Hadoopville

16 The Basics: Cacti (all nodes)
–CPU (full CPU)
–RAM/SWAP
–Network
–Disk usage

17 Disk Utilization

18 RAID Tools
hpacucli (not a Street Fighter move)
–Alerts on RAID events (NameNode): disk failed, rebuilding
–JBOD (DataNode): failed drive, drive errors
Dell, Sun, other vendor-specific tools

19 Before You Jump In
X nodes * Y checks = lots of work
About 3 nodes into the process…
–Wait!!! I need some interns!!!
Solution: S.I.C.C.T. – Semi-Intelligent Configuration-Cloning Tools
–(I made that up)
–(for this presentation)

20 Nagios answers: IS IT RUNNING?
–Text-based configuration

21 Cacti answers: HOW WELL IS IT RUNNING?
–Web-based configuration
–php-cli tools

22 Monitoring Battle Plan Thus Far
Start with the basics !!!!!!Done!!!!!!
–Ping, disk
Add Hadoop-specific alarms
–check_data_node
Add JMX graphing
–NameNodeOperations
Add JMX-based alarms
–FilesTotal > 1,000,000 or LiveNodes < 50%

23 Add Hadoop-Specific Alarms
Hadoop components with a web interface:
–NameNode 50070
–JobTracker 50030
–TaskTracker 50060
–DataNode 50075
check_http + regex = simple + effective

24 nagios_check_commands.cfg
Detect component failure via the web UI (future: newer Hadoop will have XML status)

define command {
    command_name check_remote_namenode
    command_line $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
    service_description check_remote_namenode
    use generic-service
    host_name hadoopname1
    check_command check_remote_namenode!50070
}

25 Monitoring Battle Plan
Start with the basics (Done)
–Ping, disk
Add Hadoop-specific alarms (Done)
–check_data_node
Add JMX graphing
–NameNodeOperations
Add JMX-based alarms
–FilesTotal > 1,000,000 or LiveNodes < 50%

26 JMX Graphing
–Enable JMX
–Import templates
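JMX on the Hadoop daemons is typically turned on by passing the standard com.sun.management.jmxremote system properties to each daemon's JVM; the exact options, port, and authentication are deployment choices. Once it is on, a quick sanity check is to list what a daemon exposes. A minimal sketch, where the host/port arguments and the "hadoop:*" domain pattern are assumptions rather than anything hadoop-cacti-jtg requires:

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Connects to a daemon's JMX port and prints every MBean in the "hadoop"
    // domain, a quick way to see what is available for graphing.
    // Usage: java ListHadoopMBeans <host> <port>
    public class ListHadoopMBeans {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + args[0] + ":" + args[1] + "/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url, null);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // "hadoop:*" is the MBean domain used by this era of Hadoop;
                // newer versions use "Hadoop:*".
                Set<ObjectName> names = mbs.queryNames(new ObjectName("hadoop:*"), null);
                for (ObjectName name : names) {
                    System.out.println(name);
                }
            } finally {
                connector.close();
            }
        }
    }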

27 JMX Graphing

28

29

30

31 Standard Java JMX
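The fetching itself is nothing exotic: it is the standard javax.management remote API. A minimal sketch that reads one NameNode attribute; the service URL and the FSNamesystemState MBean name are illustrative and vary with Hadoop version and configuration:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Reads one attribute from the NameNode over remote JMX using nothing
    // but the standard javax.management API.
    public class NameNodeJmxDump {
        public static void main(String[] args) throws Exception {
            // Illustrative host/port; depends on how JMX was enabled in hadoop-env.sh.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://hadoopname1:8004/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url, null);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // MBean name varies by Hadoop version; this one is illustrative.
                ObjectName fsState = new ObjectName("hadoop:service=NameNode,name=FSNamesystemState");
                Object filesTotal = mbs.getAttribute(fsState, "FilesTotal");
                System.out.println("FilesTotal:" + filesTotal);
            } finally {
                connector.close();
            }
        }
    }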

32 Monitoring Battle Plan Thus Far
Start with the basics !!!!!!Done!!!!!
–Ping, disk
Add Hadoop-specific alarms !Done!
–check_data_node
Add JMX graphing !Done!
–NameNodeOperations
Add JMX-based alarms
–FilesTotal > 1,000,000 or LiveNodes < 50%

33 Add JMX-Based Alarms
hadoop-cacti-jtg is flexible:
–Extend the fetch classes
–Don't call output()
–Write your own check logic

34 Quick JMX Base Walkthrough
–url, user, pass, object specified from the CLI
–wantedVariables, wantedOperations set by inheritance
–fetch() and output() provided
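The real base class lives in hadoop-cacti-jtg; the sketch below is only an approximation of the pattern the slide describes, with illustrative class and method names rather than the project's actual API:

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Illustrative base class: connection details come from the command line,
    // subclasses declare the attributes they want, and the base class
    // provides fetch() and output().
    public abstract class JmxFetchBase {
        protected final Map<String, Object> results = new LinkedHashMap<String, Object>();

        // Subclasses pick the MBean attributes to collect.
        protected abstract String[] wantedVariables();

        public void fetch(String url, String user, String pass, String objectName) throws Exception {
            Map<String, Object> env = new HashMap<String, Object>();
            if (user != null && user.length() > 0) {
                // Standard JMX credentials format: { user, password }
                env.put(JMXConnector.CREDENTIALS, new String[] { user, pass });
            }
            JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url), env);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName target = new ObjectName(objectName);
                for (String attribute : wantedVariables()) {
                    results.put(attribute, mbs.getAttribute(target, attribute));
                }
            } finally {
                connector.close();
            }
        }

        // Cacti data-input scripts print "name:value name:value ..." on one line.
        public void output() {
            StringBuilder line = new StringBuilder();
            for (Map.Entry<String, Object> entry : results.entrySet()) {
                line.append(entry.getKey()).append(':').append(entry.getValue()).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }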

35 Extend for NameNode
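Building on the JmxFetchBase sketch above, a NameNode-specific extension might only need to name the attributes to graph. The attribute names below exist on the NameNode's FSNamesystemState MBean in this era of Hadoop, but the real hadoop-cacti-jtg class may differ:

    // Illustrative subclass: just name the NameNode attributes to graph.
    public class NameNodeFetch extends JmxFetchBase {
        protected String[] wantedVariables() {
            return new String[] { "FilesTotal", "BlocksTotal", "CapacityUsed", "CapacityRemaining" };
        }

        public static void main(String[] args) throws Exception {
            NameNodeFetch fetcher = new NameNodeFetch();
            // url, user, pass, object name supplied on the command line, as on the slide
            fetcher.fetch(args[0], args[1], args[2], args[3]);
            fetcher.output();   // one line of name:value pairs for Cacti
        }
    }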

36 Extend for Nagios
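For a Nagios alarm the same fetch logic is reused: skip output(), apply a threshold, and exit with a Nagios status code, matching the FilesTotal > 1,000,000 alarm in the battle plan. Again an illustrative sketch layered on the classes above, not the project's actual check script:

    // Illustrative Nagios check reusing the same fetch logic: skip output(),
    // apply a threshold, and exit with a Nagios status code instead.
    public class CheckFilesTotal extends NameNodeFetch {
        public static void main(String[] args) throws Exception {
            CheckFilesTotal check = new CheckFilesTotal();
            check.fetch(args[0], args[1], args[2], args[3]);
            long filesTotal = Long.parseLong(String.valueOf(check.results.get("FilesTotal")));
            if (filesTotal > 1000000L) {
                System.out.println("FILESTOTAL CRITICAL - " + filesTotal);
                System.exit(2);   // CRITICAL
            }
            System.out.println("FILESTOTAL OK - " + filesTotal);
            System.exit(0);       // OK
        }
    }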

37 Monitoring Battle Plan
Start with the basics !DONE!
–Ping, disk
Add Hadoop-specific alarms !DONE!
–check_data_node
Add JMX graphing !DONE!
–NameNodeOperations
Add JMX-based alarms !DONE!
–FilesTotal > 1,000,000 or LiveNodes < 50%

38 Review
File system growth
–Size
–Number of files
–Number of blocks
–Ratios
Utilization
–CPU/memory
–Disk
Email (nightly)
–FSCK
–DFSADMIN

39 The Future
JMX coming to JobTracker and TaskTracker (0.21)
–Collect and graph jobs running
–Collect and graph map/reduce per node
–Profile specific jobs in Cacti?

