How to monitor the $H!T out of Hadoop

How to monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters.
Presentation transcript:

Relevant Hadoop Information. Clusters range from 3 to 3000 nodes. Hardware/software failures are “common”. Redundant components: DataNode, TaskTracker. Non-redundant components: NameNode, JobTracker, SecondaryNameNode. Fast-evolving technology (best practices?).

Monitoring Software: Nagios. Red/yellow/green alerts and escalations. De facto standard, widely deployed. Text-based configuration. Web interface. Pluggable with shell scripts and external apps (a plugin returns 0 for OK).
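The "return 0 - OK" convention on the slide above is the Nagios plugin exit-code contract: 0 is OK, 1 is WARNING, 2 is CRITICAL, 3 is UNKNOWN. A minimal sketch of a plugin body in Python follows. The function names and the "higher is worse" threshold rule are assumptions for illustration, not part of the presentation.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value to an exit code (assumes higher values are worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(name, value, warn, crit):
    """Return (exit_code, status_line); Nagios shows the first line of output."""
    code = classify(value, warn, crit)
    return code, f"{name} {LABELS[code]} - value={value}"
```

A real plugin would print the status line and call `sys.exit()` with the code, since Nagios reads the process exit status rather than the text.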

Cacti. Performance graphing system. RRD/RRA front end. Slick web interface. Template system for graph types. Pluggable inputs: SNMP, shell scripts, external programs.

hadoop-cacti-jtg. JMX fetching code with kick-off scripts. Cacti templates for Hadoop. Premade Nagios check scripts. Helper, batch, and automation scripts. Apache License.

Hadoop JMX

Sample Cluster, part 1. NameNode, SecNameNode, JobTracker. Hardware RAID, 8 GB RAM, 1x quad core. Derby DB (Hive) on the SecNameNode. JobTracker: 8 GB RAM.

Sample Cluster, part 2. Slaves (hadoopdata1-XXXX): JBOD with 8x 1 TB SATA disks, 16 GB RAM, 2x quad core.

Prerequisites. Nagios (install): DAG RPMs. Cacti (install): several RPMs. Liberal network access to the cluster.

Alerts & Escalations. X nodes * Y services = less sleep. Define a policy. Wake Me Up’s (SMS). Don’t Wake Me Up’s (email). Review (daily, weekly, monthly).

Wake Me Up’s. NameNode, JobTracker, SecNameNode. Disk full (big, big headache). RAID array issues (failed disk). JobTracker, SecNameNode: you do not realize it is not working until it is too late.

Don’t Wake Me Up’s (or ‘wake someone else up’). DataNode, TaskTracker. Warning: currently a failed disk will take down the DataNode (see Jira). TaskTracker. Hardware: bad disk (start RMA). Slaves are expendable (up to a point).
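The SMS versus email policy from these slides could be written down as a small routing table. The sketch below is illustrative only. The component names come from the slides, while the channel strings, the `route_alert` function, and the `page_fraction` threshold for "up to a point" are assumptions.

```python
# Hypothetical routing table; channel names and threshold are assumptions.
WAKE_ME_UP = frozenset({"NameNode", "JobTracker", "SecNameNode"})
DONT_WAKE_ME_UP = frozenset({"DataNode", "TaskTracker"})


def route_alert(component, slaves_down=0, slaves_total=0, page_fraction=0.25):
    """Return "sms" for alerts that should wake someone up, else "email"."""
    if component in WAKE_ME_UP:
        return "sms"
    if component in DONT_WAKE_ME_UP:
        # Slaves are expendable only up to a point: page once too many are down.
        if slaves_total > 0 and slaves_down / slaves_total >= page_fraction:
            return "sms"
        return "email"
    # Unknown components default to email so they surface in the periodic review.
    return "email"
```

In Nagios itself the same split would normally be expressed with contact groups and notification commands. The function only makes the decision rule explicit.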

Monitoring Battle Plan. Start with the basics: ping, disk. Add Hadoop-specific alarms: check_data_node. Add JMX graphing: NameNodeOperations. Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%.

The Basics: Nagios (all nodes). Host up (ping check). Disk % full. Swap > 85%. Note: load-based alarms are somewhat useless; 389% CPU load is not necessarily a bad thing in Hadoopville.
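The disk and swap checks above reduce to a percentage and a threshold. A rough Python sketch follows, assuming a Linux host where swap figures are read from `/proc/meminfo`. The function names are hypothetical, and the 85% default is the swap figure from the slide.

```python
import shutil


def percent_used(total, used):
    """Return used/total as a percentage, or 0.0 when total is zero."""
    return 100.0 * used / total if total else 0.0


def disk_percent_full(path="/"):
    """Percent of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used)


def swap_percent_used(meminfo_text):
    """Compute swap usage from the text of /proc/meminfo (Linux only)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return percent_used(total, total - free)


def swap_alarm(meminfo_text, threshold=85.0):
    """True when swap usage exceeds the threshold."""
    return swap_percent_used(meminfo_text) > threshold
```

On a live node the input would be something like `open("/proc/meminfo").read()`. The parsing is kept separate from the file read so the logic can be exercised with sample text.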

The Basics: Cacti (all nodes). CPU (full CPU). RAM/swap. Network. Disk usage.

Disk Utilization

RAID Tools. hpacucli (not a Street Fighter move). Alerts on RAID events (NameNode): disk failed, rebuilding. JBOD (DataNode): failed drive, drive errors. Dell, Sun, and other vendor-specific tools.

Before you jump in. X nodes * Y checks = lots of work. About 3 nodes into the process: “Wait!!! I need some interns!!!” Solution: S.I.C.C.T., Semi-Intelligent Configuration-Cloning Tools (I made that up for this presentation).
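The configuration-cloning idea can be illustrated by stamping out one Nagios service definition per host from a template. This is a sketch, not the project's actual tooling. The `clone_services` helper and the `check_remote_datanode` command name are hypothetical, and the host names only follow the hadoopdata naming pattern used in the talk.

```python
# Template mirrors a Nagios "define service" block; doubled braces are literal.
SERVICE_TEMPLATE = """define service {{
    service_description  {check}
    use                  generic-service
    host_name            {host}
    check_command        {check}!{port}
}}
"""


def clone_services(hosts, check, port):
    """Return one service definition per host, separated by blank lines."""
    return "\n".join(
        SERVICE_TEMPLATE.format(host=host, check=check, port=port)
        for host in hosts
    )


# Example: three hypothetical slaves, DataNode web UI on port 50075.
EXAMPLE_CONFIG = clone_services(
    [f"hadoopdata{n}" for n in range(1, 4)],
    "check_remote_datanode",  # hypothetical command name
    50075,
)
```

Writing `EXAMPLE_CONFIG` to a file under the Nagios configuration directory would replace the hand-editing that the slide complains about.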

Nagios. Answers “IS IT RUNNING?” Text-based configuration.

Cacti. Answers “HOW WELL IS IT RUNNING?” Web-based configuration. php-cli tools.

Monitoring Battle Plan Thus Far. Start with the basics: ping, disk (done). Add Hadoop-specific alarms: check_data_node. Add JMX graphing: NameNodeOperations. Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%.

Add Hadoop-Specific Alarms. Hadoop components with a web interface: NameNode 50070, JobTracker 50030, TaskTracker 50060, DataNode 50075. check_http + regex = simple + effective.

nagios_check_commands.cfg

Component Failure

define command {
    command_name  check_remote_namenode
    command_line  $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
    service_description  check_remote_namenode
    use                  generic-service
    host_name            hadoopname1
    check_command        check_remote_namenode!50070
}

Component Failure (Future): newer Hadoop will have XML status.
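For readers without check_http at hand, the "fetch the status page and match a regex" check can be approximated in Python. This is a sketch of the same idea rather than a replacement for the plugin. The function names are assumptions, and the exit codes follow the Nagios convention (0 OK, 2 CRITICAL).

```python
import http.client
import re
import urllib.request


def body_matches(body, pattern):
    """True when the regex occurs anywhere in the page body."""
    return re.search(pattern, body) is not None


def check_web_ui(host, port, path, pattern, timeout=10):
    """Fetch http://host:port/path and return (exit_code, message)."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException) as exc:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return 2, f"CRITICAL - {url} unreachable: {exc}"
    if body_matches(body, pattern):
        return 0, f"OK - pattern {pattern!r} found at {url}"
    return 2, f"CRITICAL - pattern {pattern!r} not found at {url}"
```

Calling `check_web_ui("hadoopname1", 50070, "/dfshealth.jsp", "NameNode")` would mirror the service definition above, using the host name and port taken from the slide.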

Monitoring Battle Plan. Start with the basics: ping, disk (done). Add Hadoop-specific alarms: check_data_node (done). Add JMX graphing: NameNodeOperations. Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%.

JMX Graphing. Enable JMX. Import templates.

JMX Graphing

JMX Graphing

JMX Graphing

Standard Java JMX

Monitoring Battle Plan Thus Far. Start with the basics: ping, disk (done). Add Hadoop-specific alarms: check_data_node (done). Add JMX graphing: NameNodeOperations (done). Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50%.

Add JMX-Based Alarms. hadoop-cacti-jtg is flexible: extend the fetch classes, don’t call output(), and write your own check logic.
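The "write your own check logic" step for the thresholds named in the battle plan (FilesTotal > 1,000,000 or LiveNodes < 50%) might look like the sketch below. It assumes the values have already been fetched over JMX. Treating either condition as CRITICAL is an assumption, since the slides do not say which severity applies, and `check_namenode` is a hypothetical name.

```python
def check_namenode(files_total, live_nodes, total_nodes,
                   max_files=1_000_000, min_live_fraction=0.5):
    """Return (exit_code, message) using Nagios codes: 0 OK, 2 CRITICAL."""
    problems = []
    if files_total > max_files:
        problems.append(f"FilesTotal {files_total} > {max_files}")
    if total_nodes > 0 and live_nodes / total_nodes < min_live_fraction:
        problems.append(f"LiveNodes {live_nodes}/{total_nodes} below "
                        f"{min_live_fraction:.0%}")
    if problems:
        return 2, "CRITICAL - " + "; ".join(problems)
    return 0, f"OK - FilesTotal {files_total}, LiveNodes {live_nodes}/{total_nodes}"
```

Keeping the thresholds as parameters lets the same check be reused across clusters of different sizes.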

Quick JMX Base Walkthrough. url, user, pass, and object are specified from the CLI. wantedVariables and wantedOperations are set by inheritance. fetch() and output() are provided.
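The design described above (a base class that provides fetch() and output(), with subclasses declaring what they want) can be mimicked in Python to show its shape. This is an analogy, not the hadoop-cacti-jtg code, which is Java. The class names, the injected `read_attribute` callable standing in for a real JMX connection, the `BlocksTotal` attribute, and the object name in the usage note are all assumptions.

```python
class JmxFetcher:
    """Base class: subclasses only declare the attributes they want."""

    wanted_variables = ()

    def __init__(self, url, user, password, object_name, read_attribute):
        # url/user/password/object_name would come from the command line.
        self.url = url
        self.user = user
        self.password = password
        self.object_name = object_name
        # Hypothetical stand-in for a JMX client: (object_name, attr) -> value.
        self.read_attribute = read_attribute

    def fetch(self):
        """Read every wanted attribute and return them as a dict."""
        return {name: self.read_attribute(self.object_name, name)
                for name in self.wanted_variables}

    def output(self, values):
        """Format as space-separated name:value pairs for a graphing script."""
        return " ".join(f"{name}:{value}" for name, value in values.items())


class NameNodeFetcher(JmxFetcher):
    """Extension by inheritance: only the wanted attributes change."""

    wanted_variables = ("FilesTotal", "BlocksTotal")
```

A Nagios-oriented subclass would call `fetch()`, skip `output()`, and apply its own thresholds to the returned dict. For example, `NameNodeFetcher(url, user, pw, "example:type=NameNode", reader).fetch()` returns the two values keyed by name.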

Extend for NameNode

Extend for Nagios

Monitoring Battle Plan. Start with the basics: ping, disk (done). Add Hadoop-specific alarms: check_data_node (done). Add JMX graphing: NameNodeOperations (done). Add JMX-based alarms: FilesTotal > 1,000,000 or LiveNodes < 50% (done).

Review. File system: growth, utilization. Email (nightly): size, number of files, number of blocks, ratios. Utilization: CPU/memory, disk. Email (nightly): fsck, dfsadmin.
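The slide does not say which ratios go into the nightly email. One plausible reading, shown here as an assumption, is average bytes per file, blocks per file, and bytes per block. The `nightly_ratios` helper is hypothetical.

```python
def nightly_ratios(total_bytes, files, blocks):
    """Compute simple file-system ratios; zero denominators yield 0.0."""
    return {
        "bytes_per_file": total_bytes / files if files else 0.0,
        "blocks_per_file": blocks / files if files else 0.0,
        "bytes_per_block": total_bytes / blocks if blocks else 0.0,
    }
```

Tracking these over time gives a rough signal for a growing population of small files, which ties back to the FilesTotal alarm.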

The Future. JMX is coming to JobTracker and TaskTracker (0.21). Collect and graph jobs running. Collect and graph map/reduce per node. Profile specific jobs in Cacti?