Why is my Hadoop* job slow?
Bikas Saha (@bikassaha), Hitesh Shah
*Apache Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Zeppelin and the Hadoop elephant logo are trademarks of the Apache Software Foundation.
Hortonworks: Powering the Future of Data
Agenda
- Metrics and Monitoring
- Logging and Correlation
- Tracing and Analysis
Metrics and Monitoring
- Metrics as high level pointers
- Ambari Metrics System
- Ambari Grafana integration
- HBase, HDFS and YARN dashboards
- Metrics based alerting
Metrics as high level pointers
- Machine level metrics, e.g. CPU load
- Application level metrics, e.g. HDFS counters
- Metrics at a point in time
- Metric anomalies along a time series (see the sketch below)
- Correlated anomalies
- The problem: you need to know what to look for
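A minimal sketch of what "anomalies along a time series" can mean in practice: flag points whose rolling z-score exceeds a threshold. The sample values, window size and threshold below are illustrative assumptions, not part of the deck.

```python
# Hypothetical example: flag anomalies in a metric time series with a rolling z-score.
# The sample data, window size and threshold are illustrative assumptions.
from statistics import mean, stdev

def rolling_zscore_anomalies(values, window=12, threshold=3.0):
    """Return indices whose value deviates > threshold std-devs from the trailing window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# e.g. one-minute CPU load samples for a host
cpu_load = [0.4, 0.5, 0.45, 0.5, 0.42, 0.48, 0.5, 0.46, 0.44, 0.47, 0.49, 0.45, 3.2]
print(rolling_zscore_anomalies(cpu_load))  # -> [12]
```

Correlating such flags across hosts and services (the "correlated anomalies" bullet) is what narrows a slow job down to a concrete cause.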
Ambari Metrics Service - Motivation
- First version released with Ambari 2.0.0
- Ganglia's capabilities were limited
- OpenTSDB has a GPL license and needs a Hadoop cluster of its own
- Need service level aggregation as well as time based aggregation
- Alerts based on the metrics system
- Ability to scale past 1000 nodes
- Ability to perform analytics based on a use case
- Fine grained control over aspects like retention, collection intervals and aggregation
- Pluggable and extensible
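As a rough sketch of how AMS data can be pulled programmatically: the snippet below queries the Metrics Collector REST API, assuming the default collector port (6188) and the /ws/v1/timeline/metrics endpoint; the host, metric and app names are placeholders.

```python
# Sketch: pull a metric series from the Ambari Metrics Collector REST API.
# Assumes the collector's default port (6188) and the /ws/v1/timeline/metrics
# endpoint; host, metric and app names below are placeholders.
import time
import requests

COLLECTOR = "http://ams-collector.example.com:6188"

def fetch_metric(metric_name, app_id, hostname, minutes=60):
    end = int(time.time() * 1000)            # AMS timestamps are epoch millis
    start = end - minutes * 60 * 1000
    resp = requests.get(
        f"{COLLECTOR}/ws/v1/timeline/metrics",
        params={
            "metricNames": metric_name,
            "appId": app_id,                 # e.g. HOST, namenode, hbase
            "hostname": hostname,
            "startTime": start,
            "endTime": end,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("metrics", [])

for m in fetch_metric("cpu_user", "HOST", "worker01.example.com"):
    print(m.get("metricname"), len(m.get("metrics", {})), "data points")
```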
Ambari Grafana Integration
- Open source dashboard builder integrated with AMS; available from Ambari 2.2.2
- Pre-defined host level and service level dashboards (HDFS, HBase, YARN, etc.)
- Added to Ambari through the API after an upgrade
HBase Dashboard
HDFS Dashboard
YARN Dashboard
Metrics based Alerting
- Top N support to quickly identify potential offenders
- Alerting based on time series data
Agenda
- Metrics and Monitoring
- Logging and Correlation
- Tracing and Analysis
Logging and Correlation
- HDFS and YARN audit logs
- Caller context
- YARN Application Timeline Service
- Lineage tracking of operations across workloads
- Ambari Log Search
HDFS Audit Logs and Caller Context
FSNamesystem.audit: allowed=true ugi=userA (auth:SIMPLE) ip=/172.22.68.32 cmd=create src=/tmp/in/_temporary/1/_temporary/attempt_1464484887407_0009_m_009995_0/part-m-09995 dst=null perm=root:hdfs:rw-r--r-- proto=rpc callerContext=tez_ta:attempt_1464484887407_0009_1_00_009995_0
FSNamesystem.audit: allowed=true ugi=userA (auth:SIMPLE) ip=/172.22.68.33 cmd=create src=/tmp/in2/_temporary/1/_temporary/attempt_1464484887407_0011_m_000097_0/part-m-00097 dst=null perm=root:hdfs:rw-r--r-- proto=rpc callerContext=mr_attempt_1464484887407_0011_m_000097_0
FSNamesystem.audit: allowed=true ugi=userB (auth:SIMPLE) ip=/172.22.68.34 cmd=create src=/tmp/in2/_temporary/1/_temporary/attempt_1464484887407_0011_m_000095_0/part-m-00095 dst=null perm=root:hdfs:rw-r--r-- proto=rpc callerContext=mr_attempt_1464484887407_0011_m_000095_0
- It is now possible to infer which application/job did what in HDFS.
- Files created can be traced back to the MR or Tez job, and the specific task attempt, that created them.
- With simple string manipulation and aggregation you can find the jobs inducing high load on the NameNode (see the sketch below).
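A minimal sketch of that string manipulation: group audit lines by callerContext and count NameNode operations per job. The log path is a placeholder; the parsing follows the key=value layout of the sample lines above.

```python
# Sketch: count NameNode operations per callerContext from an HDFS audit log.
# The audit file path is a placeholder; field parsing follows the sample
# "key=value" layout shown above.
import re
from collections import Counter

AUDIT_LOG = "/var/log/hadoop/hdfs/hdfs-audit.log"   # placeholder path

caller_re = re.compile(r"callerContext=(\S+)")
ops_per_caller = Counter()

with open(AUDIT_LOG) as f:
    for line in f:
        m = caller_re.search(line)
        if m:
            ops_per_caller[m.group(1)] += 1

# Jobs generating the most NameNode traffic float to the top.
for caller, count in ops_per_caller.most_common(10):
    print(f"{count:>8}  {caller}")
```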
ResourceManager Audit Logs and Caller Context
resourcemanager.RMAuditLogger: USER=userA IP=172.22.68.32 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1464484887407_0001 CALLERCONTEXT=PIG-pigSmoke.sh-8a052588-0013-4e39-83b1-ebad699d8e2e
resourcemanager.RMAuditLogger: USER=userA IP=172.22.68.30 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1464484887407_0009 CALLERCONTEXT=CLI
resourcemanager.RMAuditLogger: USER=userB IP=172.22.68.34 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1464484887407_0008 CALLERCONTEXT=mr_attempt_1464484887407_0007_m_000000_0
resourcemanager.RMAuditLogger: USER=userB IP=172.22.68.30 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1464484887407_0012 CALLERCONTEXT=HIVE_SSN_ID:f3aadf99-9e36-494b-84a1-99b685ac344b
- Mapping a YARN application to its application type and originating workload is now much easier.
- It could be made even easier if "mr_attempt_1464484887407_0007_m_000000_0" pointed to an Oozie workflow instead of the MR job.
- Who killed my application, and how (command line, web service)?
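A hedged sketch of that mapping: classify each submitted application by the CALLERCONTEXT prefixes visible in the sample lines above (PIG-, HIVE_SSN_ID:, mr_attempt_, CLI). The log path and prefix table are illustrative assumptions.

```python
# Sketch: classify YARN applications by the workload that submitted them,
# using the CALLERCONTEXT prefixes seen in the sample RM audit lines above.
# The log path and prefix table are illustrative assumptions.
import re

RM_AUDIT_LOG = "/var/log/hadoop-yarn/yarn/rm-audit.log"   # placeholder path

PREFIXES = [
    ("HIVE_SSN_ID:", "Hive session"),
    ("PIG-", "Pig script"),
    ("mr_attempt_", "launched by an MR task"),
    ("tez_", "launched by a Tez task"),
    ("CLI", "command line"),
]

entry_re = re.compile(r"APPID=(\S+)\s+CALLERCONTEXT=(\S+)")

with open(RM_AUDIT_LOG) as f:
    for line in f:
        m = entry_re.search(line)
        if not m:
            continue
        app_id, caller = m.groups()
        workload = next((label for p, label in PREFIXES if caller.startswith(p)), "unknown")
        print(f"{app_id} -> {workload} ({caller})")
```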
YARN Application Timeline Service
- YARN service for fine grained application level tracing
- Enables complex metadata to be recorded as the YARN app makes progress
- Allows retrieval of this timeline data based on filters
- Can be used to drive limited online analytics and extensive post-hoc analysis
Lineage Tracking using YARN Timeline
Timeline:8188/ws/v1/timeline/TEZ_DAG_ID/dag_1464484887407_0013_1
dagContext: {
  callerId: "root_20160529021115_006f8007-5840-4c64-9970-c1b506f68db2",
  callerType: "HIVE_QUERY_ID",
  context: "HIVE",
  description: "select user, count(visit_id) as visits from users group by user order by visits"
}
Timeline:8188/ws/v1/timeline/HIVE_QUERY_ID/root_20160529021115_006f8007-5840-4c64-9970-c1b506f68db2
hiveContext: {
  callerId: "workflow_abcd",
  callerType: "OOZIE_ID",
  context: "OOZIE",
  description: "Daily ETL Summary Job"
}
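A sketch of walking that caller chain programmatically through the Timeline REST API (8188 is the default timeline server port). The field names (callerId, callerType inside the context objects) follow the simplified example above; a real Timeline entity may nest this information differently, e.g. under its otherinfo section.

```python
# Sketch: walk the lineage chain exposed by the YARN Timeline Service,
# following callerType/callerId links (Tez DAG -> Hive query -> Oozie workflow).
# Field names follow the simplified example above; real Timeline entities may
# nest this information differently (e.g. under "otherinfo").
import requests

TIMELINE = "http://timeline.example.com:8188"   # placeholder host, default port

def get_entity(entity_type, entity_id):
    resp = requests.get(f"{TIMELINE}/ws/v1/timeline/{entity_type}/{entity_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()

def find_caller(entity):
    """Return (callerType, callerId) if the entity records who created it."""
    for value in entity.values():
        if isinstance(value, dict) and "callerId" in value and "callerType" in value:
            return value["callerType"], value["callerId"]
    return None

# Start from a Tez DAG and climb until nothing points further up.
entity_type, entity_id = "TEZ_DAG_ID", "dag_1464484887407_0013_1"
while True:
    caller = find_caller(get_entity(entity_type, entity_id))
    if caller is None:
        break
    print(f"{entity_type}/{entity_id} was started by {caller[0]} {caller[1]}")
    entity_type, entity_id = caller
```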
Ambari Log Search
Agenda
- Metrics and Monitoring
- Logging and Correlation
- Tracing and Analysis
Tracing and Analysis
- Use Big Data methods to solve Big Data problems
- Apache Zeppelin as an analytical tool
- Hive/Tez/YARN notebook for analysis (see the sketch below)
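A hedged sketch of the "Big Data methods for Big Data problems" idea: a Zeppelin %pyspark paragraph that loads the HDFS audit logs into Spark and aggregates NameNode operations per callerContext. The HDFS path and parsed field names are assumptions; spark is the SparkSession that Zeppelin provides to the paragraph.

```python
%pyspark
# Sketch of a Zeppelin paragraph: use Spark to aggregate HDFS audit logs at scale.
# The HDFS path and the parsed field names are illustrative assumptions;
# `spark` is the SparkSession that Zeppelin injects into the paragraph.
from pyspark.sql import functions as F

audit = spark.read.text("hdfs:///var/log/hdfs-audit/*")   # placeholder location

parsed = audit.select(
    F.regexp_extract("value", r"cmd=(\S+)", 1).alias("cmd"),
    F.regexp_extract("value", r"callerContext=(\S+)", 1).alias("caller"),
)

# Which jobs hammer the NameNode the hardest, and with which operations?
(parsed.where(F.col("caller") != "")
       .groupBy("caller", "cmd")
       .count()
       .orderBy(F.desc("count"))
       .show(20, truncate=False))
```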
Zeppelin for Ad-hoc Analytics
YARN Analyzer
Tez Analyzer
Thank You Hortonworks: Powering the Future of Data