Why is my Hadoop* job slow?
Bikas Saha (@bikassaha), Hitesh Shah
*Apache Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Zeppelin and the Hadoop elephant logo are trademarks of the Apache Software Foundation.
Hortonworks: Powering the Future of Data
Agenda
- Metrics and Monitoring
- Logging and Correlation
- Tracing and Analysis
Metrics and Monitoring
- Metrics as high level pointers
- Ambari Metrics System
- Ambari Grafana integration
- HBase, HDFS and YARN dashboards
- Metrics based alerting
Metrics as high level pointers
- Machine level metrics, e.g. CPU load
- Application level metrics, e.g. HDFS counters
- Metrics at a point in time
- Metric anomalies along a time series (see the sketch below)
- Correlated anomalies
- The problem: you need to know what to look for
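A minimal sketch of what "anomalies along a time series" can mean in practice: flag points whose rolling z-score exceeds a threshold. The sample values, window size and threshold below are illustrative assumptions, not part of the deck.

```python
# Hypothetical example: flag anomalies in a metric time series with a rolling z-score.
# The sample data, window size and threshold are illustrative assumptions.
from statistics import mean, stdev

def rolling_zscore_anomalies(values, window=12, threshold=3.0):
    """Return indices whose value deviates > threshold std-devs from the trailing window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# e.g. one-minute CPU load samples for a host
cpu_load = [0.4, 0.5, 0.45, 0.5, 0.42, 0.48, 0.5, 0.46, 0.44, 0.47, 0.49, 0.45, 3.2]
print(rolling_zscore_anomalies(cpu_load))  # -> [12]
```

Correlating such flags across hosts and services (the "correlated anomalies" bullet) is what narrows a slow job down to a concrete cause.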
Ambari Metrics Service - Motivation
- First version released with Ambari 2.0.0
- Ganglia's capabilities were limited
- OpenTSDB has a GPL license and needs a Hadoop cluster of its own
- Need service level aggregation as well as time based aggregation
- Alerts based on the metrics system
- Ability to scale past 1000 nodes
- Ability to perform analytics based on a use case
- Fine grained control over aspects like retention, collection intervals and aggregation
- Pluggable and extensible
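As a rough sketch of how AMS data can be pulled programmatically: the snippet below queries the Metrics Collector REST API, assuming the default collector port (6188) and the /ws/v1/timeline/metrics endpoint; the host, metric and app names are placeholders.

```python
# Sketch: pull a metric series from the Ambari Metrics Collector REST API.
# Assumes the collector's default port (6188) and the /ws/v1/timeline/metrics
# endpoint; host, metric and app names below are placeholders.
import time
import requests

COLLECTOR = "http://ams-collector.example.com:6188"

def fetch_metric(metric_name, app_id, hostname, minutes=60):
    end = int(time.time() * 1000)            # AMS timestamps are epoch millis
    start = end - minutes * 60 * 1000
    resp = requests.get(
        f"{COLLECTOR}/ws/v1/timeline/metrics",
        params={
            "metricNames": metric_name,
            "appId": app_id,                 # e.g. HOST, namenode, hbase
            "hostname": hostname,
            "startTime": start,
            "endTime": end,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("metrics", [])

for m in fetch_metric("cpu_user", "HOST", "worker01.example.com"):
    print(m.get("metricname"), len(m.get("metrics", {})), "data points")
```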
Ambari Grafana Integration
- Open source dashboard builder integrated with AMS; available from Ambari 2.2.2
- Pre-defined host level and service level dashboards (HDFS, HBase, YARN, etc.)
- Added to Ambari through the API after an upgrade
HBase Dashboard
HDFS Dashboard
YARN Dashboard
Metrics based Alerting
- Top N support to quickly identify potential offenders
- Alerting based on time series data
Agenda
- Metrics and Monitoring
- Logging and Correlation
- Tracing and Analysis
Logging and Correlation
- HDFS and YARN audit logs
- Caller context
- YARN Application Timeline Service
- Lineage tracking of operations across workloads
- Ambari Log Search
HDFS Audit Logs and Caller Context
FSNamesystem.audit: allowed=true ugi=userA (auth:SIMPLE) ip=/172.22.68.32 cmd=create src=/tmp/in/_temporary/1/_temporary/attempt_1464484887407_0009_m_009995_0/part-m-09995 dst=null perm=root:hdfs:rw-r--r-- proto=rpc callerContext=tez_ta:attempt_1464484887407_0009_1_00_009995_0
FSNamesystem.audit: allowed=true ugi=userA (auth:SIMPLE) ip=/172.22.68.33 cmd=create src=/tmp/in2/_temporary/1/_temporary/attempt_1464484887407_0011_m_000097_0/part-m-00097 dst=null perm=root:hdfs:rw-r--r-- proto=rpc callerContext=mr_attempt_1464484887407_0011_m_000097_0
FSNamesystem.audit: allowed=true ugi=userB (auth:SIMPLE) ip=/172.22.68.34 cmd=create src=/tmp/in2/_temporary/1/_temporary/attempt_1464484887407_0011_m_000095_0/part-m-00095 dst=null perm=root:hdfs:rw-r--r-- proto=rpc callerContext=mr_attempt_1464484887407_0011_m_000095_0
- It is now possible to infer which application/job did what in HDFS.
- Files created can be traced back to the MR or Tez job, and the specific task attempt, that created them.
- With simple string manipulation and aggregation you can find the jobs inducing high load on the NameNode (see the sketch below).
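A minimal sketch of that string manipulation: group audit lines by callerContext and count NameNode operations per job. The log path is a placeholder; the parsing follows the key=value layout of the sample lines above.

```python
# Sketch: count NameNode operations per callerContext from an HDFS audit log.
# The audit file path is a placeholder; field parsing follows the sample
# "key=value" layout shown above.
import re
from collections import Counter

AUDIT_LOG = "/var/log/hadoop/hdfs/hdfs-audit.log"   # placeholder path

caller_re = re.compile(r"callerContext=(\S+)")
ops_per_caller = Counter()

with open(AUDIT_LOG) as f:
    for line in f:
        m = caller_re.search(line)
        if m:
            ops_per_caller[m.group(1)] += 1

# Jobs generating the most NameNode traffic float to the top.
for caller, count in ops_per_caller.most_common(10):
    print(f"{count:>8}  {caller}")
```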
ResourceManager Audit Logs and Caller Context
resourcemanager.RMAuditLogger: USER=userA IP=172.22.68.32 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1464484887407_0001 CALLERCONTEXT=PIG-pigSmoke.sh-8a052588-0013-4e39-83b1-ebad699d8e2e
resourcemanager.RMAuditLogger: USER=userA IP=172.22.68.30 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1464484887407_0009 CALLERCONTEXT=CLI
resourcemanager.RMAuditLogger: USER=userB IP=172.22.68.34 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1464484887407_0008 CALLERCONTEXT=mr_attempt_1464484887407_0007_m_000000_0
resourcemanager.RMAuditLogger: USER=userB IP=172.22.68.30 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1464484887407_0012 CALLERCONTEXT=HIVE_SSN_ID:f3aadf99-9e36-494b-84a1-99b685ac344b
- Mapping a YARN application to its application type and originating workload is now much easier.
- It could be made even easier if "mr_attempt_1464484887407_0007_m_000000_0" pointed to an Oozie workflow instead of the MR job.
- Who killed my application, and how (command line, web service)?
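A hedged sketch of that mapping: classify each submitted application by the CALLERCONTEXT prefixes visible in the sample lines above (PIG-, HIVE_SSN_ID:, mr_attempt_, CLI). The log path and prefix table are illustrative assumptions.

```python
# Sketch: classify YARN applications by the workload that submitted them,
# using the CALLERCONTEXT prefixes seen in the sample RM audit lines above.
# The log path and prefix table are illustrative assumptions.
import re

RM_AUDIT_LOG = "/var/log/hadoop-yarn/yarn/rm-audit.log"   # placeholder path

PREFIXES = [
    ("HIVE_SSN_ID:", "Hive session"),
    ("PIG-", "Pig script"),
    ("mr_attempt_", "launched by an MR task"),
    ("tez_", "launched by a Tez task"),
    ("CLI", "command line"),
]

entry_re = re.compile(r"APPID=(\S+)\s+CALLERCONTEXT=(\S+)")

with open(RM_AUDIT_LOG) as f:
    for line in f:
        m = entry_re.search(line)
        if not m:
            continue
        app_id, caller = m.groups()
        workload = next((label for p, label in PREFIXES if caller.startswith(p)), "unknown")
        print(f"{app_id} -> {workload} ({caller})")
```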
YARN Application Timeline Service
- YARN service for fine grained application level tracing
- Enables complex metadata to be recorded as the YARN app makes progress
- Allows retrieval of this timeline data based on filters
- Can be used to drive limited online analytics and extensive post-hoc analysis
Lineage Tracking using YARN Timeline
Timeline:8188/ws/v1/timeline/TEZ_DAG_ID/dag_1464484887407_0013_1
dagContext: {
  callerId: "root_20160529021115_006f8007-5840-4c64-9970-c1b506f68db2",
  callerType: "HIVE_QUERY_ID",
  context: "HIVE",
  description: "select user, count(visit_id) as visits from users group by user order by visits"
}
Timeline:8188/ws/v1/timeline/HIVE_QUERY_ID/root_20160529021115_006f8007-5840-4c64-9970-c1b506f68db2
hiveContext: {
  callerId: "workflow_abcd",
  callerType: "OOZIE_ID",
  context: "OOZIE",
  description: "Daily ETL Summary Job"
}
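A sketch of walking that caller chain programmatically through the Timeline REST API (8188 is the default timeline server port). The field names (callerId, callerType inside the context objects) follow the simplified example above; a real Timeline entity may nest this information differently, e.g. under its otherinfo section.

```python
# Sketch: walk the lineage chain exposed by the YARN Timeline Service,
# following callerType/callerId links (Tez DAG -> Hive query -> Oozie workflow).
# Field names follow the simplified example above; real Timeline entities may
# nest this information differently (e.g. under "otherinfo").
import requests

TIMELINE = "http://timeline.example.com:8188"   # placeholder host, default port

def get_entity(entity_type, entity_id):
    resp = requests.get(f"{TIMELINE}/ws/v1/timeline/{entity_type}/{entity_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()

def find_caller(entity):
    """Return (callerType, callerId) if the entity records who created it."""
    for value in entity.values():
        if isinstance(value, dict) and "callerId" in value and "callerType" in value:
            return value["callerType"], value["callerId"]
    return None

# Start from a Tez DAG and climb until nothing points further up.
entity_type, entity_id = "TEZ_DAG_ID", "dag_1464484887407_0013_1"
while True:
    caller = find_caller(get_entity(entity_type, entity_id))
    if caller is None:
        break
    print(f"{entity_type}/{entity_id} was started by {caller[0]} {caller[1]}")
    entity_type, entity_id = caller
```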
Ambari Log Search
Agenda
- Metrics and Monitoring
- Logging and Correlation
- Tracing and Analysis
Tracing and Analysis
- Use Big Data methods to solve Big Data problems
- Apache Zeppelin as an analytical tool
- Hive/Tez/YARN notebook for analysis (see the sketch below)
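A hedged sketch of the "Big Data methods for Big Data problems" idea: a Zeppelin %pyspark paragraph that loads the HDFS audit logs into Spark and aggregates NameNode operations per callerContext. The HDFS path and parsed field names are assumptions; spark is the SparkSession that Zeppelin provides to the paragraph.

```python
%pyspark
# Sketch of a Zeppelin paragraph: use Spark to aggregate HDFS audit logs at scale.
# The HDFS path and the parsed field names are illustrative assumptions;
# `spark` is the SparkSession that Zeppelin injects into the paragraph.
from pyspark.sql import functions as F

audit = spark.read.text("hdfs:///var/log/hdfs-audit/*")   # placeholder location

parsed = audit.select(
    F.regexp_extract("value", r"cmd=(\S+)", 1).alias("cmd"),
    F.regexp_extract("value", r"callerContext=(\S+)", 1).alias("caller"),
)

# Which jobs hammer the NameNode the hardest, and with which operations?
(parsed.where(F.col("caller") != "")
       .groupBy("caller", "cmd")
       .count()
       .orderBy(F.desc("count"))
       .show(20, truncate=False))
```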
Zeppelin for Ad-hoc Analytics
YARN Analyzer
Tez Analyzer
Thank You Hortonworks: Powering the Future of Data