Putting Lipstick on Apache Pig Big Data Gurus Meetup August 14, 2013.

Motivation: Data should be accessible, easy to discover, and easy to process for everyone.

Big Data Users at Netflix
- Users: Analysts, Engineers
- Desires: Self Service, Easy, Rich Toolset, Rich APIs
- A Single Platform / Data Architecture that Serves Both Groups

Netflix Data Warehouse - Storage
- S3 is the source of truth
  - Decouples storage from processing
  - Persistent data; multiple / transient Hadoop clusters
- Data sources
  - Event data from cloud services via Ursula/Honu
  - Dimension data from Cassandra via Aegisthus
- ~100 billion events processed / day
- Petabytes of data persisted and available to queries on S3

Netflix Data Platform - Processing
- Long-running clusters: SLA and ad hoc
- Supplemental nightly bonus clusters for high-priority ETL jobs
- 2,000+ instances in aggregate across the clusters

Netflix Hadoop Platform as a Service S3

Netflix Data Platform – Primitive Service Layer
- Primitive, decoupled services
- Building blocks for more complicated tools/services/apps
- Serves 1000s of MapReduce jobs / day
- 100+ jobs concurrently

Netflix Data Platform – Tools
- Sting (Ad hoc Visualization)
- Looper (Backloading)
- Forklift (Data Movement)
- Ignite (A/B Test Analytics)
- Lipstick (Workflow Visualization)
- Spock (Data Auditing)
- Heavily utilize services in the primitive layer
- Follow the same design philosophy as primitive apps:
  - RESTful API
  - Decoupled JavaScript interfaces

Pig and Hive at Netflix
- Hive
  - Ad hoc queries
  - Lightweight aggregation
- Pig
  - Complex dataflows / ETL
  - Data movement "glue" between complex operations

What is Pig?
- A data flow language
- Simple to learn
  - Very few reserved words
  - Comparable to a SQL logical query plan
- Easy to extend and optimize
- Extensible via UDFs written in multiple languages
  - Java, Python, Ruby, Groovy, JavaScript

Sample Pig Script* (Word Count)
    input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
    -- Extract words from each line and put them into a pig bag
    -- datatype, then flatten the bag to get one word on each row
    words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- filter out any words that are just white spaces
    filtered_words = FILTER words BY word MATCHES '\\w+';
    -- create a group for each word
    word_groups = GROUP filtered_words BY word;
    -- count the entries in each group
    word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
    -- order the records by count
    ordered_word_count = ORDER word_count BY count DESC;
    STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
*

A Typical Pig Script

Pig…
- Data flows are easy and flexible to express in text
  - Facilitates code reuse via UDFs and macros
  - Allows logical grouping of operations vs. grouping by order of execution
  - But errors are easy to make and overlook
- Scripts can quickly get complicated
- Visualization quickly draws attention to:
  - Common errors
  - Execution order / logical flow
  - Optimization opportunities

Lipstick
- Generates graphical representations of Pig data flows
- Compatible with Apache Pig 0.11+
- Has been used to monitor more than 25,000 Pig jobs at Netflix

Lipstick

Overall Job Progress

Logical Plan Overall Job Progress

Logical Operator (reduce side) Logical Operator (map side) Map/Reduce Job Intermediate Row Count Records Loaded

Hadoop Counters

Lipstick for Fast Development
During development:
- Keep track of data flow
- Spot common errors
  - Omitted (hanging) operators
  - Data type issues
- Easily estimate and optimize complexity
  - Number of MR jobs generated
  - Map-only vs. full Map/Reduce jobs
  - Opportunities to rejigger logic to:
    - Combine multiple jobs into a single job
    - Manipulate execution order to achieve better parallelism (e.g. less blocking)

Lipstick for Job Monitoring
During execution:
- Graphically monitor execution status from a single console
- Spot optimization opportunities
  - Map vs. reduce side joins
  - Data skew
  - Better parallelism settings

Lipstick for Support
- Empowers users to support themselves
  - Better operational visibility
    - What is my script currently doing?
    - Why is my script slow?
  - Examine intermediate output of jobs
  - All execution information in one place
- Facilitates communication between infrastructure / support teams and end users
  - Lipstick link contains all information needed to provide support

Lipstick Architecture
- Pig 0.11+ with lipstick-console.jar
- Lipstick Server (RESTful Grails app)
- JavaScript Client (frontend GUI)
- RDS Persistence

Lipstick Architecture - Console
- Implements PigProgressNotificationListener interface
- Listens for:
  1. New statements to be registered (unoptimized plan)
  2. Script launched event (optimized, physical, M/R plan)
  3. MR job completion/failure event
  4. Heartbeat progress (during execution)
- Pig Plans and Progress -> Lipstick objects
- Communicates with Lipstick Server
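The four events the console listens for can be sketched as a small listener class. This is a Python analogue written for illustration only: the real console implements Pig's Java PigProgressNotificationListener interface, and the class name, method names, and payload shapes below are hypothetical, not taken from Pig or Lipstick.

```python
class ConsoleListener:
    """Hypothetical listener; method names are illustrative, not Pig's API."""

    def __init__(self, send):
        # `send` is any callable that forwards a payload to the server.
        self.send = send

    def on_statement_registered(self, unoptimized_plan):
        # 1. New statements registered: only the unoptimized plan is known.
        self.send({"event": "plan", "unoptimized": unoptimized_plan})

    def on_script_launched(self, optimized, physical, mapreduce):
        # 2. Script launched: the compiled plans become available.
        self.send({"event": "launched", "optimized": optimized,
                   "physical": physical, "mapreduce": mapreduce})

    def on_job_finished(self, job_id, succeeded):
        # 3. A MapReduce job completed or failed.
        self.send({"event": "job_finished", "job": job_id,
                   "status": "completed" if succeeded else "failed"})

    def on_heartbeat(self, job_id, percent):
        # 4. Periodic progress update during execution.
        self.send({"event": "progress", "job": job_id, "percent": percent})


sent = []  # stands in for the connection to the server
listener = ConsoleListener(sent.append)
listener.on_statement_registered(["LOAD", "FILTER"])
listener.on_heartbeat("job_1", 40)
listener.on_job_finished("job_1", succeeded=True)
print([message["event"] for message in sent])
```

Passing the transport in as a callable keeps the listener free of networking details, which matches the decoupled design the slides describe.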

Pig Compilation Plans
1. Pig Script
2. Unoptimized Logical Plan (~1:1 logical operator / line of Pig)
3. Optimized Logical Plan
4. Physical Plan
5. MapReduce Plan (grouping of Physical Operators into map or reduce jobs)
Lipstick associates Logical Operators with MapReduce jobs by inferring relationships between Logical and Physical Operations.
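The slides do not say how the inference works, so the following is only one plausible sketch. It assumes two hypothetical inputs: a mapping from each physical operator to the logical alias it was compiled from, and a MapReduce plan listing the physical operators on each job's map and reduce side. The operator ids and the function name are made up for the example.

```python
def assign_jobs(physical_to_logical, mr_jobs):
    """Return {logical_alias: {(job_id, side), ...}} for every alias found."""
    assignment = {}
    for job_id, sides in mr_jobs.items():
        for side in ("map", "reduce"):
            for physical_op in sides.get(side, []):
                alias = physical_to_logical.get(physical_op)
                if alias is None:
                    continue  # operator with no known logical origin
                assignment.setdefault(alias, set()).add((job_id, side))
    return assignment


# Hypothetical plans for the word count script.
physical_to_logical = {
    "POLoad_1": "input_lines",
    "POForEach_2": "words",
    "POFilter_3": "filtered_words",
    "POLocalRearrange_4": "word_groups",
    "POPackage_5": "word_groups",
    "POForEach_6": "word_count",
}
mr_jobs = {
    "job_1": {
        "map": ["POLoad_1", "POForEach_2", "POFilter_3", "POLocalRearrange_4"],
        "reduce": ["POPackage_5", "POForEach_6"],
    },
}
assignment = assign_jobs(physical_to_logical, mr_jobs)
print(sorted(assignment["word_groups"]))
```

A set per alias is used because one logical operator, such as a GROUP, can be compiled into physical operators on both the map and the reduce side of a job.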

Lipstick Architecture - Server
- Simple REST interface
- It's a Grails app!
- Pig client posts plans and puts progress
- JavaScript client gets plans and progress
- Searches jobs by job name and user name
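The server's role can be sketched as an in-memory store with one method per operation listed above. This is a toy model under stated assumptions: the real server is a Grails app, and the class name, method names, and record fields here are hypothetical. No actual routes or payload formats are implied.

```python
class PlanStore:
    """Toy in-memory stand-in for the server's storage and search."""

    def __init__(self):
        self._jobs = {}

    def post_plan(self, job_id, job_name, user, plan):
        # Pig client "posts plans": create the job record.
        self._jobs[job_id] = {"job_name": job_name, "user": user,
                              "plan": plan, "progress": 0}

    def put_progress(self, job_id, progress):
        # Pig client "puts progress": update an existing record.
        self._jobs[job_id]["progress"] = progress

    def get_job(self, job_id):
        # JavaScript client "gets plans and progress".
        return self._jobs[job_id]

    def search(self, job_name=None, user=None):
        # Search by job name (substring match) and/or exact user name.
        return sorted(
            job_id for job_id, job in self._jobs.items()
            if (job_name is None or job_name in job["job_name"])
            and (user is None or job["user"] == user)
        )


store = PlanStore()
store.post_plan("job_1", "nightly-etl", "alice", {"ops": ["LOAD", "STORE"]})
store.post_plan("job_2", "adhoc-report", "bob", {"ops": ["LOAD"]})
store.put_progress("job_1", 75)
print(store.get_job("job_1")["progress"], store.search(user="bob"))
```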

Lipstick Architecture – JS Client
- Displays and annotates graphs with status / progress
- Completely decoupled from Server
- Event-based design
- Periodically polls Server for job progress
- Usability is a key focus
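The polling behaviour can be sketched as a loop that fetches status at an interval and raises an update event only when something changed. The actual client is JavaScript; this Python version is an analogue, and the function name, the "progress" field, and the stopping rule at 100 are assumptions made for the example.

```python
import time


def poll_progress(fetch, job_id, interval=0.0, max_polls=100, on_update=print):
    """Poll `fetch(job_id)` until progress reaches 100 or polls run out."""
    last = None
    for _ in range(max_polls):
        status = fetch(job_id)
        if status != last:
            on_update(status)  # event-style callback, fired only on change
            last = status
        if status["progress"] >= 100:
            return status
        time.sleep(interval)
    return last


# Fake server responses so the sketch runs without a network.
responses = iter([{"progress": 10}, {"progress": 10},
                  {"progress": 60}, {"progress": 100}])
seen = []
final = poll_progress(lambda job_id: next(responses), "job_1",
                      on_update=seen.append)
print(len(seen), final)
```

Because the client only calls `fetch` and reacts through `on_update`, it needs no knowledge of how the server stores or computes progress.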

Solving Problems with Lipstick - Common Problem #1: My job has stalled.

Unoptimized/Optimized Logical Plan Toggle; Dangling Operator

Common Problem #2: I didn’t get the data I was expecting.

Common Problem #3: I don’t understand why my job failed.

Failed Job (light red background); Successful Job (light blue background)

Future of Lipstick
- Annotate common errors and inefficiencies on the graph
  - Skew / map side join opportunities / scalar issues
  - E.g. warnings / error dashboard
- Provide better details of runtime performance
  - Timings annotated on graph
  - Min / median / max mapper and reducer times
  - Map / reduce completion over time
- Search through execution history
  - Examine trends in runtime and data volumes
  - History of failure / success
- Search jobs for commonalities
  - Common datasets loaded / saved
  - Better grasp data lineage
  - Common uses of UDFs and macros
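As a small illustration of the min / median / max timings mentioned above, the summary below is computed from a list of per-reducer run times. The numbers are invented, and the max-to-median ratio is just one simple, assumed heuristic for flagging skew, not something the slides specify.

```python
import statistics


def summarize_times(seconds):
    """Return min / median / max for a list of task durations."""
    return {"min": min(seconds),
            "median": statistics.median(seconds),
            "max": max(seconds)}


def skew_ratio(summary):
    # A slowest task far above the median suggests skewed input.
    return summary["max"] / summary["median"]


reducer_seconds = [30, 32, 31, 29, 240]  # made-up sample: one straggler
summary = summarize_times(reducer_seconds)
print(summary, round(skew_ratio(summary), 2))
```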

Lipstick on Hive Honey?

A closer look…

Wrapping up
- Lipstick is part of Netflix OSS.
- Clone it on GitHub at
- Check out the quickstart guide
  - Started#1-quick-start
  - Get started playing with Lipstick in under 5 minutes!
- We happily welcome your feedback and contributions!

Jeff Magnusson: | Thank you! Jobs: Netflix OSS: Tech Blog: