Pig, Making Hadoop Easy
Alan F. Gates, Yahoo!

Who Am I?
- Pig committer
- Hadoop PMC member
- An architect in the Yahoo! grid team
- Or, as one coworker put it, “the lipstick on the Pig”

Who are you?
- How many have used Pig?
- How many have looked at it and have a basic understanding of it?

Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 to 25. The steps:
- Load Users
- Load Pages
- Filter by age
- Join on name
- Group on url
- Count clicks
- Order by clicks
- Take top 5

In Map Reduce

In Pig Latin
    Users = load 'users' as (name, age);
    Fltrd = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Jnd = join Fltrd by name, Pages by user;
    Grpd = group Jnd by url;
    Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
    Srtd = order Smmd by clicks desc;
    Top5 = limit Srtd 5;
    store Top5 into 'top5sites';
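To make the dataflow in the script above concrete, here is a minimal plain-Python sketch of the same steps on in-memory lists. It is not how Pig executes the script; the sample data and the function name `top_pages` are made up for illustration, and it assumes user names are unique so the join can be done with a set lookup.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' files.
users = [("alice", 22), ("bob", 31), ("carol", 18), ("dave", 25)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("bob", "a.com"),
    ("carol", "a.com"), ("dave", "c.com"), ("dave", "b.com"),
    ("alice", "a.com"),
]


def top_pages(users, pages, low=18, high=25, n=5):
    """Return the n most visited urls among users aged low..high (inclusive)."""
    # Fltrd: keep only the names of users in the age range.
    # Assumption: names are unique, so a set is enough for the join key.
    in_range = {name for name, age in users if low <= age <= high}
    # Jnd, Grpd, Smmd: join page views to the filtered users on name,
    # then count clicks per url.
    clicks = Counter(url for user, url in pages if user in in_range)
    # Srtd, Top5: order by clicks descending and keep the first n.
    return clicks.most_common(n)


print(top_pages(users, pages))
```

Each comment names the Pig Latin alias that the Python step corresponds to, which is the main point of the comparison: the script reads as the same sequence of steps listed on the motivation slide.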

Performance
[Chart not preserved in the transcript; surviving axis labels: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

Why not SQL?
[Diagram not preserved in the transcript; surviving labels:]
- Data Collection
- Data Factory: Pig (pipelines, iterative processing, research)
- Data Warehouse: Hive (BI tools, analysis)

Pig Highlights
- User defined functions (UDFs) can be written for column transformation (TOUPPER) or aggregation (SUM)
- UDFs can be written to take advantage of the combiner
- Four join implementations built in: hash, fragment-replicate, merge, skewed
- Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
- Order by provides total ordering across reducers in a balanced way
- Writing load and store functions is easy once an InputFormat and OutputFormat exist
- Piggybank, a collection of user contributed UDFs
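Of the four join implementations listed above, fragment-replicate is the easiest to picture: the small input is copied in full to every map task, and each task joins it against its own fragment of the large input, so no reduce phase is needed. The sketch below shows that idea in plain Python for a single fragment. It is an illustration under stated assumptions, not Pig's implementation; the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def fragment_replicate_join(large_fragment, small_relation):
    """Join one fragment of the large input against the replicated small input.

    Both arguments are iterables of (key, value) pairs.
    """
    # Build an in-memory hash table from the small relation. In the real
    # technique every map task holds its own copy of this table.
    lookup = defaultdict(list)
    for key, value in small_relation:
        lookup[key].append(value)

    # Stream the fragment once and emit one output row per match.
    joined = []
    for key, value in large_fragment:
        for small_value in lookup.get(key, []):
            joined.append((key, value, small_value))
    return joined


# Hypothetical data: a small users relation and one fragment of page views.
small_users = [("alice", 22), ("carol", 18)]
fragment = [("alice", "a.com"), ("bob", "a.com"), ("carol", "b.com")]
print(fragment_replicate_join(fragment, small_users))
```

The design trade-off is memory for shuffle: the small relation has to fit in memory on every task, and in exchange the large relation is never sorted or moved across the network.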

Who uses Pig for What?
- 70% of production jobs at Yahoo (tens of thousands per day)
- Also used by Twitter, LinkedIn, eBay, AOL, …
- Used to:
  - Process web logs
  - Build user behavior models
  - Process images
  - Build maps of the web
  - Do research on raw data sets

Accessing Pig
- Submit a script directly
- Grunt, the Pig shell
- PigServer Java class, a JDBC-like interface

Components
- User machine: Pig resides on the user machine
- Hadoop cluster: the job executes on the cluster
- No need to install anything extra on your Hadoop cluster.

How It Works
Pig Latin:
    A = LOAD 'myfile' AS (x, y, z);
    B = FILTER A BY x > 0;
    C = GROUP B BY x;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO 'output';
pig.jar:
- parses
- checks
- optimizes
- plans execution
- submits jar to Hadoop
- monitors job progress
Execution Plan:
- Map: Filter, Count
- Combine/Reduce: Sum
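The execution plan on this slide splits the count into two parts: the map side filters and counts, and the combine/reduce side sums the partial counts. The following plain-Python sketch mimics that split on two hypothetical input splits. The function names and data are illustrative assumptions, and real Hadoop scheduling, shuffling, and sorting are left out.

```python
from collections import defaultdict


def map_phase(split):
    """Map: apply the filter (x > 0) and emit (x, 1) for each surviving row."""
    return [(x, 1) for (x, y, z) in split if x > 0]


def sum_counts(pairs):
    """Combine/Reduce: sum the counts seen for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())


def run_job(splits):
    """Run map plus combine per split, then reduce over all partial sums."""
    # The combiner runs on each map task's output, so only one partial
    # count per key leaves each split.
    partials = [sum_counts(map_phase(split)) for split in splits]
    # The reducer sums the partial counts from every split.
    return sum_counts(pair for partial in partials for pair in partial)


# Hypothetical input: two splits of (x, y, z) rows from 'myfile'.
splits = [
    [(1, "a", "b"), (2, "c", "d"), (-1, "e", "f")],
    [(1, "g", "h"), (1, "i", "j"), (0, "k", "l")],
]
print(run_job(splits))
```

The same summing function serves as both combiner and reducer here because addition is associative, which is what lets the count be split across phases in the first place.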

Demo
s3://hadoopday/pig_tutorial
Demo script:
- Show group query first, talk about:
  - load and schema (none, declared, from data)
  - data types
  - data sources need not be from HDFS or even from files
  - parallel clause, how parallelism is determined on maps
  - how grouping works in Pig Latin
- So far what I’ve shown you is a simple join/group query. Now let’s look at something less straightforward in SQL.
- Often people want to group data a number of different ways. Look at multiquery script:
  - Note how there’s a branch in the logic now
- Often want to operate on the result of each record in a previous statement. Look at top5 query:
  - Note nested foreach allows you to operate on each record coming out of group by
  - Since the result of group by is a bag in each record, can apply operators to that bag
  - Currently support order, distinct, filter, limit
  - Use of flatten at the end
  - Use of positional parameters
- There will always be logic you need to write that you can’t get from Pig Latin. This is where rich support of UDFs comes in. Look at session query:
  - Note registering UDF
  - UDF now called like any other Pig builtin function (in fact Pig builtins are implemented as UDFs)
- Look at SessionAnalysis.java:
  - Class name is UDF name
  - Input to UDF is always a Tuple; this avoids the need to declare expected input, but means the UDF has to check what it gets
  - Talk about how projection of bags works
  - Talk about how EvalFunc is templatized on return type
- Also easy to write load and store functions to fit your data needs
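The demo notes describe the nested foreach pattern: group by produces one bag per record, operators such as order and limit are applied inside each bag, and flatten turns the result back into flat rows. The demo's top5 script itself is not in the transcript, so the plain-Python sketch below only illustrates the pattern; the record layout, function name, and data are assumptions.

```python
from collections import defaultdict


def top_n_per_group(records, n):
    """Return the n highest-scoring items for each group key.

    records is an iterable of (group_key, item, score) tuples.
    """
    # group by: collect one bag of (item, score) pairs per key.
    bags = defaultdict(list)
    for key, item, score in records:
        bags[key].append((item, score))

    result = []
    for key in sorted(bags):
        # nested order: sort the bag by score, highest first.
        ordered = sorted(bags[key], key=lambda pair: pair[1], reverse=True)
        # nested limit, then flatten: emit one flat row per kept item.
        for item, score in ordered[:n]:
            result.append((key, item, score))
    return result


# Hypothetical records: (user, url, clicks).
records = [
    ("u1", "a.com", 5), ("u1", "b.com", 9),
    ("u1", "c.com", 1), ("u2", "a.com", 2),
]
print(top_n_per_group(records, 2))
```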

Upcoming Features
In 0.8 (plan to branch end of August, release this fall):
- Runtime statistics collection
- UDFs in scripting languages (e.g. Python)
- Ability to specify a custom partitioner
- Adding many string and math functions as Pig supported UDFs
Post 0.8:
- Adding branches, loops, functions, and modules
- Usability: better error messages, fix ILLUSTRATE
- Improved integration with workflow systems

Learn More
- Read the online documentation: http://hadoop.apache.org/pig/
- Online tutorials:
  - From Yahoo: http://developer.yahoo.com/hadoop/tutorial/
  - From Cloudera: http://www.cloudera.com/hadoop-training
- Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
- A couple of the available Hadoop books include chapters on Pig; search at your favorite bookstore
- Join the mailing lists:
  - pig-user@hadoop.apache.org for user questions
  - pig-dev@hadoop.apache.org for developer issues
  - howldev@yahoogroups.com for Howl