Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.

Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research

Data Processing Renaissance  Internet companies swimming in data E.g. TBs/day at Yahoo!  Data analysis is “inner loop” of product innovation  Data analysts are skilled programmers

Data Warehousing …? Scale Often not scalable enough $ $ Prohibitively expensive at web scale Up to $200K/TB SQL Little control over execution method Query optimization is hard Parallel environment Little or no statistics Lots of UDFs

New Systems For Data Analysis  Map-Reduce  Apache Hadoop  Dryad...

Map-Reduce Input records k1k1 v1v1 k2k2 v2v2 k1k1 v3v3 k2k2 v4v4 k1k1 v5v5 map k1k1 v1v1 k1k1 v3v3 k1k1 v5v5 k2k2 v2v2 k2k2 v4v4 Output records reduce Just a group-by-aggregate?

The Map-Reduce Appeal Scale Scalable due to simpler design Only parallelizable operations No transactions $ $ Runs on cheap commodity hardware Procedural Control- a processing “pipe” SQL

Disadvantages 1. Extremely rigid data flow Other flows constantly hacked in Join, Union Split M M R R M M M M R R M M Chains 2. Common operations must be coded by hand Join, filter, projection, aggregates, sorting, distinct 3. Semantics hidden inside map-reduce functions Difficult to maintain, extend, and optimize

Pros And Cons ScalableCheap Control over execution Inflexible Lots of hand coding Semantics hidden Need a high-level, general data flow language

Enter Pig Latin ScalableCheap Control over execution Pig Latin Need a high-level, general data flow language

Outline Map-Reduce and the need for Pig Latin Pig Latin example Salient features Implementation

Example Data Analysis Task UserUrlTime Amycnn.com8:00 Amybbc.com10:00 Amyflickr.com10:05 Fredcnn.com12:00 Find the top 10 most visited pages in each category UrlCategoryPageRank cnn.comNews0.9 bbc.comNews0.8 flickr.comPhotos0.7 espn.comSports0.9 VisitsUrl Info

Data Flow Load Visits Group by url Foreach url generate count Foreach url generate count Load Url Info Join on url Group by category Foreach category generate top10 urls Foreach category generate top10 urls

In Pig Latin visits = load ‘/data/visits’ as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(visits); urlInfo = load ‘/data/urlInfo’ as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; // select * from visitCounts v,urlInfo u where v.url = u.url gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into ‘/data/topUrls’;

Outline Map-Reduce and the need for Pig Latin Pig Latin example Salient features Implementation

Step-by-step Procedural Control Target users are entrenched procedural programmers The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data. Jasmine Novak Engineer, Yahoo! Automatic query optimization is hard Pig Latin does not preclude optimization With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful. David Ciemiewicz Search Excellence, Yahoo!

visits = load ‘/data/visits’ as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(urlVisits); urlInfo = load ‘/data/urlInfo’ as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into ‘/data/topUrls’; Quick Start and Interoperability Operates directly over files

visits = load ‘/data/visits’ as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(urlVisits); urlInfo = load ‘/data/urlInfo’ as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into ‘/data/topUrls’; Quick Start and Interoperability Schemas optional; Can be assigned dynamically Schemas optional; Can be assigned dynamically

visits = load ‘/data/visits’ as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(urlVisits); urlInfo = load ‘/data/urlInfo’ as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into ‘/data/topUrls’; User-Code as a First-Class Citizen User-defined functions (UDFs) can be used in every construct Load, Store Group, Filter, Foreach User-defined functions (UDFs) can be used in every construct Load, Store Group, Filter, Foreach

Pig Latin has a fully-nestable data model with: – Atomic values, tuples, bags (lists), and maps More natural to programmers than flat tuples Avoids expensive joins See paper Nested Data Model yahoo, finance email news

Outline Map-Reduce and the need for Pig Latin Pig Latin example Novel features Implementation

cluster Hadoop Map-Reduce Pig SQL automatic rewrite + optimize or user Pig is open-source. http://incubator.apache.org/pig Pig is open-source. http://incubator.apache.org/pig

Compilation into Map-Reduce Load Visits Group by url Foreach url generate count Foreach url generate count Load Url Info Join on url Group by category Foreach category generate top10(urls) Foreach category generate top10(urls) Map 1 Reduce 1 Map 2 Reduce 2 Map 3 Reduce 3 Every group or join operation forms a map-reduce boundary Other operations pipelined into map and reduce phases

Usage First production release about a year ago 150+ early adopters within Yahoo! Over 25% of the Yahoo! map-reduce user base

Related Work Sawzall – Data processing language on top of map-reduce – Rigid structure of filtering followed by aggregation DryadLINQ – SQL-like language on top of Dryad Nested data models – Object-oriented databases

Future Work Optional “safe” query optimizer – Performs only high-confidence rewrites User interface – Boxes and arrows UI – Promote collaboration, sharing code fragments and UDFs Tight integration with a scripting language – Use loops, conditionals of host language

Arun Murthy Pi Song Santhosh Srinivasan Amir Youssefi Shubham Chopra Alan Gates Shravan Narayanamurthy Olga Natkovich Credits

Summary Big demand for parallel data processing – Emerging tools that do not look like SQL DBMS – Programmers like dataflow pipes over static files Hence the excitement about Map-Reduce But, Map-Reduce is too low-level and rigid Pig Latin Sweet spot between map-reduce and SQL Pig Latin Sweet spot between map-reduce and SQL

Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.

Similar presentations

Presentation on theme: "Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.

Similar presentations

Presentation on theme: "Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research."— Presentation transcript:

Similar presentations

About project

Feedback