Pig : Building High-Level Dataflows over Map-Reduce


1 Pig : Building High-Level Dataflows over Map-Reduce
Utkarsh Srivastava Research & Cloud Computing

2 Data Processing Renaissance
Internet companies are swimming in data, e.g., TBs/day at Yahoo!
Data analysis is the "inner loop" of product innovation
Data analysts are skilled programmers

3 Data Warehousing…?
Often not scalable enough
Prohibitively expensive at web scale: up to $200K/TB
Little control over execution method
Query optimization is hard: parallel environment, little or no statistics, lots of UDFs

4 New Systems For Data Analysis
Map-Reduce: Apache Hadoop, Dryad, …

5 Just a group-by-aggregate?
[Diagram: map tasks turn input records into (key, value) pairs, e.g., (k1, v1), (k2, v2); reduce tasks collect all values for each key, e.g., k1 → {v1, v3, v5}, and produce output records]
Just a group-by-aggregate?

6 The Map-Reduce Appeal
Scalable due to simpler design: only parallelizable operations, no transactions
Runs on cheap commodity hardware
Procedural control: a processing "pipe"

7 Disadvantages
1. Extremely rigid data flow; other flows (joins, unions, splits, chains) constantly hacked in
2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions: difficult to maintain, extend, and optimize

8 Pros And Cons
Need a high-level, general data flow language

9 Enter Pig Latin
Need a high-level, general data flow language

10 Outline
Map-Reduce and the need for Pig Latin
Pig Latin
Compilation into Map-Reduce
Example Generation
Future Work

11 Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits:
User | Url        | Time
Amy  | cnn.com    | 8:00
Amy  | bbc.com    | 10:00
Amy  | flickr.com | 10:05
Fred |            | 12:00

Url Info:
Url        | Category | PageRank
cnn.com    | News     | 0.9
bbc.com    |          | 0.8
flickr.com | Photos   | 0.7
espn.com   | Sports   |

12 Data Flow
Load Visits → Group by url → Foreach url generate count
Load Url Info → Join on url
Group by category → Foreach category generate top10 urls

13 In Pig Latin
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
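The same pipeline can be sketched in plain Python over in-memory lists, which makes the semantics of each Pig Latin step explicit. The sample records and file contents below are hypothetical stand-ins for /data/visits and /data/urlInfo, not data from the talk.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for /data/visits and /data/urlInfo.
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = {"cnn.com": ("News", 0.9), "bbc.com": ("News", 0.8)}

# group visits by url; foreach group generate (url, count)
counts = defaultdict(int)
for user, url, time in visits:
    counts[url] += 1

# join visit counts with urlInfo on url, keeping (category, url, count)
joined = [(url_info[url][0], url, n) for url, n in counts.items() if url in url_info]

# group by category; foreach category generate the top-10 urls by count
by_cat = defaultdict(list)
for cat, url, n in joined:
    by_cat[cat].append((url, n))
top_urls = {cat: sorted(rows, key=lambda r: -r[1])[:10] for cat, rows in by_cat.items()}
```

Each assignment mirrors one Pig Latin statement, which is exactly the step-by-step style the next slide argues for.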

14 Step-by-step Procedural Control
Target users are entrenched procedural programmers, used to programming and scripting languages

"The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single-block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data." — Jasmine Novak, Engineer, Yahoo!

"With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it's more powerful." — David Ciemiewicz, Search Excellence, Yahoo!

Automatic query optimization is hard; Pig Latin does not preclude optimization

15 Quick Start and Interoperability
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

Operates directly over files

16 Quick Start and Interoperability
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

Schemas optional; can be assigned dynamically

17 User-Code as a First-Class Citizen
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';

User-defined functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach

18 Nested Data Model
Pig Latin has a fully nestable data model with:
Atomic values, tuples, bags (lists), and maps
More natural to programmers than flat tuples
Avoids expensive joins
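The four kinds of values compose freely. A minimal sketch in Python, with bags as lists of tuples; all the concrete values here are illustrative, not from the talk:

```python
# One record in a fully nestable data model.
record = (
    "alice",                          # atomic value
    ("yahoo.com", 0.9),               # tuple
    [("finance",), ("news",)],        # bag: a list of tuples
    {"age": 20, "zip": "10001"},      # map
)
atom, tup, bag, mp = record

# Bags can nest inside tuples again, e.g., a (key, bag) pair:
nested = [("alice", bag)]
```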

19 Nested Data Model
Decouples grouping as an independent operation

[Diagram: "group by url" turns the flat Visits tuples into one tuple per url holding the bag of that url's visits, e.g., cnn.com → {(Amy, 8:00), (Fred, 12:00)}]

Common case: aggregation on these nested sets
Power users: sophisticated UDFs, e.g., sequence analysis
Efficient implementation (see paper)

"I frankly like Pig much better than SQL in some respects (group + optional flatten works better for me; I love nested data structures)." — Ted Dunning, Chief Scientist, Veoh
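The decoupling in the diagram above can be sketched directly: GROUP produces one output tuple per key whose second field is the bag of matching inputs, with no aggregate forced on top. The visit tuples are hypothetical.

```python
from collections import defaultdict

# Hypothetical visit tuples (user, url, time).
visits = [("Amy", "cnn.com", "8:00"),
          ("Fred", "cnn.com", "12:00"),
          ("Amy", "bbc.com", "10:00")]

# GROUP BY url: key each tuple on its url field, keep whole tuples nested.
groups = defaultdict(list)
for t in visits:
    groups[t[1]].append(t)

g_visits = [(url, bag) for url, bag in groups.items()]
```

Aggregation (the common case) or an arbitrary UDF (the power-user case) can then run over each nested bag independently.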

20 CoGroup

results:
query  | url      | rank
Lakers | nba.com  | 1
Lakers | espn.com | 2
Kings  | nhl.com  |

revenue:
query  | adSlot | amount
Lakers | top    | 50
Lakers | side   | 20
Kings  |        | 30
Kings  |        | 10

[Diagram: cogroup on query yields one tuple per query holding both bags side by side]

Cross-product of the 2 bags would give natural join
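A minimal COGROUP sketch in Python: one output tuple per key, holding the two input bags side by side, with the per-key cross-product recovering the equijoin. The rows mirror the slide's inputs, with made-up stand-ins for the values lost in transcription.

```python
from collections import defaultdict

# Rows mirroring the slide's two inputs (some values are invented stand-ins).
results = [("Lakers", "nba.com", 1), ("Lakers", "espn.com", 2),
           ("Kings", "nhl.com", 1)]
revenue = [("Lakers", "top", 50), ("Lakers", "side", 20),
           ("Kings", "top", 30), ("Kings", "side", 10)]

# COGROUP on query: one tuple per key with both bags kept separate.
cogrouped = defaultdict(lambda: ([], []))
for r in results:
    cogrouped[r[0]][0].append(r)
for r in revenue:
    cogrouped[r[0]][1].append(r)

# Per-key cross-product of the two bags gives the natural (equi)join.
join = [a + b[1:] for left, right in cogrouped.values() for a in left for b in right]
```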

21 Outline
Map-Reduce and the need for Pig Latin
Pig Latin
Compilation into Map-Reduce
Example Generation
Future Work

22 Implementation
[Diagram: a SQL or Pig program goes through automatic rewrite + optimization, then runs on a Hadoop or other Map-Reduce cluster]
Pig is open-source
~50% of Hadoop jobs at Yahoo! are Pig
1000s of jobs per day

23 Compilation into Map-Reduce
Every group or join operation forms a map-reduce boundary
Other operations are pipelined into the map and reduce phases

[Diagram: Load Visits → Group by url (Map1/Reduce1) → Foreach url generate count; Load Url Info → Join on url (Map2/Reduce2) → Group by category (Map3/Reduce3) → Foreach category generate top10(urls)]
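A sketch of what one such boundary compiles to, assuming a toy single-machine runtime: the map phase emits (key, record) pairs, the shuffle sorts by key, and the reduce phase sees each key with its whole bag; the pipelined FOREACH (here, count) runs inside the reducer. Data and function names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, key_fn):
    # Map side of a "group by": tag each record with its grouping key.
    return [(key_fn(r), r) for r in records]

def shuffle_and_reduce(pairs, reduce_fn):
    # Shuffle: sort by key; reduce: call reduce_fn once per key with its bag.
    pairs = sorted(pairs, key=itemgetter(0))
    return [reduce_fn(key, [rec for _, rec in grp])
            for key, grp in groupby(pairs, key=itemgetter(0))]

visits = [("Amy", "cnn.com"), ("Fred", "cnn.com"), ("Amy", "bbc.com")]
# "group visits by url" + pipelined "foreach ... generate url, count(...)".
counts = shuffle_and_reduce(map_phase(visits, lambda r: r[1]),
                            lambda url, bag: (url, len(bag)))
```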

24 Optimizations: Using the Combiner
[Diagram: the same map/reduce flow, with partial aggregation happening on the map side before the shuffle]
Can pre-process data on the map side to reduce data shipped
Algebraic aggregation functions
Distinct processing
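For an algebraic aggregate like COUNT, each map task can pre-aggregate its own records with a combiner, so only one (key, partial-count) pair per distinct key leaves the map side instead of one pair per input record. A minimal sketch with illustrative data:

```python
from collections import Counter

def map_with_combiner(records):
    # Combiner: collapse this map task's records into per-key partial counts.
    return list(Counter(url for _user, url in records).items())

def reduce_partials(partials_per_map):
    # Reducer: sum the partial counts shipped from each map task.
    total = Counter()
    for partials in partials_per_map:
        for key, n in partials:
            total[key] += n
    return dict(total)

map1 = map_with_combiner([("Amy", "cnn.com"), ("Fred", "cnn.com")])
map2 = map_with_combiner([("Amy", "cnn.com"), ("Amy", "bbc.com")])
counts = reduce_partials([map1, map2])
```

The same trick works for any algebraic function, i.e., one whose partial results can be merged (SUM, MIN, AVG as (sum, count) pairs, and duplicate elimination for DISTINCT).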

25 Optimizations: Skew Join
Default join method is symmetric hash join; the cross-product for each key is carried out on one reducer
Problem if too many values share the same key
Skew join samples the data to find frequent values, then further splits them among reducers
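A rough sketch of the sampling step, under illustrative assumptions (sample size, threshold, and reducer count are made up): call a key "hot" if it exceeds a frequency threshold in a random sample, then spread a hot key's rows across several reducers, replicating the other input's rows for that key to each of them.

```python
import random
from collections import Counter

def hot_keys(rows, sample_size=200, threshold=0.3, seed=0):
    # Sample keys and flag the ones that dominate the sample.
    rng = random.Random(seed)
    sample = [rng.choice(rows)[0] for _ in range(sample_size)]
    freq = Counter(sample)
    return {k for k, n in freq.items() if n / sample_size >= threshold}

def reducer_for(key, hot, n_reducers=4, salt=0):
    # Hot keys are salted across reducers; normal keys hash as usual.
    if key in hot:
        return (hash(key) + salt) % n_reducers
    return hash(key) % n_reducers

# 90% of rows share one key: a classic skewed input.
rows = [("Lakers", i) for i in range(90)] + [("Kings", i) for i in range(10)]
hot = hot_keys(rows)
```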

26 Optimizations: Fragment-Replicate Join
Symmetric hash join repartitions both inputs
If size(data set 1) >> size(data set 2):
Just replicate data set 2 to all partitions of data set 1
Translates to a map-only job
Open data set 2 as a "side file"
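A minimal sketch of the map-only variant, with invented table contents: each map task loads a complete copy of the small input (the "side file") into memory and joins its own partition of the big input against it, so no shuffle or reduce phase is needed.

```python
def load_side_file(small_rows):
    # In-memory index over the replicated small input.
    return {url: (category, prank) for url, category, prank in small_rows}

def map_task(big_partition, side):
    # Join each big-side row against the replica; purely map-side work.
    return [(user, url) + side[url]
            for user, url in big_partition if url in side]

side = load_side_file([("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8)])
out = map_task([("Amy", "cnn.com"), ("Fred", "espn.com")], side)
```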

27 Optimizations: Merge Join
Exploit data sets are already sorted. Again, a map-only job Open other data set as “side file”
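When both inputs arrive sorted on the join key, a map task can stream through them in lockstep. A sketch assuming unique keys on the right side (as with a url → pagerank table); the data is illustrative.

```python
def merge_join(left, right):
    # Both inputs sorted on field 0; advance whichever side is behind.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            out.append(left[i] + right[j][1:])
            i += 1                      # right keys unique: keep j in place
    return out

left = [("bbc.com", "Amy"), ("cnn.com", "Amy"), ("cnn.com", "Fred")]
right = [("bbc.com", 0.8), ("cnn.com", 0.9)]
joined = merge_join(left, right)
```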

28 Optimizations: Multiple Data Flows
[Diagram: Map1 loads Users and filters bots, then groups by state and by demographic; Reduce1 applies UDFs to each grouping and stores into 'bystate' and 'bydemo']

29 Optimizations: Multiple Data Flows
[Diagram: same plan with explicit operators — the map phase adds a Split after filtering bots so both groupings share one job, and the reduce phase demultiplexes, applies UDFs, and stores into 'bystate' and 'bydemo']
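A sketch of the Split/Demultiplex idea with hypothetical user records: one scan applies the shared bot filter once, the Split tags each record into both flows, and the reduce side routes on that tag so both stores come out of a single job.

```python
from collections import defaultdict

# Hypothetical (user_id, state, demographic, is_bot) records.
users = [("u1", "CA", "18-25", False),
         ("u2", "CA", "26-35", False),
         ("u3", "NY", "18-25", True)]

def map_phase(rows):
    pairs = []
    for uid, state, demo, is_bot in rows:
        if is_bot:                                 # shared "Filter bots" step
            continue
        pairs.append((("bystate", state), uid))    # Split: flow 1
        pairs.append((("bydemo", demo), uid))      # Split: flow 2
    return pairs

def reduce_phase(pairs):
    # Demultiplex on the flow tag, then group within each flow.
    stores = {"bystate": defaultdict(list), "bydemo": defaultdict(list)}
    for (flow, key), uid in pairs:
        stores[flow][key].append(uid)
    return stores

stores = reduce_phase(map_phase(users))
```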

30 Other Optimizations
Carry data as byte arrays as far as possible
Use a binary comparator for sorting
"Stream" data through external executables

31 Performance

32 Outline
Map-Reduce and the need for Pig Latin
Pig Latin
Compilation into Map-Reduce
Example Generation
Future Work

33 Example Dataflow Program
Find users that tend to visit high-pagerank pages

LOAD (user, url)
FOREACH user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5

34 Iterative Process
LOAD (user, url)
FOREACH user, canonicalize(url)   — Bug in UDF canonicalize?
LOAD (url, pagerank)
JOIN on url                       — Joining on right attribute?
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5                — Everything being filtered out?
No output ☹

35 How to do test runs?
Run with real data: too inefficient (TBs of data)
Create smaller data sets, e.g., by sampling: empty results due to joins [Chaudhuri et al. 99] and selective filters
Biased sampling for joins: indexes not always present

36 Examples to Illustrate Program
[Diagram: example tuples flowing through the dataflow of slide 33 — LOAD yields (user, url) tuples (Amy, …), (Amy, …), (Fred, …); LOAD yields (url, pagerank) tuples (…, 0.9), (…, 0.3), (…, 0.4); JOIN yields (Amy, …, 0.9), (Amy, …, 0.3), (Fred, …, 0.4); GROUP nests these by user; AVG(pagerank) yields (Amy, 0.6), (Fred, 0.4); FILTER avgPR > 0.5 keeps (Amy, 0.6)]
This is what someone would write by hand, or when teaching a class
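The hand-propagation above can be replayed in a few lines of Python. All urls and pageranks are made up (the originals were lost in transcription), and canonicalize() is a made-up stand-in that just lowercases.

```python
def canonicalize(url):
    # Hypothetical UDF: real canonicalization would do far more.
    return url.lower()

# Illustrative example tuples for the two LOADs.
users = [("Amy", "CNN.com"), ("Amy", "Snails.com"), ("Fred", "Snails.com")]
ranks = {"cnn.com": 0.9, "snails.com": 0.3}

canon = [(u, canonicalize(url)) for u, url in users]            # FOREACH
joined = [(u, ranks[url]) for u, url in canon if url in ranks]  # JOIN on url
grouped = {}                                                    # GROUP on user
for u, pr in joined:
    grouped.setdefault(u, []).append(pr)
avg_pr = {u: sum(v) / len(v) for u, v in grouped.items()}       # AVG(pagerank)
result = {u: a for u, a in avg_pr.items() if a > 0.5}           # FILTER
```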

37 Value Addition From Examples
Examples can be used for:
Debugging
Understanding a program written by someone else
Learning a new operator, or language

38 Good Examples: Consistency
0. Consistency: output example = operator applied on input example
[Diagram: the dataflow of slide 33 annotated with input and output example tuples]

39 Good Examples: Realism
1. Realism: example tuples should resemble the real data
[Diagram: the dataflow annotated with example tuples]

40 Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator, e.g., FILTER both passing (Amy, 0.6) and rejecting (Fred, 0.4)
[Diagram: the dataflow annotated with example tuples]

41 Good Examples: Conciseness
3. Conciseness: use as few example tuples as possible
[Diagram: the dataflow annotated with example tuples]

42 Implementation Status
Available as the ILLUSTRATE command in the open-source release of Pig
Available as an Eclipse plugin (PigPen)
See the SIGMOD 2009 paper for the algorithm and experiments

43 Related Work
Sawzall: data processing language on top of map-reduce; rigid structure of filtering followed by aggregation
Hive: SQL-like language on top of Map-Reduce
DryadLINQ: SQL-like language on top of Dryad
Nested data models: object-oriented databases

44 Future / In-Progress Tasks
Columnar-storage layer
Metadata repository
Profiling and performance optimizations
Tight integration with a scripting language: use loops, conditionals, functions of the host language
Memory management
Project suggestions at:

45 Credits

46 Summary
Big demand for parallel data processing
Emerging tools do not look like SQL DBMSs
Programmers like dataflow pipes over static files
Hence the excitement about Map-Reduce
But Map-Reduce is too low-level and rigid
Pig Latin: a sweet spot between map-reduce and SQL

