1
Pig: Building High-Level Dataflows over Map-Reduce
Utkarsh Srivastava
Research & Cloud Computing
2
Data Processing Renaissance
Internet companies swimming in data
– E.g. TBs/day at Yahoo!
Data analysis is “inner loop” of product innovation
Data analysts are skilled programmers
3
Data Warehousing …?
Scale
– Often not scalable enough
Cost ($)
– Prohibitively expensive at web scale
– Up to $200K/TB
SQL
– Little control over execution method
– Query optimization is hard: parallel environment, little or no statistics, lots of UDFs
4
New Systems For Data Analysis
– Map-Reduce
– Apache Hadoop
– Dryad
– ...
5
Map-Reduce
[Figure: input records (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5) pass through map, are grouped by key into k1: v1, v3, v5 and k2: v2, v4, and pass through reduce to give the output records]
Just a group-by-aggregate?
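The "group-by-aggregate" reading of the figure can be sketched in a few lines of Python. This is a single-process imitation for illustration only, not how Hadoop executes a job; the helper name `map_reduce` and the numeric values standing in for v1..v5 are assumptions.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Imitate one map-reduce job in a single process (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: each input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle": collect values by key
    # Reduce phase: one call per distinct key, over all of that key's values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical numeric stand-ins for v1..v5 in the figure.
records = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4), ("k1", 5)]

# Identity map plus a summing reduce is exactly a group-by-aggregate.
result = map_reduce(records, lambda r: [r], lambda key, values: sum(values))
```

With an identity map and an aggregating reduce, the job behaves like SQL's GROUP BY with SUM; the flexibility of map-reduce comes from putting arbitrary code in the two functions.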
6
The Map-Reduce Appeal
Scale
– Scalable due to simpler design
– Only parallelizable operations
– No transactions
Cost ($)
– Runs on cheap commodity hardware
Versus SQL
– Procedural control: a processing “pipe”
7
Disadvantages
1. Extremely rigid data flow
– Other flows constantly hacked in: join, union, split, chains
[Figure: diagrams of map (M) and reduce (R) stages for join/union, split, and chains]
2. Common operations must be coded by hand
– Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
– Difficult to maintain, extend, and optimize
8
Pros And Cons
Pros
– Scalable
– Cheap
– Control over execution
Cons
– Inflexible
– Lots of hand coding
– Semantics hidden
Need a high-level, general data flow language
9
Enter Pig Latin
– Scalable
– Cheap
– Control over execution
Need a high-level, general data flow language: Pig Latin
10
Outline
– Map-Reduce and the need for Pig Latin
– Pig Latin
– Compilation into Map-Reduce
– Example Generation
– Future Work
11
Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits
User   Url          Time
Amy    cnn.com      8:00
Amy    bbc.com      10:00
Amy    flickr.com   10:05
Fred   cnn.com      12:00

Url Info
Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9
12
Data Flow
Load Visits → Group by url → Foreach url generate count
Load Url Info
Join on url → Group by category → Foreach category generate top10 urls
13
In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
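To make the semantics of each step concrete, here is a small in-memory Python sketch of the same data flow, run on the sample Visits and Url Info rows from the earlier slide. It only mirrors what the program computes; it says nothing about how Pig executes it. The function name `top_urls_by_category` and the tie-breaking rule (by url) are assumptions.

```python
from collections import Counter, defaultdict

# Sample rows from the "Example Data Analysis Task" slide.
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]


def top_urls_by_category(visits, url_info, n=10):
    # group visits by url + foreach ... count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url (inner join)
    category_of = {url: category for url, category, _rank in url_info}
    by_category = defaultdict(list)
    for url, visit_count in visit_counts.items():
        if url in category_of:
            # group by category
            by_category[category_of[url]].append((url, visit_count))
    # foreach category generate the top n urls (ties broken by url: an assumption)
    return {
        category: sorted(pairs, key=lambda p: (-p[1], p[0]))[:n]
        for category, pairs in by_category.items()
    }


top = top_urls_by_category(visits, url_info)
```

Because the join is an inner join, a category whose pages were never visited (Sports here) does not appear in the result.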
14
Step-by-step Procedural Control
Target users are entrenched procedural programmers
“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
– Jasmine Novak, Engineer, Yahoo!
“With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.”
– David Ciemiewicz, Search Excellence, Yahoo!
Automatic query optimization is hard
Pig Latin does not preclude optimization
15
Quick Start and Interoperability
Operates directly over files
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
16
Quick Start and Interoperability
Schemas optional; can be assigned dynamically
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
17
User-Code as a First-Class Citizen
User-defined functions (UDFs) can be used in every construct
– Load, Store
– Group, Filter, Foreach
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
18
Nested Data Model
Pig Latin has a fully-nestable data model with:
– Atomic values, tuples, bags (lists), and maps
More natural to programmers than flat tuples
Avoids expensive joins
[Figure: nested example with the labels yahoo, finance, email, news]
19
Nested Data Model
Decouples grouping as an independent operation

Visits
User   Url       Time
Amy    cnn.com   8:00
Amy    bbc.com   10:00
Amy    bbc.com   10:05
Fred   cnn.com   12:00

group by url

group     Visits
cnn.com   (Amy, cnn.com, 8:00), (Fred, cnn.com, 12:00)
bbc.com   (Amy, bbc.com, 10:00), (Amy, bbc.com, 10:05)

Common case: aggregation on these nested sets
Power users: sophisticated UDFs, e.g., sequence analysis
Efficient implementation (see paper)
“I frankly like pig much better than SQL in some respects (group + optional flatten works better for me, I love nested data structures).”
– Ted Dunning, Chief Scientist, Veoh
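A minimal Python sketch of grouping as a standalone operation: the output is one (key, bag) pair per url, and aggregation is a separate, later step over the nested bag. The helper name `group_by` is an assumption; the rows are the ones in the table above.

```python
def group_by(rows, key_fn):
    """Return a list of (key, bag) pairs; each bag holds the full input rows."""
    groups = {}
    for row in rows:
        groups.setdefault(key_fn(row), []).append(row)
    return list(groups.items())


visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "bbc.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]

# Grouping alone: nothing is aggregated yet, the bags keep every field.
grouped = group_by(visits, lambda row: row[1])

# The common case: a later, separate aggregation over each nested bag.
counts = [(url, len(bag)) for url, bag in grouped]
```

Keeping the bag intact is what lets a user-defined function (for example a sequence analysis over the visit times) run on each group instead of a built-in aggregate.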
20
CoGroup

results
query    url        rank
Lakers   nba.com    1
Lakers   espn.com   2
Kings    nhl.com    1
Kings    nba.com    2

revenue
query    adSlot   amount
Lakers   top      50
Lakers   side     20
Kings    top      30
Kings    side     10

cogroup output
group    results                                    revenue
Lakers   (Lakers, nba.com, 1), (Lakers, espn.com, 2)   (Lakers, top, 50), (Lakers, side, 20)
Kings    (Kings, nhl.com, 1), (Kings, nba.com, 2)      (Kings, top, 30), (Kings, side, 10)

Cross-product of the 2 bags would give natural join
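The relationship between cogroup and join can be sketched in Python: cogroup builds one pair of bags per key, and taking the cross-product of the two bags for each key yields the join. The helper name `cogroup` is an assumption; the rows are the ones in the tables above.

```python
from itertools import product


def cogroup(left, right, left_key, right_key):
    """Map each key to a pair of bags: (rows from left, rows from right)."""
    groups = {}
    for row in left:
        groups.setdefault(left_key(row), ([], []))[0].append(row)
    for row in right:
        groups.setdefault(right_key(row), ([], []))[1].append(row)
    return groups


results = [
    ("Lakers", "nba.com", 1),
    ("Lakers", "espn.com", 2),
    ("Kings", "nhl.com", 1),
    ("Kings", "nba.com", 2),
]
revenue = [
    ("Lakers", "top", 50),
    ("Lakers", "side", 20),
    ("Kings", "top", 30),
    ("Kings", "side", 10),
]

grouped = cogroup(results, revenue, lambda r: r[0], lambda r: r[0])

# Flattening: the cross-product of the two bags of each group gives the join.
joined = [l + r for bags in grouped.values() for l, r in product(*bags)]
```

Stopping at the cogroup step leaves the two bags separate, so a UDF can process them together without first materializing every joined combination.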
21
Outline
– Map-Reduce and the need for Pig Latin
– Pig Latin
– Compilation into Map-Reduce
– Example Generation
– Future Work
22
Implementation
[Figure: software stack with the labels user, SQL, Pig, Hadoop Map-Reduce, cluster; annotation: automatic rewrite + optimize]
Pig is open-source: http://hadoop.apache.org/pig
~50% of Hadoop jobs at Yahoo! are Pig
1000s of jobs per day
23
Compilation into Map-Reduce
Load Visits → Group by url → Foreach url generate count
Load Url Info
Join on url → Group by category → Foreach category generate top10(urls)
Stages: Map 1, Reduce 1, Map 2, Reduce 2, Map 3, Reduce 3
Every group or join operation forms a map-reduce boundary
Other operations pipelined into map and reduce phases
24
Optimizations: Using the Combiner
[Figure: input records (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5) pass through map, are grouped by key into k1: v1, v3, v5 and k2: v2, v4, and pass through reduce to give the output records]
Can pre-process data on the map-side to reduce data shipped
– Algebraic aggregation functions
– Distinct processing
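As an illustration of why algebraic aggregation functions suit the combiner, here is a Python sketch that computes an average in three stages (initial, combine, final), with the combine step run once per map task before anything is shipped. The function names, the two-map-task split, and the numeric values are assumptions for the example; this is not Pig's implementation.

```python
def initial(value):
    # Turn one raw value into a partial aggregate: (sum, count).
    return (value, 1)


def combine(partials):
    # Merge partial aggregates; usable on the map side and the reduce side.
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))


def final(partial):
    # Turn the fully merged partial aggregate into the answer.
    return partial[0] / partial[1]


def average_with_combiner(map_outputs):
    shipped = []
    for output in map_outputs:  # one iteration per map task
        local = {}
        for key, value in output:
            local.setdefault(key, []).append(initial(value))
        # Combiner: at most one record per key leaves each map task.
        shipped.extend((key, combine(parts)) for key, parts in local.items())
    by_key = {}
    for key, partial in shipped:  # shuffle
        by_key.setdefault(key, []).append(partial)
    averages = {key: final(combine(parts)) for key, parts in by_key.items()}
    return averages, len(shipped)


# Hypothetical outputs of two map tasks (five records in total).
map_outputs = [
    [("k1", 1.0), ("k2", 2.0), ("k1", 3.0)],
    [("k2", 4.0), ("k1", 5.0)],
]
averages, shipped_count = average_with_combiner(map_outputs)
```

The average cannot be combined directly (an average of averages is wrong in general), which is why the partial state is a (sum, count) pair; that decomposition is what makes the function algebraic.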
25
Optimizations: Skew Join
Default join method is symmetric hash join

group    results                                    revenue
Lakers   (Lakers, nba.com, 1), (Lakers, espn.com, 2)   (Lakers, top, 50), (Lakers, side, 20)
Kings    (Kings, nhl.com, 1), (Kings, nba.com, 2)      (Kings, top, 30), (Kings, side, 10)

Cross product for each key carried out on 1 reducer
Problem if too many values with same key
Skew join samples data to find frequent values
Further splits them among reducers
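The idea of sampling for frequent keys and splitting them among reducers can be sketched as follows. This is a simplified, single-process illustration under stated assumptions, not Pig's algorithm: rows of a hot key from one input are spread round-robin, the matching rows of the other input are copied to every reducer, and `stable_hash`, `find_hot_keys`, `skew_join`, and the threshold are all hypothetical names and choices.

```python
import zlib
from collections import Counter
from itertools import count


def stable_hash(key):
    # Deterministic stand-in for the partitioner's hash function.
    return zlib.crc32(key.encode("utf-8"))


def find_hot_keys(sample_keys, threshold):
    # Keys that appear at least `threshold` times in the sample are "hot".
    return {key for key, n in Counter(sample_keys).items() if n >= threshold}


def skew_join(left, right, hot_keys, num_reducers):
    reducers = [([], []) for _ in range(num_reducers)]
    round_robin = count()
    for row in left:
        if row[0] in hot_keys:
            target = next(round_robin) % num_reducers  # spread the hot key
        else:
            target = stable_hash(row[0]) % num_reducers
        reducers[target][0].append(row)
    for row in right:
        if row[0] in hot_keys:
            targets = range(num_reducers)  # copy to every reducer
        else:
            targets = [stable_hash(row[0]) % num_reducers]
        for target in targets:
            reducers[target][1].append(row)
    # Each reducer joins only the rows it received.
    joined = [l + r[1:] for ls, rs in reducers for l in ls for r in rs if l[0] == r[0]]
    loads = [len(ls) for ls, _ in reducers]
    return joined, loads


left = [("Lakers", i) for i in range(6)] + [("Kings", 0)]
right = [("Lakers", "top"), ("Kings", "side")]
hot = find_hot_keys([row[0] for row in left], threshold=3)
joined, loads = skew_join(left, right, hot, num_reducers=3)
```

The result is the same as an ordinary join, but no single reducer has to process all six "Lakers" rows.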
26
Optimizations: Fragment-Replicate Join
Symmetric-hash join repartitions both inputs
If size(data set 1) >> size(data set 2)
– Just replicate data set 2 to all partitions of data set 1
Translates to map-only job
– Open data set 2 as “side file”
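A minimal Python sketch of the fragment-replicate idea: the small data set is loaded into an in-memory lookup table that every map task can read, so the large data set is joined partition by partition with no shuffle or reduce phase. The function name and the sample rows (keyed on url in the first field) are assumptions.

```python
def fragment_replicate_join(large_partitions, small):
    """Join on the first field; `small` is assumed to fit in memory."""
    # The "side file": every map task builds or reads this same lookup table.
    lookup = {}
    for row in small:
        lookup.setdefault(row[0], []).append(row)
    output = []
    for partition in large_partitions:  # each iteration stands for one map task
        for row in partition:
            for match in lookup.get(row[0], []):
                output.append(row + match[1:])
    return output


# Hypothetical data: data set 1 split into two partitions, data set 2 small.
large_partitions = [
    [("cnn.com", "Amy"), ("bbc.com", "Amy")],
    [("flickr.com", "Amy"), ("cnn.com", "Fred")],
]
small = [
    ("cnn.com", "News"),
    ("bbc.com", "News"),
    ("flickr.com", "Photos"),
    ("espn.com", "Sports"),
]
joined = fragment_replicate_join(large_partitions, small)
```

The trade-off is memory: the approach only works while the replicated data set fits alongside each map task.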
27
Optimizations: Merge Join
Exploit the fact that the data sets are already sorted
Again, a map-only job
– Open other data set as “side file”
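For reference, the classic sort-merge join that this optimization relies on can be sketched in Python as below. Both inputs are assumed to be sorted on the join key (the first field); the function name and sample rows are assumptions, and the sketch ignores how a real map task would seek into the side file.

```python
def merge_join(left, right):
    """Join two lists that are both already sorted on their first field."""
    output = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == left_key:
                for match in right[j:j_end]:
                    output.append(left[i] + match[1:])
                i += 1
            j = j_end
    return output


left = [("bbc.com", "Amy"), ("cnn.com", "Amy"), ("cnn.com", "Fred"), ("flickr.com", "Amy")]
right = [("bbc.com", "News"), ("cnn.com", "News"), ("espn.com", "Sports"), ("flickr.com", "Photos")]
joined = merge_join(left, right)
```

Each input is read once, front to back, which is why no repartitioning (and hence no reduce phase) is needed.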
28
Optimizations: Multiple Data Flows
Load Users → Filter bots, then two branches:
– Group by state → Apply udfs → Store into ‘bystate’
– Group by demographic → Apply udfs → Store into ‘bydemo’
Stages: Map 1, Reduce 1
29
Optimizations: Multiple Data Flows
Load Users → Filter bots → Split into two branches:
– Group by state → Apply udfs → Store into ‘bystate’
– Group by demographic → Apply udfs → Store into ‘bydemo’
Demultiplex
Stages: Map 1, Reduce 1
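One way to picture split and demultiplex sharing a single map-reduce job is the Python sketch below: the map side scans and filters the input once and emits each record under a key tagged with its branch, and the reduce side routes each group to the right output by that tag. The record fields (`is_bot`, `state`, `demographic`), the function name, and the use of a simple count in place of "apply udfs" are assumptions for illustration.

```python
from collections import defaultdict


def run_shared_job(users):
    shuffled = defaultdict(list)
    # Map 1: a single scan and a single bot filter serve both branches.
    for user in users:
        if user["is_bot"]:
            continue
        # Split: emit the record once per branch, tagging the key.
        shuffled[("bystate", user["state"])].append(user)
        shuffled[("bydemo", user["demographic"])].append(user)
    # Reduce 1: demultiplex on the tag and write to the matching output.
    outputs = {"bystate": {}, "bydemo": {}}
    for (branch, key), rows in shuffled.items():
        outputs[branch][key] = len(rows)  # stand-in for "apply udfs"
    return outputs


# Hypothetical user records.
users = [
    {"name": "a", "state": "CA", "demographic": "18-25", "is_bot": False},
    {"name": "b", "state": "CA", "demographic": "26-35", "is_bot": False},
    {"name": "c", "state": "NY", "demographic": "18-25", "is_bot": False},
    {"name": "d", "state": "NY", "demographic": "18-25", "is_bot": True},
]
outputs = run_shared_job(users)
```

Without sharing, the load and the bot filter would run once per branch in two separate jobs.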
30
Other Optimizations
– Carry data as byte arrays as far as possible
– Using binary comparator for sorting
– “Streaming” data through external executables
31
Performance
32
Outline
– Map-Reduce and the need for Pig Latin
– Pig Latin
– Compilation into Map-Reduce
– Example Generation
– Future Work
33
Example Dataflow Program
Find users that tend to visit high-pagerank pages
LOAD (user, url) → FOREACH user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url → GROUP on user → FOREACH user, AVG(pagerank) → FILTER avgPR > 0.5
34
Iterative Process
LOAD (user, url) → FOREACH user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url → GROUP on user → FOREACH user, AVG(pagerank) → FILTER avgPR > 0.5
Result: no output. Possible causes:
– Bug in UDF canonicalize?
– Joining on right attribute?
– Everything being filtered out?
35
How to do test runs?
Run with real data
– Too inefficient (TBs of data)
Create smaller data sets (e.g., by sampling)
– Empty results due to joins [Chaudhuri et al. 99] and selective filters
Biased sampling for joins
– Indexes not always present
36
Examples to Illustrate Program
LOAD (user, url)
– (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
FOREACH user, canonicalize(url)
– (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
LOAD (url, pagerank)
– (www.cnn.com, 0.9), (www.frogs.com, 0.3), (www.snails.com, 0.4)
JOIN on url
– (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3), (Fred, www.snails.com, 0.4)
GROUP on user
– (Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)}), (Fred, {(Fred, www.snails.com, 0.4)})
FOREACH user, AVG(pagerank)
– (Amy, 0.6), (Fred, 0.4)
FILTER avgPR > 0.5
– (Amy, 0.6)
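The example records can be reproduced with a short Python sketch of the same pipeline. The slides do not define `canonicalize`; the version below (strip a leading "http://", drop any path, add "www." if missing) is an assumption chosen so that it yields the canonical urls shown in the example.

```python
def canonicalize(url):
    # Hypothetical UDF: strip scheme and path, and ensure a "www." prefix.
    if url.startswith("http://"):
        url = url[len("http://"):]
    host = url.split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host


visits = [
    ("Amy", "cnn.com"),
    ("Amy", "http://www.frogs.com"),
    ("Fred", "www.snails.com/index.html"),
]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

# FOREACH user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url
rank_of = dict(pageranks)
joined = [(user, url, rank_of[url]) for user, url in canonical if url in rank_of]

# GROUP on user
grouped = {}
for row in joined:
    grouped.setdefault(row[0], []).append(row)

# FOREACH user, AVG(pagerank)
averages = {user: sum(r[2] for r in rows) / len(rows) for user, rows in grouped.items()}

# FILTER avgPR > 0.5
filtered = [(user, avg) for user, avg in averages.items() if avg > 0.5]
```

If `canonicalize` were buggy (say, it left "cnn.com" unchanged), the join would silently drop that row, which is the kind of failure the per-operator examples are meant to expose.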
37
Value Addition From Examples
Examples can be used for
– Debugging
– Understanding a program written by someone else
– Learning a new operator, or language
38
Good Examples: Consistency
LOAD (user, url) → FOREACH user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url → GROUP on user → FOREACH user, AVG(pagerank) → FILTER avgPR > 0.5
0. Consistency: output example = operator applied on input example
– Input to canonicalize: (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
– Output: (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
39
Good Examples: Realism
LOAD (user, url) → FOREACH user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url → GROUP on user → FOREACH user, AVG(pagerank) → FILTER avgPR > 0.5
1. Realism
– Input to canonicalize: (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
– Output: (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
40
Good Examples: Completeness
LOAD (user, url) → FOREACH user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url → GROUP on user → FOREACH user, AVG(pagerank) → FILTER avgPR > 0.5
2. Completeness: demonstrate the salient properties of each operator, e.g., FILTER
– Input to FILTER: (Amy, 0.6), (Fred, 0.4)
– Output: (Amy, 0.6)
41
Good Examples: Conciseness
LOAD (user, url) → FOREACH user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url → GROUP on user → FOREACH user, AVG(pagerank) → FILTER avgPR > 0.5
3. Conciseness
– Input to canonicalize: (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
– Output: (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
42
Implementation Status
– Available as ILLUSTRATE command in open-source release of Pig
– Available as Eclipse Plugin (PigPen)
– See SIGMOD09 paper for algorithm and experiments
43
Related Work
Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation
Hive
– SQL-like language on top of Map-Reduce
DryadLINQ
– SQL-like language on top of Dryad
Nested data models
– Object-oriented databases
44
Future / In-Progress Tasks
Columnar-storage layer
Metadata repository
Profiling and performance optimizations
Tight integration with a scripting language
– Use loops, conditionals, functions of host language
Memory management
Project suggestions at: http://wiki.apache.org/pig/ProposedProjects
45
Credits
46
Summary
Big demand for parallel data processing
– Emerging tools that do not look like SQL DBMS
– Programmers like dataflow pipes over static files
Hence the excitement about Map-Reduce
But Map-Reduce is too low-level and rigid
Pig Latin: sweet spot between map-reduce and SQL