Published by Karlee Ivy. Modified over 10 years ago.
Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience
Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava
- 2 - What is Pig?
Procedural dataflow language (Pig Latin) for Map-Reduce
Provides standard relational transforms (group, join, filter, sort, etc.)
Schemas are optional; if used, can be part of data or specified at run time
User defined functions are first class citizens of the language
- 3 - An Example
You have a dataset urls: (url, category, pagerank)
You want to know the top 10 urls per category, as measured by pagerank, for sufficiently large categories:

    urls = load 'dataset' as (url, category, pagerank);
    grps = group urls by category;
    bgrps = filter grps by COUNT(urls) > 1000000;
    rslt = foreach bgrps generate group, top10(urls);
    store rslt into 'myOutput';
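To make the dataflow on this slide concrete, here is a minimal Python sketch of the same group / filter / foreach steps on in-memory tuples. It is not how Pig executes the script. The function name `top_urls_per_category`, the parameters `min_count` and `k`, and the use of `heapq.nlargest` as a stand-in for the `top10` UDF are all assumptions made for illustration.

```python
from collections import defaultdict
import heapq


def top_urls_per_category(urls, min_count=1_000_000, k=10):
    """Model of the slide's script; urls is an iterable of (url, category, pagerank)."""
    # grps = group urls by category;
    groups = defaultdict(list)
    for url, category, pagerank in urls:
        groups[category].append((url, category, pagerank))

    # bgrps = filter grps by COUNT(urls) > min_count;
    big_groups = {c: rows for c, rows in groups.items() if len(rows) > min_count}

    # rslt = foreach bgrps generate group, top10(urls);
    # heapq.nlargest is an assumed stand-in for the top10 UDF.
    return {
        c: heapq.nlargest(k, rows, key=lambda row: row[2])
        for c, rows in big_groups.items()
    }


if __name__ == "__main__":
    # Tiny hypothetical dataset; thresholds are lowered so the filter has an effect.
    sample = [
        ("a.com", "news", 0.9),
        ("b.com", "news", 0.5),
        ("c.com", "news", 0.7),
        ("d.com", "blogs", 0.8),
    ]
    print(top_urls_per_category(sample, min_count=2, k=2))
```

Each Python statement corresponds to one Pig Latin statement, which is the "plug together pipes" style the deck describes.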
- 4 - Pig Latin = Sweet Spot between SQL & Map-Reduce
Programming style
    SQL: Large blocks of declarative constraints
    Pig, Map-Reduce: "Plug together pipes"
Built-in data manipulations
    SQL, Pig: Group-by, Sort, Join, Filter, Aggregate, Top-k, etc.
    Map-Reduce: Group-by, Sort
Execution model
    SQL: Fancy; trust the query optimizer
    Pig, Map-Reduce: Simple, transparent
Opportunities for automatic optimization
    SQL, Pig: Many
    Map-Reduce: Few (logic buried in map() and reduce())
Data schema
    SQL: Must be known at table creation
    Pig, Map-Reduce: Not required, may be defined at runtime
- 5 - Building Pig
Type System and Type Inference
Compilation to Map-Reduce Jobs
Plan Execution
Streaming; supporting user provided executables
Performance Measurements
Project Experience
- 6 - Map Reduce Overview
- 7 - From Pig Latin to Map Reduce
Script (A = load ...; B = filter ...; C = group ...; D = foreach ...)
  -> Parser -> Logical Plan
  -> Semantic Checks -> Logical Plan
  -> Logical Optimizer -> Logical Plan
  -> Logical to Physical Translator -> Physical Plan
  -> Physical to MR Translator -> Map-Reduce Plan
  -> MapReduce Launcher -> Jar to Hadoop
Logical Plan ≈ relational algebra
Logical Optimizer: standard plan optimizations
Physical Plan = physical operators to be executed
Map-Reduce Plan = physical operators broken into Map, Combine, and Reduce stages
- 8 - Pig Latin to Logical Plan
Pig Latin:

    A = load 'users' as (user, age);
    B = load 'pageviews' as (user, url);
    C = filter A by age < 18;
    D = join C by user, B by user;
    E = group D by url;
    F = foreach E generate group, CalcScore(url);
    store F into 'scored_urls';

Logical Plan:
    load users -> filter
    load pageviews
    (filter, load pageviews) -> join -> group -> foreach -> store
- 9 - Group

    D = join C by user, B by user;
    E = group D by url;
    F = foreach E generate group, CalcScore(url);

Output of join:
    (tim, 17, yahoo.com)
    (tim, 17, ebay.com)
    (joe, 15, yahoo.com)
Output of group:
    (yahoo.com, {(tim, 17), (joe, 15)})
    (ebay.com, {(tim, 17)})
Output of foreach:
    (yahoo.com, 0.95)
    (ebay.com, 0.90)
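The group and foreach steps on this slide can be modelled in a few lines of Python. This is only a sketch of the semantics, under stated assumptions: `group_by` and `distinct_visitors` are hypothetical names, and `distinct_visitors` is a made-up stand-in for `CalcScore`, whose real definition the slides do not give, so the scores 0.95 and 0.90 are not reproduced. Unlike the abbreviated bags drawn on the slide, the bags here keep the full joined tuples.

```python
from collections import defaultdict


def group_by(rows, key_index):
    """Model of `group D by url`: map each key to the bag of tuples sharing it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def distinct_visitors(bag):
    """Hypothetical stand-in for the CalcScore UDF: counts distinct users in a bag."""
    return len({user for user, _age, _url in bag})


# Output of the join, as shown on the slide.
D = [
    ("tim", 17, "yahoo.com"),
    ("tim", 17, "ebay.com"),
    ("joe", 15, "yahoo.com"),
]

# E = group D by url;
E = group_by(D, key_index=2)

# F = foreach E generate group, <UDF>(bag);
F = [(url, distinct_visitors(bag)) for url, bag in E.items()]

if __name__ == "__main__":
    print(E)
    print(F)
```

The point of the sketch is that group produces one (key, bag) pair per distinct url, and foreach then applies a function to each bag as a whole.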
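- 10 - Join
A join is executed as a cogroup followed by a foreach:
    load users -> filter
    load pageviews
    (filter, load pageviews) -> cogroup -> foreach
Input from users (after filter):
    (tim, 17)
    (joe, 15)
    (bob, 11)
Input from pageviews:
    (tim, yahoo.com)
    (tim, ebay.com)
    (joe, yahoo.com)
Output of cogroup:
    (tim, {(17)}, {(yahoo.com), (ebay.com)})
    (joe, {(15)}, {(yahoo.com)})
    (bob, {(11)}, {})
Output of foreach:
    (tim, 17, yahoo.com)
    (tim, 17, ebay.com)
    (joe, 15, yahoo.com)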
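The cogroup-then-foreach decomposition shown on this slide can be sketched in Python as follows. The helper names `cogroup` and `join_via_cogroup` are illustrative, not Pig APIs, and the sketch assumes the join key is the first field of every tuple. It also drops the duplicated key column so the output matches the three-field tuples on the slide.

```python
from itertools import product


def cogroup(left, right):
    """Collect, per key, one bag of tuples from each input (key = first field)."""
    grouped = {}
    for row in left:
        grouped.setdefault(row[0], ([], []))[0].append(row)
    for row in right:
        grouped.setdefault(row[0], ([], []))[1].append(row)
    return grouped


def join_via_cogroup(left, right):
    """Inner join: cogroup, then flatten the cross product of the two bags."""
    joined = []
    for _key, (left_bag, right_bag) in cogroup(left, right).items():
        # An empty bag on either side yields no output rows, so
        # unmatched keys (bob below) drop out, as in an inner join.
        for l_row, r_row in product(left_bag, right_bag):
            joined.append(l_row + r_row[1:])  # drop the repeated key column
    return joined


users = [("tim", 17), ("joe", 15), ("bob", 11)]
pageviews = [("tim", "yahoo.com"), ("tim", "ebay.com"), ("joe", "yahoo.com")]

if __name__ == "__main__":
    print(cogroup(users, pageviews))
    print(join_via_cogroup(users, pageviews))
```

Because the cross product of a bag with an empty bag is empty, the inner-join behaviour falls out of the flatten step without any extra filtering.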
- 11 - Join Implementations
Default is symmetric hash join
Fragment-replicate for joining large and small inputs
Merge join for joining inputs sorted on join key
Skew join for handling inputs with significant skew in the join key
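As an illustration of the fragment-replicate idea, the sketch below builds an in-memory hash table from the small input and probes it while streaming over fragments of the large input, so no shuffle or reduce phase is needed. It is a simplified single-process model, not Pig's implementation. The function name, the list-of-fragments representation of map tasks, and the convention that the join key is the first field are assumptions.

```python
def fragment_replicate_join(large_fragments, small):
    """Map-side join sketch: `small` is replicated, `large_fragments` are map inputs."""
    # Build a hash table on the small input; in a real cluster every
    # map task would hold its own copy of this table in memory.
    table = {}
    for row in small:
        table.setdefault(row[0], []).append(row)

    # Each fragment models the input split of one map task.
    for fragment in large_fragments:
        for row in fragment:
            for match in table.get(row[0], []):
                yield row + match[1:]  # drop the repeated key column


users = [("tim", 17), ("joe", 15), ("bob", 11)]  # small input
pageview_fragments = [                           # large input, pre-split (hypothetical)
    [("tim", "yahoo.com")],
    [("tim", "ebay.com"), ("joe", "yahoo.com"), ("amy", "example.com")],
]

if __name__ == "__main__":
    for joined_row in fragment_replicate_join(pageview_fragments, users):
        print(joined_row)
```

The trade-off is memory: this approach only works when the replicated input fits in each task's memory, which is why the slide lists it for joining a large input with a small one.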
- 12 - Logical to Physical Plan
Logical Plan:
    load users -> filter
    load pageviews
    (filter, load pageviews) -> join -> group -> foreach -> store
Physical Plan:
    load users -> filter
    load pageviews
    join:  local rearrange -> global rearrange -> package -> foreach
    group: local rearrange -> global rearrange -> package
    foreach -> store
- 13 - Physical to Map-Reduce Plan
Physical Plan:
    load users -> filter
    load pageviews
    join:  local rearrange -> global rearrange -> package -> foreach
    group: local rearrange -> global rearrange -> package
    foreach -> store
Map-Reduce Plan:
    first job
        map:    filter -> local rearrange
        reduce: package -> foreach
    second job
        map:    local rearrange
        reduce: package -> foreach
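One way to read the three physical operators behind a group is sketched below: local rearrange tags each tuple with its key on the map side, global rearrange brings equal keys together (modelled here as a sort over all map outputs), and package assembles each key's tuples into a bag on the reduce side. This is a single-process illustration of the operator roles as this deck names them, not Pig's code; the function signatures are assumptions.

```python
from itertools import groupby


def local_rearrange(rows, key_index):
    """Map side: annotate each tuple with the key it will be grouped on."""
    return [(row[key_index], row) for row in rows]


def global_rearrange(map_outputs):
    """Shuffle model: merge all map outputs and order the pairs by key."""
    pairs = [pair for output in map_outputs for pair in output]
    return sorted(pairs, key=lambda pair: pair[0])  # stable sort keeps arrival order per key


def package(shuffled_pairs):
    """Reduce side: turn each run of equal keys into one (key, bag) tuple."""
    return [
        (key, [row for _key, row in run])
        for key, run in groupby(shuffled_pairs, key=lambda pair: pair[0])
    ]


# Two hypothetical map tasks, each holding part of the join output from the deck.
map_task_1 = local_rearrange([("tim", 17, "yahoo.com"), ("tim", 17, "ebay.com")], 2)
map_task_2 = local_rearrange([("joe", 15, "yahoo.com")], 2)

grouped = package(global_rearrange([map_task_1, map_task_2]))

if __name__ == "__main__":
    for key, bag in grouped:
        print(key, bag)
```

Seen this way, the map / reduce boundary in the plan falls where the global rearrange sits: everything before it can run in map tasks, and package plus any following foreach runs in reduce tasks.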
- 14 - Sharing Scans
load users -> filter out bots, then two branches over the same scan:
    group by state -> apply UDFs -> store into 'bystate'
    group by demographic -> apply UDFs -> store into 'bydemo'
- 15 - Multiple Group Map-Reduce Plan
map:    filter -> split -> local rearrange, local rearrange
reduce: multiplex -> package -> foreach
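The plan above can be pictured with a small Python model in which one pass over the input feeds two groupings. The map side filters, then emits each surviving tuple once per branch with the branch index folded into the key; the reduce side uses that index to route each bag to the right branch. The record layout `(name, state, demographic, is_bot)`, the branch numbering, and the use of a simple count as the per-group UDF are all assumptions for illustration.

```python
from collections import defaultdict


def map_phase(users):
    """filter -> split -> one local rearrange per branch, in a single scan."""
    for name, state, demographic, is_bot in users:
        if is_bot:  # filter out bots
            continue
        row = (name, state, demographic)
        yield (0, state), row        # branch 0: group by state
        yield (1, demographic), row  # branch 1: group by demographic


def reduce_phase(pairs):
    """multiplex -> package -> foreach, with COUNT as an assumed stand-in UDF."""
    bags = defaultdict(list)
    for key, row in pairs:  # models shuffle plus package
        bags[key].append(row)

    by_state, by_demo = {}, {}
    for (branch, key), bag in bags.items():
        # multiplex: the branch index decides which pipeline receives the bag
        target = by_state if branch == 0 else by_demo
        target[key] = len(bag)
    return by_state, by_demo


users = [  # hypothetical records
    ("ann", "CA", "18-25", False),
    ("bob", "CA", "26-35", False),
    ("bot1", "NY", "18-25", True),
    ("cat", "NY", "18-25", False),
]

by_state, by_demo = reduce_phase(map_phase(users))

if __name__ == "__main__":
    print(by_state)
    print(by_demo)
```

The input is read and filtered once even though two outputs are produced, which is the saving that sharing scans is after.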
- 16 - Performance
- 17 - Questions