Published by Marisa Kellam. Modified over 9 years ago.
1
How to make your map-reduce jobs perform as well as Pig: lessons from Pig optimizations
http://pig.apache.org
Thejas Nair, Pig team @ Yahoo!, Apache Pig PMC member
2
What is Pig?
Pig Latin, a high-level data processing language.
An engine that executes Pig Latin locally or on a Hadoop cluster.
Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/
3
Pig Latin example
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
4
Comparison with MR in Java
1/20 the lines of code
1/16 the development time
What about performance?
5
Pig Compared to Map Reduce
Faster development time
Data flow versus programming logic
Many standard data operations (e.g. join) included
Manages all the details of connecting jobs and data flow
Copes with Hadoop version change issues
6
And, You Don't Lose Power
UDFs can be used to load, evaluate, aggregate, and store data
External binaries can be invoked
Metadata is optional
Flexible data model
Nested data types
Explicit data flow programming
7
Pig performance
PigMix: Pig vs. MapReduce
8
Pig optimization principles
Compared with an RDBMS, Pig lacks accurate models for the data, the operators, and the execution environment.
Use the reliable information that is available. Trust the user's choices.
Use rules that help in most cases.
Use rules based on runtime information.
9
Logical Optimizations
Restructure the given logical dataflow graph
Apply filter, project, limit early
Merge foreach, filter statements
Operator rewrites
Diagram: Script (A = load; B = foreach; C = filter) -> Parser -> Logical Plan (A -> B -> C) -> Logical Optimizer -> Optimized Logical Plan (A -> C -> B)
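The "apply filter early" rewrite (A -> B -> C becoming A -> C -> B) can be illustrated with a minimal Python sketch. This is not Pig's optimizer; `expensive_projection`, the call counter, and the sample rows are hypothetical stand-ins. The rewrite is only valid here because the filter reads a field (`age`) that the foreach leaves unchanged.

```python
# Hypothetical in-memory stand-in for a logical plan: load -> foreach -> filter.
def expensive_projection(row, counter):
    counter["calls"] += 1  # count how often the foreach body runs
    return {"name": row["name"].upper(), "age": row["age"]}

def plan_as_written(rows, counter):
    projected = [expensive_projection(r, counter) for r in rows]  # B = foreach
    return [r for r in projected if 18 <= r["age"] <= 25]         # C = filter

def plan_optimized(rows, counter):
    kept = [r for r in rows if 18 <= r["age"] <= 25]              # filter first
    return [expensive_projection(r, counter) for r in kept]       # then foreach

users = [{"name": "fred", "age": 20}, {"name": "jane", "age": 40},
         {"name": "amy", "age": 17}, {"name": "zach", "age": 25}]
calls_written, calls_optimized = {"calls": 0}, {"calls": 0}
out_written = plan_as_written(users, calls_written)
out_optimized = plan_optimized(users, calls_optimized)
```

Both plans return the same rows, but the optimized one runs the foreach body only on rows that survive the filter.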
10
Physical Optimizations
Physical plan: sequence of MR jobs having physical operators.
Built-in rules, e.g. use of combiner
Specified in query, e.g. join type
Diagram: Optimized Logical Plan (X -> Y -> Z) -> Translator -> Phy/MR plan: M(PX-PYm) R(PYr) -> M(Z) -> Optimizer -> Optimized Phy/MR plan: M(PX-PYm) C(PYc) R(PYr) -> M(Z)
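The effect of inserting a combiner stage (the C(PYc) step) can be sketched in plain Python, assuming a COUNT-style aggregation. The function names and sample blocks are illustrative only and are not Hadoop or Pig APIs.

```python
from collections import Counter, defaultdict

def map_phase(block):
    # Emit (key, 1) per record, as a COUNT-style map would.
    return [(user, 1) for user in block]

def combine(pairs):
    # Combiner: pre-aggregate one map task's output per key before the shuffle.
    partial = Counter()
    for key, n in pairs:
        partial[key] += n
    return list(partial.items())

def reduce_phase(shuffled):
    # Reducer: sum whatever partial counts arrive for each key.
    totals = defaultdict(int)
    for key, n in shuffled:
        totals[key] += n
    return dict(totals)

blocks = [["fred", "fred", "jane"], ["fred", "jane", "jane", "jane"]]
no_combiner = [p for b in blocks for p in map_phase(b)]
with_combiner = [p for b in blocks for p in combine(map_phase(b))]
```

The reducer produces the same totals either way; the combiner only reduces the number of records that cross the shuffle, which works because counting is algebraic (partial sums can be summed again).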
11
Hash Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Users by name, Pages by user;
Diagram: Map 1 reads Pages block n and tags records (1, user); Map 2 reads Users block m and tags records (2, name). Reducer 1 receives (1, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane).
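A reduce-side hash join of this shape can be sketched as follows. It is a simplified, single-process illustration with made-up sample data, not Pig's implementation: map tasks tag each record with its source, the shuffle hash-partitions by join key, and each reducer pairs the two tagged sides.

```python
from collections import defaultdict

def map_tagged(records, tag, key_index):
    # Emit (join key, (source tag, record)) so the reducer can tell the inputs apart.
    return [(rec[key_index], (tag, rec)) for rec in records]

def shuffle(pairs, num_reducers):
    # Hash-partition by join key: all records for one key land on one reducer.
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[hash(key) % num_reducers][key].append(value)
    return reducers

def reduce_join(groups):
    # For each key, pair every Users record (tag 2) with every Pages record (tag 1).
    out = []
    for values in groups.values():
        pages_side = [rec for tag, rec in values if tag == 1]
        users_side = [rec for tag, rec in values if tag == 2]
        out.extend(u + p for u in users_side for p in pages_side)
    return out

users = [("fred", 22), ("jane", 19)]
pages = [("fred", "p1"), ("jane", "p2"), ("fred", "p3"), ("bob", "p4")]
pairs = map_tagged(pages, 1, 0) + map_tagged(users, 2, 0)
joined = sorted(row for groups in shuffle(pairs, 2) for row in reduce_join(groups))
```

Because every record for a key goes to a single reducer, one very frequent key can overload that reducer.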
12
Skew Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'skewed';
Diagram: Map 1 reads Pages block n and tags records (1, user); Map 2 reads Users block m and tags records (2, name). Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).
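The idea in the diagram, spreading a hot key's Pages records over several reducers and copying the matching Users record to each of them, can be sketched like this. It is a toy model under stated assumptions: a full count stands in for whatever sampling a real job would do, and `hot_threshold` is a hypothetical parameter.

```python
from collections import Counter, defaultdict
from itertools import count

def skewed_join(pages, users, num_reducers, hot_threshold):
    # Assumption: an exact key count replaces the sampling pass of a real job.
    freq = Counter(user for user, _ in pages)
    hot = {k for k, n in freq.items() if n > hot_threshold}
    # Each reducer holds, per key, a pair of lists: (pages records, users records).
    reducers = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    round_robin = count()
    for rec in pages:
        key = rec[0]
        # Hot keys are spread round-robin; other keys are hash-partitioned as usual.
        r = next(round_robin) % num_reducers if key in hot else hash(key) % num_reducers
        reducers[r][key][0].append(rec)
    for rec in users:
        key = rec[0]
        # The other input is replicated to every reducer that may hold the hot key.
        targets = range(num_reducers) if key in hot else [hash(key) % num_reducers]
        for r in targets:
            reducers[r][key][1].append(rec)
    joined = [p + u for red in reducers for ps, us in red.values() for p in ps for u in us]
    load = [sum(len(ps) for ps, _ in red.values()) for red in reducers]
    return joined, load

pages = [("fred", "p1"), ("fred", "p2"), ("fred", "p3"), ("fred", "p4"), ("jane", "p5")]
users = [("fred", 22), ("jane", 19)]
joined, load = skewed_join(pages, users, num_reducers=2, hot_threshold=2)
```

With the sample data, the four fred records are split two and two across the reducers instead of all landing on one.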
13
Merge Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'merge';
Diagram: Pages and Users are both sorted (aaron … zach). Map 1 reads Pages aaron … amr together with Users starting at aaron …; Map 2 reads Pages amy … barb together with Users starting at amy …
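When both inputs are already sorted on the join key, the join can be done in one forward pass over each input with no shuffle. Below is a minimal sketch of the generic sort-merge step in Python, with hypothetical sample data; it does not model how a map task locates its starting position in the second input.

```python
def merge_join(left, right):
    """Join two lists of tuples that are already sorted by their first field."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with each left row.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out

pages = [("aaron", "p1"), ("aaron", "p2"), ("amy", "p3"), ("zach", "p4")]
users = [("aaron", 30), ("amr", 41), ("amy", 25), ("barb", 33)]
merged = merge_join(pages, users)
```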
14
Replicated Join
Users = load 'users' as (name, age);
Pages = load 'pages' as (user, url);
Jnd = join Pages by user, Users by name using 'replicated';
Diagram: Map 1 reads Pages aaron … amr and Map 2 reads Pages amy … barb; each map holds a full copy of Users (aaron … zach).
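A replicated join can be sketched as a map-only job in which every map task loads the small input into an in-memory lookup table and probes it for each record of its block of the large input. The helper names and data below are illustrative assumptions, not Pig internals.

```python
from collections import defaultdict

def build_lookup(small_table):
    # Load the small input fully into memory, keyed by the join field.
    lookup = defaultdict(list)
    for rec in small_table:
        lookup[rec[0]].append(rec)
    return lookup

def map_join(pages_block, users_lookup):
    # Map-only join: probe the in-memory table for each record of the large input.
    return [p + u for p in pages_block for u in users_lookup.get(p[0], [])]

users = [("aaron", 30), ("amy", 25), ("zach", 50)]
page_blocks = [[("aaron", "p1"), ("amr", "p2")], [("amy", "p3"), ("barb", "p4")]]
lookup = build_lookup(users)  # every map task would hold its own copy
joined_blocks = [map_join(block, lookup) for block in page_blocks]
```

No reduce phase is needed, but the approach only works while the replicated input fits in each map task's memory.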
15
Group/cogroup optimizations
On sorted and 'collected' data
grp = group Users by name using 'collected';
Diagram: Pages (aaron, barney, carol, … zach); Map 1 gets aaron, barney; Map 2 gets carol, …
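If all records for a key sit next to each other inside a single map task's input, grouping can finish in the map phase without a shuffle. A minimal Python sketch under that assumption (the splits below are made up):

```python
from itertools import groupby
from operator import itemgetter

def map_side_group(split):
    # Valid only if every record for a key is contiguous within this one split.
    return [(key, list(rows)) for key, rows in groupby(split, key=itemgetter(0))]

splits = [
    [("aaron", 1), ("aaron", 2), ("barney", 3)],  # input to map 1
    [("carol", 4), ("carol", 5), ("zach", 6)],    # input to map 2
]
groups_per_map = [map_side_group(s) for s in splits]
```

If a key straddled two splits, each map would emit a partial group, which is why the input has to be laid out for this in advance.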
16
Multi-store script
A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
Diagram: A: load -> B: filter, which then branches into C1: group -> eval udf -> store into 'bydemo' and C2: group -> eval udf -> store into 'bystate'
17
Multi-Store Map-Reduce Plan
map: filter -> split -> local rearrange (one per group)
reduce: multiplex -> package -> foreach
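The plan above reads the input once and serves both group-bys from a single job. A rough Python sketch of that idea follows; the pipeline ids 0 and 1 and the sample rows are assumptions made for illustration. The map side filters once and emits one tagged key per pipeline, and the reduce side routes each record to the right aggregation by its tag.

```python
from collections import defaultdict

def map_task(rows):
    out = []
    for name, age, gender, city, state in rows:
        if name is None:                      # B = filter A by name is not null
            continue
        out.append(((0, (age, gender)), 1))   # split branch 0: key for 'bydemo'
        out.append(((1, state), 1))           # split branch 1: key for 'bystate'
    return out

def reduce_task(pairs):
    # Multiplex: the pipeline id in the key selects which aggregation gets the record.
    stores = {0: defaultdict(int), 1: defaultdict(int)}
    for (pipeline, key), n in pairs:
        stores[pipeline][key] += n
    return dict(stores[0]), dict(stores[1])

rows = [("fred", 22, "m", "sf", "CA"), ("jane", 22, "f", "la", "CA"),
        (None, 30, "m", "ny", "NY"), ("amy", 22, "f", "ny", "NY")]
map_output = map_task(rows)
bydemo, bystate = reduce_task(map_output)
```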
18
Memory Management
Use disk if large objects don't fit into memory
JVM limit > physical memory: very poor performance
Spill on memory-threshold notification from the JVM: unreliable
Pre-set limit for large bags
Custom spill logic for different bags, e.g. distinct bag
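The "pre-set limit for large bags" approach can be sketched as a container that writes its in-memory buffer to a temporary file whenever it reaches a fixed size. `SpillableBag` and `max_in_memory` are hypothetical names, and this is a toy model of the idea rather than Pig's bag implementation.

```python
import pickle
import tempfile

class SpillableBag:
    """Toy bag that spills to disk after a pre-set number of in-memory items."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.buffer = []
        self.spill_files = []  # list of (file object, item count)

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the whole buffer to a temporary file and release the memory.
        f = tempfile.TemporaryFile()
        for item in self.buffer:
            pickle.dump(item, f)
        self.spill_files.append((f, len(self.buffer)))
        self.buffer = []

    def __iter__(self):
        # Replay spilled items in order, then the items still held in memory.
        for f, n in self.spill_files:
            f.seek(0)
            for _ in range(n):
                yield pickle.load(f)
        yield from self.buffer

bag = SpillableBag(max_in_memory=3)
for i in range(8):
    bag.add(i)
contents = list(bag)
```

A fixed item count is a crude proxy for memory use; it trades accuracy for not depending on JVM-style memory notifications.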
19
Other optimizations
Aggressive use of combiner, secondary sort
Lazy deserialization in loaders
Better serialization format
Faster regex lib, compiled pattern
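Lazy deserialization can be sketched as a record that keeps its raw line and only splits it into fields the first time a field is requested. `LazyRecord` and its `parse_count` attribute are hypothetical, added here just to make the saving visible.

```python
class LazyRecord:
    """Holds a raw input line and parses it into fields only on first access."""

    def __init__(self, raw_line, delimiter="\t"):
        self._raw = raw_line
        self._delimiter = delimiter
        self._fields = None
        self.parse_count = 0  # instrumentation for the example only

    def get(self, index):
        if self._fields is None:
            # First access: pay the parsing cost once and cache the result.
            self._fields = self._raw.split(self._delimiter)
            self.parse_count += 1
        return self._fields[index]

records = [LazyRecord("fred\t22"), LazyRecord("jane\t19")]
first_age = records[0].get(1)
first_name = records[0].get(0)  # reuses the cached fields
```

Records that are dropped or passed through without any field access are never parsed at all.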
20
Future optimization work
Improve memory management
Join + group in single MR, if same keys used
Even better skew handling
Adaptive optimizations
Automated Hadoop tuning
…
21
Pig - fast and flexible
More flexibility in 0.8, 0.9
UDFs in scripting languages (Python)
MR job as relation
Relation as scalar
Turing complete Pig (0.9)
Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/
22
Further reading
Docs - http://pig.apache.org/docs/r0.7.0/
Papers and talks - http://wiki.apache.org/pig/PigTalksPapers
Training videos on vimeo.com (search 'hadoop pig')