Published by Ezra Murphy. Modified over 10 years ago.
1
Pig, Making Hadoop Easy
Alan F. Gates, Yahoo!
2
Who Am I?
Pig committer
Hadoop PMC member
An architect on the Yahoo! grid team
Or, as one coworker put it, “the lipstick on the Pig”
3
Who are you?
4
Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 to 25.
Dataflow:
  – Load Users
  – Load Pages
  – Filter by age
  – Join on name
  – Group on url
  – Count clicks
  – Order by clicks
  – Take top 5
5
In Map Reduce
6
In Pig Latin
    Users = load 'users' as (name, age);
    Fltrd = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Jnd = join Fltrd by name, Pages by user;
    Grpd = group Jnd by url;
    Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
    Srtd = order Smmd by clicks desc;
    Top5 = limit Srtd 5;
    store Top5 into 'top5sites';
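To make the semantics of this script concrete, here is a minimal in-memory sketch of the same filter, join, group, count, order, and limit steps in Python. It is not how Pig executes the script. The function name `top_pages`, its parameters, and the sample data are made up for illustration, and it assumes user names are unique.

```python
from collections import Counter

def top_pages(users, pages, lo=18, hi=25, n=5):
    """Return the n most visited URLs among users aged lo..hi as (url, clicks) pairs."""
    # "filter Users by age": keep only the names of users in the age range
    eligible = {name for name, age in users if lo <= age <= hi}
    # "join ... by name", "group ... by url", "COUNT": count page views by eligible users
    clicks = Counter(url for user, url in pages if user in eligible)
    # "order ... by clicks desc" then "limit": ties are broken by URL so the result is stable
    return sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical sample data standing in for the 'users' and 'pages' files
users = [("amy", 22), ("bob", 31), ("cat", 19)]
pages = [("amy", "a.com"), ("cat", "a.com"), ("bob", "b.com"), ("amy", "c.com")]
print(top_pages(users, pages))
```

Each Pig Latin statement corresponds to roughly one line here, which is the point of the comparison with hand-written Map Reduce: the script states the dataflow and leaves the job planning to Pig.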
7
Performance
[Chart not reproduced. Surviving axis labels: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, which appear to be Pig release numbers.]
8
Why not SQL?
Data Collection
Data Factory: Pig (pipelines, iterative processing, research)
Data Warehouse: Hive (BI tools, analysis)
9
Pig Highlights
User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM)
UDFs can be written to take advantage of the combiner
Four join implementations built in: hash, fragment-replicate, merge, skewed
Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
Order by provides total ordering across reducers in a balanced way
Writing load and store functions is easy once an InputFormat and OutputFormat exist
Piggybank, a collection of user contributed UDFs
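Of the four join implementations listed, fragment-replicate is the easiest to illustrate. The general idea is that the small relation is copied in full to every map task and held in memory, while the large relation is processed fragment by fragment, so no reduce phase is needed. The Python sketch below models that idea only. It is not Pig's code, and `fragment_replicate_join` and its parameters are hypothetical names.

```python
from collections import defaultdict

def fragment_replicate_join(large_fragments, small, large_key, small_key):
    """Join each fragment of a large relation against an in-memory copy of a small one."""
    # Build the lookup table from the small relation. On a cluster, every map
    # task would hold its own replica of this table.
    table = defaultdict(list)
    for row in small:
        table[row[small_key]].append(row)

    joined = []
    for fragment in large_fragments:  # each fragment stands in for one map task's input
        for row in fragment:
            for match in table.get(row[large_key], []):
                joined.append(row + match)
    return joined

# Hypothetical data: page views split into two fragments, and a small users relation
pages = [[("amy", "a.com"), ("bob", "b.com")], [("amy", "c.com")]]
users = [("amy", 22)]
print(fragment_replicate_join(pages, users, 0, 0))
```

The trade-off is memory: this only works when the replicated relation fits in each task's memory, which is why the other join implementations exist for other data shapes.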
10
Who uses Pig for What?
70% of production jobs at Yahoo (tens of thousands per day)
Also used by Twitter, LinkedIn, eBay, AOL, …
Used to
  – Process web logs
  – Build user behavior models
  – Process images
  – Build maps of the web
  – Do research on raw data sets
11
Accessing Pig
Submit a script directly
Grunt, the Pig shell
PigServer Java class, a JDBC-like interface
12
Components
User machine: Pig resides on user machine
Hadoop cluster: job executes on cluster
No need to install anything extra on your Hadoop cluster.
13
How It Works
Pig Latin:
    A = LOAD 'myfile' AS (x, y, z);
    B = FILTER A BY x > 0;
    C = GROUP B BY x;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO 'output';
pig.jar:
  – parses
  – checks
  – optimizes
  – plans execution
  – submits jar to Hadoop
  – monitors job progress
Execution plan:
  – Map: Filter, Count
  – Combine/Reduce: Sum
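The execution plan on this slide splits the count into a partial count on the map side and a sum on the combine/reduce side. A small Python simulation of that split, for the filter, group, and count script shown on the slide, might look like the following. The functions `map_task` and `reduce_counts` and the sample splits are illustrative assumptions, not Pig internals.

```python
from collections import Counter

def map_task(split):
    """Map side: apply the filter (x > 0), then emit a partial count per key for this split."""
    partial = Counter()
    for x, y, z in split:
        if x > 0:
            partial[x] += 1
    return partial

def reduce_counts(partials):
    """Combine/reduce side: sum the partial counts for each key."""
    total = Counter()
    for p in partials:
        total.update(p)  # Counter.update adds counts rather than replacing them
    return dict(total)

# Two hypothetical input splits of (x, y, z) rows
splits = [
    [(1, "a", "b"), (-2, "c", "d"), (1, "e", "f")],
    [(3, "g", "h"), (1, "i", "j")],
]
result = reduce_counts(map_task(s) for s in splits)
print(result)
```

Because summing partial counts gives the same answer as counting everything in one place, only one small record per key and split has to cross the network, rather than every filtered row.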
14
Demo
s3://hadoopday/pig_tutorial
15
Upcoming Features
In 0.8 (plan to branch end of August, release this fall):
  – Runtime statistics collection
  – UDFs in scripting languages (e.g. Python)
  – Ability to specify a custom partitioner
  – Adding many string and math functions as Pig supported UDFs
Post 0.8:
  – Adding branches, loops, functions, and modules
  – Usability: better error messages, fix ILLUSTRATE
  – Improved integration with workflow systems
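As a rough idea of what a UDF in a scripting language could look like, here is a sketch of a column-transformation function in Python. It assumes Pig supplies an `outputSchema` decorator to UDF scripts for declaring the return type, as later Pig releases do for Jython UDFs. The stand-in definition below exists only so the file also runs in a plain interpreter, and `url_host` is a hypothetical example function.

```python
# Stand-in for the decorator Pig is assumed to provide to UDF scripts.
# It only records the schema string and is used when running outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap

@outputSchema("host:chararray")
def url_host(url):
    """Return the lower-cased host part of a URL, or None for empty input."""
    if not url:
        return None
    rest = url.split("://", 1)[-1]        # drop the scheme if present
    return rest.split("/", 1)[0].lower()  # keep everything before the first slash

print(url_host("http://Hadoop.apache.org/pig/"))
```

The function body is ordinary Python with no Pig-specific calls, which is the appeal of the feature: short transformations can be written without a Java build step.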
16
Learn More
Read the online documentation: http://hadoop.apache.org/pig/
Online tutorials
  – From Yahoo, http://developer.yahoo.com/hadoop/tutorial/
  – From Cloudera, http://www.cloudera.com/hadoop-training
  – Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
A couple of Hadoop books that include chapters on Pig are available; search at your favorite bookstore
Join the mailing lists:
  – pig-user@hadoop.apache.org for user questions
  – pig-dev@hadoop.apache.org for developer issues
  – howldev@yahoogroups.com for Howl