Pig : Building High-Level Dataflows over Map-Reduce


1 Pig : Building High-Level Dataflows over Map-Reduce

2 Data Processing Renaissance
Internet companies swimming in data, e.g. TBs/day at Yahoo!
Data analysis is "inner loop" of product innovation
Data analysts are skilled programmers

3 Data Warehousing …?
Scale: often not scalable enough; prohibitively expensive at web scale (up to $200K/TB)
Little control over execution method: query optimization is hard in a parallel environment, with little or no statistics and lots of UDFs

4 New Systems For Data Analysis
Map-Reduce (Apache Hadoop)
Dryad
. . .

5 Map-Reduce
(Figure: input records with keys such as k1 and k2 flow through parallel map tasks, are grouped by key, and reduce tasks produce the output records)
Just a group-by-aggregate?

6 The Map-Reduce Appeal
Scale: scalable due to simpler design; only parallelizable operations, no transactions
Cheap: runs on cheap commodity hardware
Procedural control: a processing "pipe"

7 Disadvantages
1. Extremely rigid data flow (Map then Reduce); other flows (Join, Union, Split, Chains) constantly hacked in
2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions, making them difficult to maintain, extend, and optimize

8 Pros And Cons
Pros: Scalable; Cheap; Control over execution
Cons: Inflexible; Lots of hand coding; Semantics hidden
Need a high-level, general data flow language

9 Enter Pig Latin
Need a high-level, general data flow language that keeps the pros: Scalable, Cheap, Control over execution
Pig Latin

10 Outline
Map-Reduce and the need for Pig Latin
Pig Latin
Compilation into Map-Reduce
Example Generation
Future Work

11 Pig Latin Example 1
Suppose we have a table urls: (url, category, pagerank)
A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:

SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6

12 Equivalent Pig Latin program
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

13 Data Flow
Filter good_urls by pagerank > 0.2
Group by category
Filter category by count > 10^6
Foreach category generate avg. pagerank

14

15 Pig
Pig consists of a scripting language known as Pig Latin and the Pig Latin compiler. Pig Latin is a high-level scripting language used to write code that analyzes data; the compiler converts that code into equivalent MapReduce code. It is easier to write code in Pig than to program directly in MapReduce, and Pig has an optimizer that decides how to get at the data efficiently.
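As a hedged illustration of this Pig Latin plus compiler workflow (the script and file names below are hypothetical, not taken from the slides), a small script and the command that hands it to the compiler might look like:

-- wordcount.pig: a minimal Pig Latin script (hypothetical file and path names)
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcounts';

Running "pig wordcount.pig" (or "pig -x local wordcount.pig" for local mode) lets the compiler translate the script into one or more MapReduce jobs and execute them.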

16 BENEFITS
Ease of coding: complex tasks involving inter-related data transformations are explicitly encoded as data flow sequences, which makes complex programs easy to write.
Optimization: tasks are encoded in a way that lets them be easily optimized for execution, so users can concentrate on the data processing aspects without worrying about efficiency.
Extensibility: users can create their own custom / user-defined functions.

17 Why use Pig?

18 In MAP REDUCE

19 In Pig Latin
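The slide image is not reproduced in the transcript. As an assumption about the kind of comparison such slides usually make (not the slide's actual content), the MapReduce side is typically many lines of Java, while the Pig Latin side for a task like "top 5 sites visited by 18-25 year olds" fits in a few lines:

Users = LOAD 'users' AS (name, age);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Jnd   = JOIN Fltrd BY name, Pages BY user;
Grpd  = GROUP Jnd BY url;
Smmd  = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
Srtd  = ORDER Smmd BY clicks DESC;
Top5  = LIMIT Srtd 5;
STORE Top5 INTO 'top5sites';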

20 Example: Pig Latin is Procedural
Pig Latin is procedural, so it fits very naturally into the pipeline paradigm; SQL, on the other hand, is declarative.

21 Consider, for example, a simple pipeline, where data from the sources users and clicks is to be joined and filtered, then joined to data from a third source, geoinfo, aggregated, and finally stored into a table ValuableClicksPerDMA.

22 SQL Query

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using (ipaddr)
group by dma;

SQL is declarative, but not a step-by-step style.

23 The Pig Latin for this will look like:
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Pig Latin is procedural (dataflow programming model); the step-by-step query style is much cleaner and easier to write.

24 Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits:
User   Url         Time
Amy    cnn.com     8:00
       bbc.com     10:00
       flickr.com  10:05
Fred               12:00

Url Info:
Url         Category  PageRank
cnn.com     News      0.9
bbc.com               0.8
flickr.com  Photos    0.7
espn.com    Sports

25 Data Flow
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls

26 Dataflow Language
User specifies a sequence of steps, where each step specifies only a single high-level data transformation.
"The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data." (Jasmine Novak, Engineer, Yahoo!)

27 Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

Operates directly over files

28 Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

Schemas optional; can be assigned dynamically

29 User-Code as a First-Class Citizen
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

User-defined functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach

30 UDFs as First-Class Citizens
User Defined Functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach

Example 2: Suppose we want to find, for each category, the top 10 urls according to pagerank:

groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);
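Here top10 is a user-defined function. As a hedged sketch of how such a UDF is wired into a script (the jar and class names below are hypothetical), Pig's REGISTER and DEFINE statements bind the name before it is used; note that in current Pig the grouping key inside the FOREACH is referred to as group:

REGISTER myudfs.jar;                    -- hypothetical jar containing the UDF implementation
DEFINE top10 com.example.pig.Top10();   -- hypothetical Java class implementing the UDF
groups = GROUP urls BY category;
output = FOREACH groups GENERATE group, top10(urls);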

31 Data Model
Tuple: an ordered set of fields. Example: (Raju, 30)
Bag: a collection of tuples. Example: {(Raju, 30), (Mohammad, 45)}
Map: a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]
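These types can also appear in a load schema. A minimal sketch, assuming a hypothetical input file 'data' and standard Pig schema syntax:

A = LOAD 'data' AS (t:tuple(x:int, y:int),   -- a tuple field
                    b:bag{r:tuple(v:int)},   -- a bag of tuples
                    m:map[]);                -- a map with untyped values
DESCRIBE A;                                  -- prints the schema of A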

32 Nested Data Model Pig Latin has a fully-nestable data model with:
Atomic values, tuples, bags (lists), and maps
More natural to programmers than flat tuples
Avoids expensive joins
(Figure: a nested value relating 'yahoo' to entries such as 'finance' and 'news')

33 Pig Latin – Relational Operations

34 Pig Latin – Relational Operations

35 UDFs as First-Class Citizens
User Defined Functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach

Example 2: Suppose we want to find, for each category, the top 10 urls according to pagerank:

groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);

36 Nested Data Model
Decouples grouping as an independent operation
(Figure: the Visits table (User, Url, Time) is grouped by url, producing for each url a nested bag of visit tuples, e.g. cnn.com with the visits by Amy at 8:00 and Fred at 12:00)

37 CoGroup
results (query, url, rank): (Lakers, nba.com, 1), (Lakers, espn.com, 2), (Kings, nhl.com, …)
revenue (query, adSlot, amount): (Lakers, top, 50), (Lakers, side, 20), (Kings, …, 30), (Kings, …, 10)
(Figure: cogrouping the two inputs by query gives, for each query, a bag of results tuples and a bag of revenue tuples)
Cross-product of the 2 bags would give natural join
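A minimal Pig Latin sketch of this operation (relation and field names taken from the figure; the load paths are hypothetical):

results = LOAD 'results' AS (query, url, rank);
revenue = LOAD 'revenue' AS (query, adSlot, amount);
grouped = COGROUP results BY query, revenue BY query;
-- each output tuple of grouped holds the key plus one bag of results tuples and one bag of revenue tuples;
-- flattening both bags takes their cross-product per key, i.e. the natural join on query:
joined  = FOREACH grouped GENERATE FLATTEN(results), FLATTEN(revenue);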

38 Outline
Map-Reduce and the need for Pig Latin
Pig Latin
Compilation into Map-Reduce
Example Generation
Future Work

39 Implementation
(Figure: a user submits SQL or Pig Latin; Pig automatically rewrites and optimizes it and runs it on a Hadoop or other Map-Reduce cluster)
Pig is open-source.
~50% of Hadoop jobs at Yahoo! are Pig; 1000s of jobs per day

40 Building a Logical Plan
The Pig interpreter first parses each Pig Latin command and verifies that the input files and bags being referred to are valid
It builds a logical plan for every bag that the user defines
Processing is triggered only when the user invokes a STORE command on a bag (at that point, the logical plan for that bag is compiled into a physical plan and executed)
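A hedged sketch of this lazy behaviour (file names are hypothetical): the first four statements only extend the logical plan, and nothing executes until the final STORE.

-- building the logical plan; no MapReduce job is launched yet
visits = LOAD '/data/visits' AS (user, url, time);
good   = FILTER visits BY url IS NOT NULL;
byUrl  = GROUP good BY url;
counts = FOREACH byUrl GENERATE group AS url, COUNT(good) AS n;
-- the STORE triggers compilation of the plan into a physical plan and its execution
STORE counts INTO '/data/visitCounts';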

41 Compilation into Map-Reduce
Every group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases
(Figure: the dataflow Load Visits, Group by url, Foreach url generate count, Load Url Info, Join on url, Group by category, Foreach category generate top10(urls), carved into three map-reduce jobs (Map1/Reduce1, Map2/Reduce2, Map3/Reduce3) at the group and join boundaries)

42 Optimizations: Skew Join
Default join method is symmetric hash join: the cross product for each key is carried out on one reducer
Problem if too many values share the same key
Skew join further splits such keys among multiple reducers
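In current Pig this optimization can be requested explicitly on a join; a hedged usage sketch, reusing the relations from the CoGroup example:

-- default symmetric hash join
J1 = JOIN results BY query, revenue BY query;
-- skew join: Pig samples the data and spreads heavily skewed keys across several reducers
J2 = JOIN results BY query, revenue BY query USING 'skewed';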

43 Optimizations: Multiple Data Flows
(Figure: a single Map1/Reduce1 job. Map1: Load Users, Filter bots, then both Group by state and Group by demographic. Reduce1: Apply UDFs for each grouping, Store into 'bystate' and Store into 'bydemo')

44 Optimizations: Multiple Data Flows
(Figure: the same plan with the shared work made explicit. Map1: Load Users, Filter bots, Split, Group by state and Group by demographic. Reduce1: Demultiplex, Apply UDFs, Store into 'bystate' and Store into 'bydemo')
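A hedged Pig Latin sketch of the kind of script that yields this plan (field names and the is_bot UDF are hypothetical); Pig compiles both branches into the single map-reduce job shown above:

users    = LOAD 'users' AS (name, state, demographic);
clean    = FILTER users BY NOT is_bot(name);   -- is_bot: hypothetical bot-detection UDF
byState  = GROUP clean BY state;
byDemo   = GROUP clean BY demographic;
outState = FOREACH byState GENERATE group, COUNT(clean);
outDemo  = FOREACH byDemo  GENERATE group, COUNT(clean);
STORE outState INTO 'bystate';
STORE outDemo  INTO 'bydemo';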

45 Other Optimizations
Carry data as byte arrays as far as possible
Using binary comparator for sorting
"Streaming" data through external executables

46 Performance

47 Outline
Map-Reduce and the need for Pig Latin
Pig Latin
Compilation into Map-Reduce
Example Generation
Future Work

48 Example Dataflow Program
Find users that tend to visit high-pagerank pages
(Dataflow: LOAD (user, url); FOREACH user, canonicalize(url); LOAD (url, pagerank); JOIN on url; GROUP on user; FOREACH user, AVG(pagerank); FILTER avgPR > 0.5)
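A hedged Pig Latin rendering of this dataflow (load paths are hypothetical; canonicalize is assumed to be a UDF, as in the slides):

visits = LOAD 'visits' AS (user, url);
pages  = LOAD 'pages' AS (url, pagerank);
canon  = FOREACH visits GENERATE user, canonicalize(url) AS curl;   -- canonicalize: assumed UDF
joined = JOIN canon BY curl, pages BY url;
byUser = GROUP joined BY user;
avgs   = FOREACH byUser GENERATE group AS user, AVG(joined.pagerank) AS avgPR;
result = FILTER avgs BY avgPR > 0.5;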

49 Iterative Process
(Dataflow: LOAD (user, url); FOREACH user, canonicalize(url); LOAD (url, pagerank); JOIN on url; GROUP on user; FOREACH user, AVG(pagerank); FILTER avgPR > 0.5)
No output. Joining on the right attribute? Bug in UDF canonicalize? Everything being filtered out?

50 How to do test runs?
Run with real data: too inefficient (TBs of data)
Create smaller data sets (e.g., by sampling): empty results due to joins [Chaudhuri et al. 99] and selective filters
Biased sampling for joins: indexes not always present

51 Examples to Illustrate Program
(Figure: concrete example tuples shown at each operator of the dataflow: visit tuples such as (Amy, cnn.com); pagerank tuples (…, 0.9), (…, 0.3), (…, 0.4); joined tuples (Amy, 0.9), (Amy, 0.3), (Fred, 0.4); per-user averages (Amy, 0.6), (Fred, 0.4); and finally (Amy, 0.6) surviving FILTER avgPR > 0.5. These examples are what someone would otherwise construct by hand when debugging or teaching.)

52 Value Addition From Examples
Examples can be used for:
Debugging
Understanding a program written by someone else
Learning a new operator, or a new language

53 Good Examples: Consistency
0. Consistency: output example = operator applied on input example
(Figure: the example dataflow annotated with input and output tuples illustrating this property)

54 Good Examples: Realism
1. Realism
(Figure: the example dataflow annotated with example tuples such as (Amy, cnn.com))

55 Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator, e.g., for FILTER avgPR > 0.5 show both a tuple that passes, (Amy, 0.6), and one that is filtered out, (Fred, 0.4)

56 Good Examples: Conciseness
3. Conciseness
(Figure: the example dataflow annotated with a small number of example tuples)

57 Implementation Status
Available as ILLUSTRATE command in open-source release of Pig
Available as Eclipse Plugin (PigPen)
See SIGMOD09 paper for algorithm and experiments
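A hedged usage sketch from Pig's Grunt shell (the aliases are illustrative):

grunt> visits = LOAD 'visits' AS (user, url);
grunt> good   = FILTER visits BY url IS NOT NULL;
grunt> ILLUSTRATE good;   -- prints a small set of example tuples at each step of the plan for good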

58 Related Work
Sawzall: data processing language on top of map-reduce; rigid structure of filtering followed by aggregation
Hive: SQL-like language on top of Map-Reduce
DryadLINQ: SQL-like language on top of Dryad
Nested data models: object-oriented databases

59 Future / In-Progress Tasks
Columnar-storage layer
Metadata repository
Profiling and Performance Optimizations
Tight integration with a scripting language: use loops, conditionals, functions of host language
Memory Management
Project Suggestions at:

60 Credits

61 Summary
Big demand for parallel data processing
Emerging tools that do not look like SQL DBMS
Programmers like dataflow pipes over static files
Hence the excitement about Map-Reduce
But Map-Reduce is too low-level and rigid
Pig Latin: sweet spot between map-reduce and SQL

