The Pig Latin Dataflow Language: A Brief Overview
James Jolly, University of Wisconsin-Madison


What is Pig Latin?
set-oriented data transformation language
  - primitives filter, combine, split, and order data
  - users describe transformations in steps
  - steps bundled into queries
  - each set transformation is stateless
flexible data model
  - nested bags of tuples
  - semi-structured datatypes
extensible
  - supports user-defined functions

How is it used in practice?
useful for computations across large, distributed datasets
  - abstracts away details of the execution framework
  - users can change the order of steps to improve performance
often used in tandem with Hadoop and HDFS
  - transformations converted to MapReduce dataflows
  - HDFS tracks where data is stored
  - operations scheduled near their data

An example...
Given two datasets:
  - list of words and their frequency of appearance on webpages
  - list of users and webpages they visit
Let's find words users might be interested in lately.

Dataset: words and their frequency of appearance...
  website           word        frequency   date
  news.bbc.co.uk    obama
  abcnews.go.com    scheme
  abcnews.go.com    bombing
  abcnews.go.com    congress

Dataset: webpages users visit...
  website           user
                    bill
  news.bbc.co.uk    mike
                    mike
                    bill
                    drew
                    james
  abcnews.go.com    james

Loading word frequency data...
  freqs = LOAD '/home/jolly/TestData/NewsWords.txt' USING PigStorage(',')
          AS (website_indexed, word, freq, date);
  (news.bbc.co.uk, obama, 0.010, )
  (abcnews.go.com, scheme, 0.025, )
  (abcnews.go.com, bombing, 0.021, )
  ( bush, 0.001, )
  ( mccain, 0.031, )
  ( obama, 0.001, )
  ( bush, 0.012, )
  (abcnews.go.com, congress, 0.002, )
  ( bush, 0.012, )
  ( bush, 0.001, )
  ( abortion, 0.001, )
  ( attack, 0.010, )
  ( obama, 0.005, )
  ( economy, 0.038, )
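
As a rough Python model of what the LOAD step does: split each input line on the delimiter and collect the pieces into a bag of tuples. The sample rows, site names, and date encoding below are made up for illustration and are not the slide's data; fields stay as strings because the LOAD statement declares no types.

```python
import csv
import io

# Hypothetical comma-delimited input; values are illustrative only.
SAMPLE = """site-a.example,obama,0.010,20080101
site-b.example,scheme,0.025,20080102
site-b.example,scheme,0.025,20080102
"""

# Field names taken from the AS clause of the LOAD statement.
FIELDS = ("website_indexed", "word", "freq", "date")


def load(text, delimiter=","):
    """Split each non-empty line on the delimiter; return a list of tuples."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [tuple(row) for row in reader if row]


freqs = load(SAMPLE)
for record in freqs:
    # Pair each value with its field name for readability.
    print(dict(zip(FIELDS, record)))
```

The duplicate third row is deliberate: loading does no cleaning, so repeats survive until a later step removes them.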

Hmm, we have some repeats...
  (news.bbc.co.uk, obama, 0.010, )
  (abcnews.go.com, scheme, 0.025, )
  (abcnews.go.com, bombing, 0.021, )
  ( bush, 0.001, )
  ( mccain, 0.031, )
  ( obama, 0.001, )
  ( bush, 0.012, )
  (abcnews.go.com, congress, 0.002, )
  ( bush, 0.012, )
  ( bush, 0.001, )
  ( abortion, 0.001, )
  ( attack, 0.010, )
  ( obama, 0.005, )
  ( economy, 0.038, )

Duplicate data no more!
  distinct_freqs = DISTINCT freqs;
  ( obama, 0.001, )
  ( mccain, 0.031, )
  (abcnews.go.com, scheme, 0.025, )
  (abcnews.go.com, bombing, 0.021, )
  (abcnews.go.com, congress, 0.002, )
  (news.bbc.co.uk, obama, 0.010, )
  ( bush, 0.001, )
  ( economy, 0.038, )
  ( attack, 0.010, )
  ( abortion, 0.001, )
  ( bush, 0.012, )
  ( obama, 0.005, )
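
A minimal Python sketch of the DISTINCT step, assuming tuples count as duplicates only when every field matches. The input rows are hypothetical. The result is sorted purely to make the output deterministic; the slide's output shows that the original order is not preserved.

```python
def distinct(bag):
    """Drop exact duplicate tuples; sort only for reproducible output."""
    return sorted(set(bag))


# Hypothetical rows: (website_indexed, word, freq)
freqs = [
    ("site-b.example", "scheme", 0.025),
    ("site-a.example", "obama", 0.010),
    ("site-b.example", "scheme", 0.025),  # exact repeat of the first row
]

distinct_freqs = distinct(freqs)
print(distinct_freqs)
```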

Hmm, these tuples are old...
  ( obama, 0.001, )
  ( mccain, 0.031, )
  (abcnews.go.com, scheme, 0.025, )
  (abcnews.go.com, bombing, 0.021, )
  (abcnews.go.com, congress, 0.002, )
  (news.bbc.co.uk, obama, 0.010, )
  ( bush, 0.001, )
  ( economy, 0.038, )
  ( attack, 0.010, )
  ( abortion, 0.001, )
  ( bush, 0.012, )
  ( obama, 0.005, )

... and these (green) tuples are not very significant.
  ( obama, 0.001, )
  ( mccain, 0.031, )
  (abcnews.go.com, scheme, 0.025, )
  (abcnews.go.com, bombing, 0.021, )
  (abcnews.go.com, congress, 0.002, )
  (news.bbc.co.uk, obama, 0.010, )
  ( bush, 0.001, )
  ( economy, 0.038, )
  ( attack, 0.010, )
  ( abortion, 0.001, )
  ( bush, 0.012, )
  ( obama, 0.005, )

Let's filter them out.
  important_freqs = FILTER distinct_freqs BY date > AND freq > 0.002;
  ( mccain, 0.031, )
  (abcnews.go.com, scheme, 0.025, )
  (abcnews.go.com, bombing, 0.021, )
  (news.bbc.co.uk, obama, 0.010, )
  ( economy, 0.038, )
  ( attack, 0.010, )
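
The FILTER step keeps only tuples for which the whole condition holds. Below is a small Python sketch of that behavior. The date cutoff in the transcript is not shown, so `MIN_DATE` here is a hypothetical placeholder, as are the integer date encoding and the sample rows; only the 0.002 frequency threshold comes from the slide.

```python
from collections import namedtuple

Freq = namedtuple("Freq", ["website_indexed", "word", "freq", "date"])

MIN_DATE = 20080101  # hypothetical cutoff; the slide's actual value is not shown
MIN_FREQ = 0.002     # threshold from the slide's FILTER statement


def important(bag, min_date=MIN_DATE, min_freq=MIN_FREQ):
    """Keep tuples that are both recent enough and frequent enough."""
    return [t for t in bag if t.date > min_date and t.freq > min_freq]


# Hypothetical rows for illustration.
distinct_freqs = [
    Freq("site-a.example", "obama", 0.010, 20080301),   # kept
    Freq("site-a.example", "bush", 0.001, 20080301),    # frequency too low
    Freq("site-b.example", "scheme", 0.025, 20070101),  # too old
]

important_freqs = important(distinct_freqs)
print(important_freqs)
```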

Hmm, we don't need these anymore...
  ( mccain, 0.031, )
  (abcnews.go.com, scheme, 0.025, )
  (abcnews.go.com, bombing, 0.021, )
  (news.bbc.co.uk, obama, 0.010, )
  ( economy, 0.038, )
  ( attack, 0.010, )

Let's project them out.
  websites_to_words = FOREACH important_freqs GENERATE website_indexed, word;
  ( mccain)
  (abcnews.go.com, scheme)
  (abcnews.go.com, bombing)
  (news.bbc.co.uk, obama)
  ( economy)
  ( attack)
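
Projection with FOREACH ... GENERATE amounts to building a new, narrower tuple from each input tuple. A minimal Python sketch, using field positions instead of names and hypothetical sample rows:

```python
def project(bag, *positions):
    """Build a new tuple from the chosen field positions of each input tuple."""
    return [tuple(t[i] for i in positions) for t in bag]


# Hypothetical rows: (website_indexed, word, freq, date)
important_freqs = [
    ("site-a.example", "obama", 0.010, 20080301),
    ("site-b.example", "scheme", 0.025, 20080302),
]

# Keep only website_indexed (position 0) and word (position 1).
websites_to_words = project(important_freqs, 0, 1)
print(websites_to_words)
```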

Now we are ready to join our lists.
Websites to Users
  (news.bbc.co.uk, mike)
  ( mike)
  ( bill)
  ( drew)
  ( james)
  (abcnews.go.com, james)
Websites to Words
  ( mccain)
  (abcnews.go.com, scheme)
  (abcnews.go.com, bombing)
  (news.bbc.co.uk, obama)
  ( economy)
  ( attack)

Joining on website: finding words interesting to users...
  users_to_words_equijoin = JOIN websites_to_users BY website_visited,
                                 websites_to_words BY website_indexed;
  users_to_words = FOREACH users_to_words_equijoin GENERATE user, word;
  (mike, mccain)
  (james, scheme)
  (james, bombing)
  (mike, obama)
  (bill, economy)
  (james, attack)
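
The equijoin followed by a projection can be modeled in Python as a hash join: index one input by its key, then look up each tuple of the other. This is only a sketch of the semantics, not of how Pig executes the join. The site names are hypothetical, and the field order (website first) is assumed from the slides.

```python
from collections import defaultdict


def equijoin(left, right, left_key, right_key):
    """Concatenate every pair of tuples whose key fields are equal."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Hypothetical inputs: (website_visited, user) and (website_indexed, word)
websites_to_users = [("site-a.example", "mike"), ("site-b.example", "james")]
websites_to_words = [
    ("site-a.example", "obama"),
    ("site-b.example", "scheme"),
    ("site-b.example", "bombing"),
]

joined = equijoin(websites_to_users, websites_to_words, 0, 0)
# Joined tuples are (website_visited, user, website_indexed, word);
# keep user (position 1) and word (position 3).
users_to_words = [(t[1], t[3]) for t in joined]
print(users_to_words)
```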

Let's group our results.
  interests = GROUP users_to_words BY user;
  (bill, {(bill, economy)})
  (mike, {(mike, mccain), (mike, obama)})
  (james, {(james, scheme), (james, bombing), (james, attack)})
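
GROUP produces one output tuple per key, holding the key and a nested bag of the complete original tuples. A small Python sketch using the (user, word) pairs shown on the join slide; the group order here follows first appearance and may differ from the order Pig prints.

```python
def group_by(bag, key_index):
    """Return (key, nested_bag) pairs; each nested bag keeps the full tuples."""
    groups = {}
    for t in bag:
        groups.setdefault(t[key_index], []).append(t)
    return list(groups.items())


users_to_words = [
    ("mike", "mccain"),
    ("james", "scheme"),
    ("james", "bombing"),
    ("mike", "obama"),
    ("bill", "economy"),
    ("james", "attack"),
]

interests = group_by(users_to_words, 0)
for user, bag in interests:
    print(user, bag)
```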

How does it work?
logic factored into MapReduce jobs
  - mapper processes run on machines with input tuples
  - input tuples processed using MAP( ) function, producing intermediate tuples
  - intermediate tuples grouped together, transferred to reducer nodes
  - reducer processes consume intermediate tuples with REDUCE( ) function

Translating Pig Latin to MapReduce...
These statements can be executed using a single MapReduce job:
  transformed_by_map = FOREACH input_tuple GENERATE MAP(*);
  intermediate_tuple_partition = GROUP transformed_by_map BY input_tuple_key;
  result_tuples = FOREACH intermediate_tuple_partition GENERATE REDUCE(*);
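
The three statements map onto the map, shuffle, and reduce phases. The Python sketch below simulates that flow in a single process, assuming the map function emits (key, value) pairs; `run_mapreduce`, `map_fn`, and `reduce_fn` are illustrative names, not part of Pig or Hadoop. It counts words of interest per user from the (user, word) pairs on the join slide.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process simulation of one MapReduce job."""
    intermediate = defaultdict(list)
    # Map phase: each input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            # Shuffle: values with the same key end up in the same partition.
            intermediate[key].append(value)
    # Reduce phase: one call per key over all of that key's values.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def map_fn(record):
    user, word = record
    yield user, word


def reduce_fn(user, words):
    return len(words)


users_to_words = [
    ("mike", "mccain"), ("james", "scheme"), ("james", "bombing"),
    ("mike", "obama"), ("bill", "economy"), ("james", "attack"),
]

print(run_mapreduce(users_to_words, map_fn, reduce_fn))
```

In a real cluster the map calls run on many machines and the shuffle moves data across the network; the dictionary here stands in for both.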

Example message traffic... [figure]

Why Pig Latin? Why not a C library?
We could just supply MAP( ) and REDUCE( ) to a C library...
Pig Latin allows you to:
describe long tasks
  - in a friendly scripting language
use many built-in datatypes
  - support for semi-structured data
use many built-in functions
  - filters, projections, joins, unions, splits, etc.
  - tends to make user-defined functions simpler

Why Pig Latin? Why not SQL?
Pig Latin:
is imperative
  - lets users manually tune query execution plan
doesn't need a schema
  - can easily read, write, and represent semi-structured data

Pig Latin really describes a generic dataflow.
  inputs = LOAD 'input.txt';
  results = FILTER inputs BY IsBoring(important_attribute);
  STORE results INTO 'results.txt';
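
The same load, filter, store shape can be sketched in Python as a streaming pipeline over a file. `is_boring` is a hypothetical stand-in for the slide's IsBoring user-defined function, and the file names are temporary placeholders. As with FILTER ... BY, records for which the predicate is true are the ones kept.

```python
import os
import tempfile


def is_boring(record):
    """Hypothetical predicate: treat very short records as boring."""
    return len(record) < 3


def run_dataflow(in_path, out_path, predicate):
    """Load records line by line, keep those matching the predicate, store them."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = line.rstrip("\n")
            if predicate(record):
                dst.write(record + "\n")


with tempfile.TemporaryDirectory() as workdir:
    in_path = os.path.join(workdir, "input.txt")
    out_path = os.path.join(workdir, "results.txt")
    with open(in_path, "w") as f:
        f.write("ab\nsomething longer\nc\n")
    run_dataflow(in_path, out_path, is_boring)
    with open(out_path) as f:
        print(f.read().splitlines())
```

Swapping in a different predicate changes what flows through without touching the load or store steps, which is the point of describing the job as a dataflow.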

Summary
Pig Latin programs:
typically operate on large volumes of unstructured data
describe a dataflow between primitive operations
  - many RDBMS-like operations built into the language
  - custom operations can be provided by the user
  - user specifies order of operations
  - dataflows can be executed using MapReduce paradigm
Thanks for listening!