Pig Latin: A Not-So-Foreign Language For Data Processing. Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins (Yahoo! Research).

Data Processing Renaissance
- Internet companies are swimming in data, e.g. TBs/day at Yahoo!
- Data analysis is the "inner loop" of product innovation
- Data analysts are skilled programmers

Type of processing for data analysis [My Slide]
- Ad-hoc
- Large data sets
- Scan-oriented
- Offline

Map-Reduce vs. Data Warehousing [My Slide]

  Map-Reduce                                | Data Warehouse
  ------------------------------------------|-----------------------------------
  Easy to code (programmers prefer this!)   | Everything is a SQL query
  Choice of language (Java, Python, ...)    | Need to use T-SQL (not intuitive)
  Parallelism is managed by the system      | Parallelism is tricky
  Open source                               | Expensive (Teradata, Netezza)
  Code is difficult to reuse and maintain   | Code can be reused
  No self-describing input/output formats   | Formats are defined by schema
  Joins are cumbersome                      | Joins are easy to do

New Systems For Data Analysis
- Map-Reduce
- Apache Hadoop
- Dryad
- ...

Pig Latin ... what? [My slide]
- Pig Latin is the high-level language
- Pig is the system that compiles Pig Latin programs down into Map-Reduce jobs that run on Hadoop

Map-Reduce
[Diagram: input records are mapped to (key, value) pairs, grouped by key, and reduced to output records]
Just a group-by-aggregate?
  SELECT key, F(value) FROM Input GROUP BY key
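Not on the original slide, but a minimal sketch of that same group-by-aggregate written step by step in Pig Latin; the file path and field names are placeholders, and COUNT stands in for the generic aggregate F:

  -- sketch only: '/data/input', k, v are placeholders; COUNT stands in for F(value)
  records = load '/data/input' as (k, v);
  grouped = group records by k;
  result  = foreach grouped generate group, COUNT(records) as f_of_v;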

Example Data Analysis Task
Find the top 10 most visited pages in each category

  Visits:
  User | Url        | Time
  Amy  | cnn.com    | 8:00
  Amy  | bbc.com    | 10:00
  Amy  | flickr.com | 10:05
  Fred | cnn.com    | 12:00

  Url Info:
  Url        | Category | PageRank
  cnn.com    | News     | 0.9
  bbc.com    | News     | 0.8
  flickr.com | Photos   | 0.7
  espn.com   | Sports   | 0.9

Data Flow
  Load Visits -> Group by url -> Foreach url generate count
  Load Url Info
  Join on url (visit counts + Url Info) -> Group by category -> Foreach category generate top10 urls

In Pig Latin [My Slide ... somewhat]

  visits      = load '/data/visits' as (user, url, time);
  gVisits     = group visits by url;
  visitCounts = foreach gVisits generate url, count(visits);

  urlInfo     = load '/data/urlInfo' as (url, category, pRank);
  visitCounts = join visitCounts by url, urlInfo by url;

  gCategories = group visitCounts by category;
  topUrls     = foreach gCategories generate top(visitCounts, 10);

  store topUrls into '/data/topUrls';

- Operates directly over files, with an optional schema
- Progress can be tracked
- High level: the program states the WHAT, not the HOW

Step-by-Step Procedural Control
- Target users are entrenched procedural programmers
- Automatic query optimization is hard
- Pig Latin does not preclude optimization

"The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data."
  Jasmine Novak, Engineer, Yahoo!

"With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it's more powerful."
  David Ciemiewicz, Search Excellence, Yahoo!
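Not from the slides, but a small sketch of what step-by-step control buys you: a new filtering step (isSpam is a hypothetical user-defined function) can be spliced into the earlier pipeline without rewriting a single monolithic query.

  -- sketch only: isSpam is a hypothetical UDF
  visits      = load '/data/visits' as (user, url, time);
  goodVisits  = filter visits by not isSpam(url);               -- new step dropped in
  gVisits     = group goodVisits by url;
  visitCounts = foreach gVisits generate group as url, COUNT(goodVisits) as cnt;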

Nested Data Model
- Pig Latin has a fully nestable data model with atomic values, tuples, bags (lists), and maps
- More natural to programmers than flat tuples
[Diagram: nested example, e.g. 'yahoo' mapped to a bag of values such as (finance), (news)]
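A small sketch (not from the slides) of how such a nested schema can be declared when loading data; the file and field names are made up for illustration:

  -- hypothetical nested schema: each record carries a bag of clicks and a map of metadata
  queries = load '/data/query_log' as (
      userId:   chararray,
      clicks:   bag{ t: (url: chararray, position: int) },
      metadata: map[]
  );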

Compilation into Map-Reduce
The data flow above is split into three map-reduce jobs:
  Map 1 / Reduce 1: Load Visits, Group by url, Foreach url generate count
  Map 2 / Reduce 2: Load Url Info, Join on url
  Map 3 / Reduce 3: Group by category, Foreach category generate top10(urls)
- Every group or join operation forms a map-reduce boundary
- Other operations are pipelined into the map and reduce phases
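To see how Pig actually splits a given script into jobs, Apache Pig's EXPLAIN statement prints the logical, physical, and map-reduce plans for an alias; a quick sketch against the earlier script (assuming topUrls has been defined in the same session):

  explain topUrls;   -- prints the logical, physical, and map-reduce plans for topUrls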

Other Constructs [My Slide]

LOAD
  queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);

FOREACH ... GENERATE
  expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);

FILTER
  real_queries = FILTER queries BY NOT isBot(userId);

FLATTEN
  map_result = FOREACH input GENERATE FLATTEN(map(*));

STORE
  STORE query_revenues INTO 'myoutput' USING myStore();

COGROUP [my slide]
- Groups two (or more) data sets on a key while keeping each input in its own bag, so each input can be aggregated differently afterwards
- Cumbersome to express in SQL
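A sketch of COGROUP, loosely following the search-results/revenue example from the paper; file paths and field names are assumptions:

  -- cogroup keeps the two inputs as separate bags per key
  results  = load '/data/results' as (queryString, url, position);
  revenue  = load '/data/revenue' as (queryString, adSlot, amount);
  grouped  = cogroup results by queryString, revenue by queryString;
  perQuery = foreach grouped generate group,
                                      COUNT(results)      as numResults,
                                      SUM(revenue.amount) as totalRevenue;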

Pig Pen
[Screenshot: Pig Pen, the interactive debugging environment, showing a small sandbox data set flowing through each step of a script]
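Apache Pig later exposed a similar sandbox idea as the ILLUSTRATE statement, which fabricates a small example data set and shows it at every step of the script; a quick sketch, assuming the earlier topUrls alias:

  illustrate topUrls;   -- shows example tuples flowing into and out of each operator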

Discussion
- Not great for matrix or graph operations
- Didn't mention how Pig can be scripted, which is useful for re-running processing
- The process of obtaining the sandbox data set (for Pig Pen) is interesting
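Since scripting comes up here: a sketch of how a Pig Latin script is typically saved and re-run with a parameter; the file names and the $DATE parameter are assumptions, and the shell invocation appears only as a comment:

  -- daily_top_urls.pig: run with, e.g.,  pig -param DATE=2008-05-01 daily_top_urls.pig
  visits      = load '/data/visits/$DATE' as (user, url, time);
  gVisits     = group visits by url;
  visitCounts = foreach gVisits generate group as url, COUNT(visits) as cnt;
  store visitCounts into '/data/visitCounts/$DATE';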