Pig Latin CS 6800 Utah State University

Presentation transcript:

Pig Latin CS 6800 Utah State University

Writing MapReduce Jobs
  Higher-order functions: map applies a function to every element of a list.
  Example list: [1, 2, 3, 4]
  Goal: square each number in the list.
  Write the function f(x) = x*x
  Compute [f(1), f(2), f(3), f(4)] = [1, 4, 9, 16]
  map function signature: (a -> b) -> [a] -> [b]
  Haskell specification:
    map f [] = []
    map f (x:xs) = (f x) : (map f xs)
  Call the function:
    map (\x -> x * x) [1, 2, 3, 4]
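The specification on this slide can be tried directly in GHC. The following is a minimal sketch; the names `myMap` and `squares` are chosen here for illustration (the Prelude already exports `map`, so the slide's name would clash).

```haskell
-- myMap is a hypothetical name; the Prelude's own map behaves the same way.
myMap :: (a -> b) -> [a] -> [b]
myMap _ []     = []
-- (:) is list cons: apply f to the head, then recurse on the tail.
myMap f (x:xs) = f x : myMap f xs

-- Square each number in the example list.
squares :: [Int]
squares = myMap (\x -> x * x) [1, 2, 3, 4]
```

In GHCi, `squares` evaluates to `[1,4,9,16]`.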

Reduce
  Reduce converts a list into a scalar.
  Example list: [1, 2, 3, 4]
  Goal: sum the numbers in the list.
  Write the function g(x,y) = x+y
  Compute g(1,g(2,g(3,g(4,0)))) = 10
  reduce signature: (a -> b -> b) -> b -> [a] -> b
  Haskell specification:
    reduce g c [] = c
    reduce g c (x:xs) = g x (reduce g c xs)
  Call the function:
    reduce (\x y -> x + y) 0 [1, 2, 3, 4]
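As a sketch, reduce can be written as a right fold, which is what the Prelude calls `foldr`. The name `total` is introduced here only for the example; the combining function takes two arguments, the current element and the result for the rest of the list.

```haskell
-- reduce is the slide's name for a right fold (foldr in the Prelude).
reduce :: (a -> b -> b) -> b -> [a] -> b
reduce _ c []     = c
reduce g c (x:xs) = g x (reduce g c xs)

-- Sum the example list, starting from 0.
total :: Int
total = reduce (\x y -> x + y) 0 [1, 2, 3, 4]
```

In GHCi, `total` evaluates to `10`, matching g(1,g(2,g(3,g(4,0)))).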

Use in Cloud Computing
  Map can be used to clean data and "group" it.
  Suppose a list of words:
    words = [Bat Volcano bat vulcano]
  Map to lower case:
    lcase = map lowercase words
  Map to correct spelling:
    s = map spellFix lcase
  Count each word:
    groups = map (\x -> (x, 1)) s
  groups is [(bat, 1), (volcano, 1), (bat, 1) …

Use in Cloud Computing (continued)
  Shuffle collects tuples with the same "group" value.
  Reduce combines the counts:
    result = reduce + 0 groups
  Problem: MapReduce jobs are written in a programming language (e.g., Java)
    Complicated
    Not reusable
    Database-like operations are common
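The three phases of the word-count example can be sketched on in-memory lists as follows. This is only an illustration of the idea, not how Hadoop runs it; `spellFix` is a stand-in with a single hard-coded correction, and the names `mapPhase`, `shuffle`, `reducePhase`, `wordCount`, and `sampleWords` are assumptions introduced here.

```haskell
import Data.Char (toLower)
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Stand-in spelling fixer (assumption): one hard-coded correction.
spellFix :: String -> String
spellFix "vulcano" = "volcano"
spellFix w         = w

-- Map phase: lower-case, fix spelling, and emit a (word, 1) pair.
mapPhase :: [String] -> [(String, Int)]
mapPhase ws = map (\w -> (w, 1)) (map spellFix (map (map toLower) ws))

-- Shuffle phase: collect the counts that share the same word.
shuffle :: [(String, Int)] -> [(String, [Int])]
shuffle pairs =
  [ (k, v : map snd rest)
  | ((k, v) : rest) <- groupBy ((==) `on` fst) (sortOn fst pairs) ]

-- Reduce phase: sum the counts within each group.
reducePhase :: [(String, [Int])] -> [(String, Int)]
reducePhase groups = [ (k, foldr (+) 0 vs) | (k, vs) <- groups ]

wordCount :: [String] -> [(String, Int)]
wordCount = reducePhase . shuffle . mapPhase

-- The word list from the slide.
sampleWords :: [String]
sampleWords = ["Bat", "Volcano", "bat", "vulcano"]
```

In GHCi, `wordCount sampleWords` evaluates to `[("bat",2),("volcano",2)]`.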

CouchDB - Count People per Gender

Pig Latin
  Yahoo: 40% of Hadoop jobs run using Pig
  Platform for analyzing massive data sets
  Runs on Hadoop (Map/Reduce)
  Version 0.12

What is Pig Latin?
  Dataflow language
  Non-1NF data model
    Tuples
    Sets
    Bags
  Uses relational algebra-like operations to manipulate data
    Joins
    Filter (selection)
    Generate (projection)
  Compiles to MapReduce jobs on a Hadoop cluster

Pig Latin Features
  A dataflow (NoSQL) language
    SQL is declarative; most programming languages are not
    SQL is poor at expressing workflow
  Non-1NF data model
    Bags, sets, tuples, maps
  Data resides in read-only files
  Schema-less

Example
  Count subscribers in each city
    A = LOAD 'subscribers.txt' AS (name: chararray, city: chararray, amount: int);
    B = GROUP A BY city;
    C = FOREACH B GENERATE group, COUNT(A.name);
    DUMP C;
  After GROUP, each tuple of B has two fields: group (the city) and A (the bag of rows for that city).
  Dataflow: LOAD … (A) -> GROUP A … (B) -> FOREACH B … (C)
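To make the grouping step concrete, here is a rough Haskell model of the same dataflow on in-memory lists. It only mirrors the semantics (group rows by city, then count each bag); it is not how Pig executes the script. The function names and `sampleRows` are assumptions, with the three rows borrowed from the Subscribers data shown later in the deck and amounts stored as whole dollars.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- A subscriber row: (name, city, amount), mirroring the LOAD schema.
type Subscriber = (String, String, Int)

city :: Subscriber -> String
city (_, c, _) = c

-- GROUP A BY city: pair each city with the bag of rows for that city.
groupByCity :: [Subscriber] -> [(String, [Subscriber])]
groupByCity rows =
  [ (city r, r : rs)
  | (r : rs) <- groupBy ((==) `on` city) (sortOn city rows) ]

-- FOREACH ... GENERATE: one (city, count) pair per group.
countPerCity :: [Subscriber] -> [(String, Int)]
countPerCity rows = [ (c, length bag) | (c, bag) <- groupByCity rows ]

-- Hypothetical sample rows for illustration.
sampleRows :: [Subscriber]
sampleRows =
  [ ("Maya", "Logan", 20)
  , ("Jose", "Logan", 15)
  , ("Knut", "Ogden", 20) ]
```

In GHCi, `countPerCity sampleRows` evaluates to `[("Logan",2),("Ogden",1)]`.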

Compilation
  [Figure: a Pig Latin program is fed to the Pig Latin compiler, which produces Hadoop MapReduce jobs; the jobs are drawn as repeated Map, Reduce, and HDFS stages leading to the result.]

Data Transformations
  Relational algebra-like
    JOIN (inner and outer joins)
    FILTER (selection)
    FOREACH (projection)
    CROSS (product)
    UNION
  SQL-like
    DISTINCT, LIMIT, ORDER BY, GROUP
  Non-traditional
    COGROUP, MAPREDUCE, FLATTEN, RANK, STREAM, SAMPLE, SPLIT

Magazine Subscriber Data
  Subscribers (Name, City, Amt, Id)
    (Maya, Logan, $20, 1)
    (Jose, Logan, $15, 2)
    (Knut, Ogden, $20, 3)
    ...
  Personal Information (Name, , Id)
    (Maya, 5)
    (Jose, 6)
    (Knut, 7)
    ...

FILTER
  A filter restricts the result.
    /* Restrict to Logan subscribers */
    X = FILTER Subscribers BY city == 'Logan';
  FILTER example
  Subscribers (Name, City, Amt, Id)
    (Maya, Logan, $20, 1)
    (Jose, Logan, $15, 2)
    (Knut, Ogden, $20, 3)
    ...
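The selection on this slide corresponds to an ordinary list filter. A small sketch, assuming only the three rows shown on the slide (the elided rows are left out) and amounts stored as whole dollars; `subscribers` and `loganOnly` are names introduced here.

```haskell
-- (name, city, amount in dollars, id)
type Subscriber = (String, String, Int, Int)

-- The three rows shown on the slide; any further rows are omitted.
subscribers :: [Subscriber]
subscribers =
  [ ("Maya", "Logan", 20, 1)
  , ("Jose", "Logan", 15, 2)
  , ("Knut", "Ogden", 20, 3) ]

-- Keep only the rows whose city field equals "Logan".
loganOnly :: [Subscriber]
loganOnly = filter (\(_, c, _, _) -> c == "Logan") subscribers
```

In GHCi, `loganOnly` evaluates to `[("Maya","Logan",20,1),("Jose","Logan",15,2)]`.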

Magazine Subscriber Data
  Subscribers (Name, City, Amt, Id)
    (Maya, Logan, $20, 1)
    (Jose, Logan, $15, 2)
    (Knut, Ogden, $20, 3)
    ...
  Personal Information (Name, , Id)
    (Maya, 5)
    (Jose, 6)
    (Knut, 7)
    ...
  B = JOIN Subscribers BY name, PerInfo BY name;

Magazine Subscriber Data
  B (Name, City, Amt, Id, Name, , Id)
    (Maya, Logan, $20, 1, Maya, 5)
    (Jose, Logan, $15, 2, Jose, 6)
    (Knut, Ogden, $20, 3, Knut, 7)
    ...
  B = JOIN Subscribers BY name, PerInfo BY name;
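The join result can be reproduced with a list comprehension that pairs every subscriber row with every personal-information row of the same name. This is a naive nested-loop sketch of the semantics only; Pig compiles joins to MapReduce jobs instead. The personal-information rows are modelled with just the two fields visible on the slide, amounts are whole dollars, and all names below are assumptions.

```haskell
-- (name, city, amount in dollars, id)
type Subscriber = (String, String, Int, Int)
-- (name, id): only the two fields visible on the slide.
type PerInfo = (String, Int)

subscribers :: [Subscriber]
subscribers =
  [ ("Maya", "Logan", 20, 1)
  , ("Jose", "Logan", 15, 2)
  , ("Knut", "Ogden", 20, 3) ]

perInfo :: [PerInfo]
perInfo = [("Maya", 5), ("Jose", 6), ("Knut", 7)]

-- Inner join on name: keep each pair of rows whose names match,
-- and concatenate their fields (the name appears twice, as on the slide).
joinByName :: [Subscriber] -> [PerInfo]
           -> [(String, String, Int, Int, String, Int)]
joinByName subs infos =
  [ (n, c, amt, i, n', i')
  | (n, c, amt, i) <- subs
  , (n', i') <- infos
  , n == n' ]
```

In GHCi, `joinByName subscribers perInfo` yields the three rows of B shown above, starting with `("Maya","Logan",20,1,"Maya",5)`.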

Optimization
  [Figure: dataflow graph over relations A through E with FILTER, JOIN, FILTER, and CROSS steps, compiled to Map/Reduce.]