Pig and pig latin: An Introduction

Slides:



Advertisements
Similar presentations
Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh.
Advertisements

How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations Thejas Nair pig Yahoo! Apache pig.
Pig Optimization and Execution Page 1 Alan F. © Hortonworks Inc
© Hortonworks Inc Daniel Dai Thejas Nair Page 1 Making Pig Fly Optimizing Data Processing on Hadoop.
Alan F. Gates Yahoo! Pig, Making Hadoop Easy Who Am I? Pig committer Hadoop PMC Member An architect in Yahoo! grid team Or, as one coworker put.
Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olsten, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Acknowledgement.
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
The Hadoop Stack, Part 1 Introduction to Pig Latin CSE – Cloud Computing – Fall 2014 Prof. Douglas Thain University of Notre Dame.
Presented By: Imranul Hoque
MapReduce and databases Data-Intensive Information Processing Applications ― Session #7 Jimmy Lin University of Maryland Tuesday, March 23, 2010 This work.
Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research Shimin Chen Big.
Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.
Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing.
Pig Latin Olston, Reed, Srivastava, Kumar, and Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD Shahram Ghandeharizadeh.
The Pig Experience: Building High-Level Data flows on top of Map-Reduce The Pig Experience: Building High-Level Data flows on top of Map-Reduce DISTRIBUTED.
Cloud Computing Other Mapreduce issues Keke Chen.
CS 345D Semih Salihoglu (some slides are copied from Ilan Horn, Jeff Dean, and Utkarsh Srivastava’s presentations online) MapReduce System and Theory 1.
HADOOP ADMIN: Session -2
Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.
Presenters: Abhishek Verma, Nicolas Zea.  Map Reduce  Clean abstraction  Extremely rigid 2 stage group-by aggregation  Code reuse and maintenance.
Pig: Making Hadoop Easy Wednesday, June 10, 2009 Santa Clara Marriott.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Yahoo! Research.
Big Data Analytics Training
Pig Latin CS 6800 Utah State University. Writing MapReduce Jobs Higher order functions Map applies a function to a list Example list [1, 2, 3, 4] Want.
Introduction to Hadoop and HDFS
CSE 486/586 CSE 486/586 Distributed Systems Data Analytics Steve Ko Computer Sciences and Engineering University at Buffalo.
Storage and Analysis of Tera-scale Data : 2 of Database Class 11/24/09
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
Restore : Reusing results of mapreduce jobs Jun Fan.
Presented by Priagung Khusumanegara Prof. Kyungbaek Kim
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
MAP-REDUCE ABSTRACTIONS 1. Abstractions On Top Of Hadoop We’ve decomposed some algorithms into a map-reduce “workflow” (series of map-reduce steps) –
Hung-chih Yang 1, Ali Dasdan 1 Ruey-Lung Hsiao 2, D. Stott Parker 2
MapReduce 高层应用 & 频繁集挖掘算法的 MapReduce 实现 彭波 北京大学信息科学技术学院 7/14/2009.
Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.
Alan Gates Becoming a Pig Developer Who Am I? Pig committer Hadoop PMC Member Yahoo! architect for Pig.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
Big Data Infrastructure Week 3: From MapReduce to Spark (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
What is Pig ???. Why Pig ??? MapReduce is difficult to program. It only has two phases. Put the logic at the phase. Too many lines of code even for simple.
MapReduce Compilers-Apache Pig
Big Data Infrastructure
Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.
Pig, Making Hadoop Easy Alan F. Gates Yahoo!.
Hadoop.
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Gergely Lukács Pázmány Péter Catholic University
Spark SQL.
Pig : Building High-Level Dataflows over Map-Reduce
Spark SQL.
RDDs and Spark.
Introduction to MapReduce and Hadoop
Pig Latin - A Not-So-Foreign Language for Data Processing
Pig Latin: A Not-So-Foreign Language for Data Processing
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Big Data Computing Models Map-Reduce and Variants
MANAGING DATA RESOURCES
Overview of big data tools
Pig from Alan Gates’ book (In preparation for exam2)
Pig : Building High-Level Dataflows over Map-Reduce
CSE 491/891 Lecture 21 (Pig).
How to make your map-reduce jobs perform as well as pig: Lessons from pig optimizations Pig performance has been improving because of the optimizations.
Charles Tappert Seidenberg School of CSIS, Pace University
Introduction to MapReduce
Hadoop – PIG.
5/7/2019 Map Reduce Map reduce.
Big Data Technology: Introduction to Hadoop
Presentation transcript:

Pig and pig latin: An Introduction

Outline Map-Reduce and the need for Pig Latin Pig Latin example Salient features Implementation

Data Processing Renaissance Internet companies swimming in data Data analysis is “inner loop” of product innovation Data analysts are skilled programmers

Data Warehousing …? Often not scalable enough Scale Prohibitively expensive at web scale Up to $200K/TB $ $ $ $ Little control over execution method Query optimization is hard Parallel environment Little or no statistics Lots of UDFs SQL

The Map-Reduce Appeal Scalable due to simpler design Only parallelizable operations No transactions Scale $ Runs on cheap commodity hardware SQL Procedural Control- a processing “pipe”

Disadvantages M R M M R M 1. Extremely rigid data flow Other flows constantly hacked in M M R M Join, Union Split Chains 2. Common operations must be coded by hand Join, filter, projection, aggregates, sorting, distinct 3. Semantics hidden inside map-reduce functions Difficult to maintain, extend, and optimize

Control over execution Pros And Cons Need a high-level, general data flow language Scalable Cheap Control over execution Inflexible Lots of hand coding Semantics hidden

Control over execution Enter Pig Latin Need a high-level, general data flow language Scalable Cheap Control over execution Pig Latin

Outline Map-Reduce and the need for Pig Latin Pig Latin example Salient features Implementation

Example Data Analysis Task Find the top 10 most visited pages in each category Visits Url Info User Url Time Amy cnn.com 8:00 bbc.com 10:00 flickr.com 10:05 Fred 12:00 Url Category PageRank cnn.com News 0.9 bbc.com 0.8 flickr.com Photos 0.7 espn.com Sports

Data Flow Load Visits Group by url Foreach url Load Url Info generate count Load Url Info Join on url Group by category Foreach category generate top10 urls

In Pig Latin visits = load ‘/data/visits’ as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(visits); urlInfo = load ‘/data/urlInfo’ as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into ‘/data/topUrls’;

Outline Map-Reduce and the need for Pig Latin Pig Latin example Salient features Implementation

Quick Start and Interoperability visits = load ‘/data/visits’ as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(urlVisits); urlInfo = load ‘/data/urlInfo’ as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into ‘/data/topUrls’; Operates directly over files

Quick Start and Interoperability visits = load ‘/data/visits’ as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(urlVisits); urlInfo = load ‘/data/urlInfo’ as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into ‘/data/topUrls’; Schemas optional; Can be assigned dynamically

User-Code as a First-Class Citizen visits = load ‘/data/visits’ as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(urlVisits); urlInfo = load ‘/data/urlInfo’ as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into ‘/data/topUrls’; User-defined functions (UDFs) can be used in every construct Load, Store Group, Filter, Foreach

Nested Data Model Pig Latin has a fully-nestable data model with: Atomic values, tuples, bags (lists), and maps More natural to programmers than flat tuples Avoids expensive joins See paper yahoo , finance email news

Outline Map-Reduce and the need for Pig Latin Pig Latin example Novel features Implementation

Implementation SQL Pig Pig is open-source. http://pig.apache.org/ user automatic rewrite + optimize Pig Pig is open-source. http://pig.apache.org/ or or Hadoop Map-Reduce cluster

Compilation into Map-Reduce Every group or join operation forms a map-reduce boundary Load Visits Group by url Reduce1 Map2 Foreach url generate count Load Url Info Join on url Reduce2 Map3 Other operations pipelined into map and reduce phases Group by category Reduce3 Foreach category generate top10(urls)