Design of Pig B. Ramamurthy

Pig’s data model
Scalar types: int, long, float (present in early versions; recently dropped), double, chararray, bytearray.
Complex types:
– Map: maps a chararray to any Pig element. A map constant such as [‘name’#’bob’, ‘age’#55] creates a map with two keys, name and age; the first value is a chararray and the second is an integer.
– Tuple: a fixed-length, ordered collection of Pig data elements, equivalent to a row in SQL. Because order matters, you can refer to elements by field position. (‘bob’, 55) is a tuple with two fields.
– Bag: an unordered collection of tuples; a tuple cannot be referenced by position. E.g. {(‘bob’,55), (‘sally’,52), (‘john’,25)} is a bag with 3 tuples. Bags may become large and may spill from memory to disk.
– Null: unknown or missing data; any data element can be null. (Java has null pointers, but the meaning is different in Pig: it is closer to SQL’s NULL.)
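The complex types above can also appear in a load schema. A minimal sketch, assuming a hypothetical input file ‘players’ and made-up field names:

```pig
-- hypothetical input: one record per player
players = load 'players' as (
    name:chararray,
    info:map[],                               -- e.g. ['team'#'NY', 'age'#55]
    position:tuple(x:int, y:int),             -- fixed-length, ordered
    scores:bag{t:(game:chararray, pts:int)}   -- unordered collection of tuples
);
```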

Pig schema
Very relaxed w.r.t. schema: the schema is defined at the time you load the data (Table 4-1). Runtime declaration of schemas is really nice: you can operate without metadata. On the other hand, metadata can be stored in a repository (HCatalog) and reused, for example for the JSON format, etc. Gently typed: between the two extremes of Java and Perl.

Schema Definition
divs = load ‘NYSE_dividends’ as (exchange:chararray, symbol:chararray, date:chararray, dividend:double);
Or, if you are lazy:
divs = load ‘NYSE_dividends’ as (exchange, symbol, date, dividend);
But what if the input data is really complex, e.g. JSON objects? One can keep a schema in HCatalog (an Apache incubator project), a metadata repository that facilitates reading/loading input data in other formats:
divs = load ‘mydata’ using HCatLoader();

Pig Latin Basics: keywords, relation names, field names. Keywords are not case sensitive, but relation and field names are! User-defined functions are also case sensitive. Comments: /* */ for blocks, or -- for a single line.
– Each processing step results in a new data set
– Relation name = data operation
– Field names start with a letter
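A minimal script illustrating these rules (the filter threshold is made up for illustration):

```pig
/* keywords may be written in any case */
divs = LOAD 'NYSE_dividends' AS (symbol:chararray, dividend:double);
-- 'divs' and 'Divs' would be two different relations: names are case sensitive
big = filter divs by dividend > 1.0;   -- each step yields a new relation
```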

More examples
No Pig schema:
daily = load ‘NYSE_daily’;
calcs = foreach daily generate $7/100.0, SUBSTRING($0,0,1), $6-$3;
Here the – (subtraction) operator works only on numeric types in Pig.
No-schema filter:
daily = load ‘NYSE_daily’;
fltrd = filter daily by $6 > $3;
Here > is allowed for numeric, bytearray, or chararray; Pig is going to guess the type!
Math (float cast):
daily = load ‘NYSE_daily’ as (exchange, symbol, date, open, high:float, low:float, close, volume:int, adj_close);
rough = foreach daily generate volume * close; -- will convert to float
Thus the free “typing” may have unintended consequences; be aware, Pig is sometimes stupid. For a more in-depth view, look also at how “casts” are done in Pig.

Load (input method)
Can easily interface to HBase: read from HBase with the using clause
– divs = load ‘NYSE_dividends’ using HBaseStorage();
– divs = load ‘NYSE_dividends’ using PigStorage();
– divs = load ‘NYSE_dividends’ using PigStorage(‘,’);
as clause
– daily = load ‘NYSE_daily’ as (exchange, symbol, date, open, high, low, close, volume);

Store & dump
Default is PigStorage (it writes tab-separated):
– store processed into ‘/data/example/processed’;
For comma-separated use:
– store processed into ‘/data/example/processed’ using PigStorage(‘,’);
Can write into HBase using HBaseStorage():
– store processed into ‘processed’ using HBaseStorage();
Dump is for interactive debugging and prototyping.
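A short sketch of the interactive debugging commands mentioned above (schema assumed from the earlier examples):

```pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
                                 date:chararray, dividend:double);
describe divs;   -- prints the schema of the relation
dump divs;       -- runs the pipeline and prints records to the console
```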

Relational operations
Allow you to transform data by sorting, grouping, joining, projecting, and filtering.
foreach supports an array of expressions; the simplest are constants and field references:
rough = foreach daily generate volume * close;
calcs = foreach daily generate $7/100.0, SUBSTRING($0,0,1), $6-$3;
UDFs (User Defined Functions) can also be used in expressions.
Filter operation (matches takes a Java regular expression):
cmsyms = filter divs by symbol matches ‘CM.*’;
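As a sketch of a function call inside a foreach expression, here using the built-in UPPER (field names assumed from the earlier dividends examples):

```pig
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
                                 date:chararray, dividend:double);
-- built-in or user-defined functions can appear wherever an expression can
pcts = foreach divs generate UPPER(symbol), dividend * 100.0;
```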

Operations (cntd.)
The group operation collects together records with the same key:
– grpd = group daily by stock;
– counts = foreach grpd generate group, COUNT(daily);
Can also group by multiple keys:
– grpd = group daily by (stock, exchange);
group forces the “reduce” phase of MR. Pig offers mechanisms for addressing data skew and unbalanced use of reducers (we will not worry about this now).

Order by
Strict total order… Example:
daily = load ‘NYSE_daily’ as (exchange, symbol, close, open, …);
bydate = order daily by date;
bydateandsymbol = order daily by date, symbol;
byclose = order daily by close desc, open;

More functions
distinct primitive: removes duplicate records.
limit:
divs = load ‘NYSE_dividends’;
first10 = limit divs 10;
sample:
divs = load ‘NYSE_dividends’;
some = sample divs 0.1;

More functions
parallel clause: sets the number of reducers.
daily = load ‘NYSE_daily’;
bysym = group daily by symbol parallel 10; -- 10 reducers
register makes UDF jars such as piggybank.jar available:
register ‘piggybank.jar’;
divs = load ‘NYSE_dividends’;
backwds = foreach divs generate Reverse(symbol);
Also: illustrate, describe, …

How do you use Pig?
To express the logical steps in big data analytics
For prototyping
For domain experts who don’t want to learn MR but want to do big data
For a one-time job that probably will not be repeated
As a quick demo of MR capabilities
Good for discussion of initial MR design & planning (group, order, etc.)
An excellent interface to a data warehouse