Hadoop Pig By Ravikrishna Adepu.


1 Hadoop Pig By Ravikrishna Adepu

2 Overview What is Pig? Motivation How is it being used
Data Model/Architecture Components Pig Latin By Example

3 What is Pig?
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

4 PigLatin
Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
Ease of programming.
Optimization opportunities.
Extensibility: users can create their own functions to do special-purpose processing.

5 Hadoop Pig Architecture
Client machine (Pig job submission) -> Pig-to-MapReduce transformations -> MapReduce jobs -> HDFS (Hadoop Distributed File System)

6 Features
Simple-to-understand data flow language for analysts familiar with scripting languages
Fast, iterative language with a strong MapReduce compilation engine
Rich, multivalued nested operations performed on large datasets

7 Pig v/s SQL
Pig: procedural; nested relational data model (no constraints on data types); schema is optional; scan-centric analytic workloads (no random reads or writes); limited query optimization.
SQL: declarative; flat relational data model (data is tied to a specific data type); schema is required; OLTP + OLAP workloads; significant opportunity for query optimization.

8 Pig procedural v/s SQL declarative
SQL:
    insert into ValuableClicksPerDMA
    select dma, count(*)
    from geoinfo join (
        select name, ipaddr
        from users join clicks on (users.name = clicks.user)
        where value > 0
    ) using ipaddr
    group by dma;
The Pig Latin for this will look like:
    Users = load 'users' as (name, age, ipaddr);
    Clicks = load 'clicks' as (user, url, value);
    ValuableClicks = filter Clicks by value > 0;
    UserClicks = join Users by name, ValuableClicks by user;
    Geoinfo = load 'geoinfo' as (ipaddr, dma);
    UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
    ByDMA = group UserGeo by dma;
    ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
    store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

9 Motivation behind Pig
Challenges:
MapReduce requires a Java programmer
MapReduce can require multiple stages to reach a solution
Users have to reinvent common functionality (join, filter, etc.)
Long development cycle with rigorous testing stages

10 Solution:
Opens the system to users familiar with PHP, Ruby, and Python
4 hours in Java -> 15 minutes in Pig Latin
Provides common operations such as join, group, filter, and sort
Pig Latin increases productivity by a factor of 10

11 How is Pig being used
Web log processing
Data processing for web search platforms
Ad hoc queries across large data sets
Rapid prototyping of algorithms for large data sets
Quick fact: 70% of production jobs at Yahoo Inc. use Hadoop Pig

12 Pig Processing:
Grunt, the Pig shell
Submit a script directly
PigServer Java class, a JDBC-like interface
PigPen, which allows textual and graphical scripting, samples data, and shows an example data flow

13 Components:
Pig resides on the user machine
No need to install anything extra on the cluster
Jobs are submitted to the cluster and executed on the cluster

14 First look at the program :
Let’s first look at the programming language itself so that you can see how it’s significantly easier than having to write mapper and reducer programs. The first step in a Pig program is to LOAD the data you want to manipulate from HDFS. Then you run the data through a set of transformations (which, under the covers, are translated into a set of mapper and reducer tasks). Finally, you DUMP the data to the screen or you STORE the results in a file somewhere.
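The load, transform, then dump-or-store flow described above can be sketched in a few lines of Pig Latin. This is only an illustration: the input path 'input/visits.txt', its tab-separated layout, the field names, and the output directory are assumptions, not part of the original deck.

```pig
-- Hypothetical input: tab-separated lines of (user, url, seconds)
visits = LOAD 'input/visits.txt' AS (user:chararray, url:chararray, seconds:int);

-- Transformations: keep the longer visits, then count them per user
long_visits = FILTER visits BY seconds > 30;
by_user     = GROUP long_visits BY user;
counts      = FOREACH by_user GENERATE group AS user, COUNT(long_visits) AS n;

-- Either print the result to the screen ...
DUMP counts;
-- ... or write it to a (hypothetical) output directory
STORE counts INTO 'output/long_visit_counts';
```

Nothing runs until Pig reaches DUMP or STORE; the earlier statements only build up the plan.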

15 Starting grunt:
    cd /usr/share/doc/pig-0.11.0+44/examples/data
    ls
    pig -x local
You should see a prompt like:
    grunt>
We can run Pig in two modes:
Standalone mode (local mode)
Distributed mode (MapReduce mode)

16 Execution Modes
Pig has two execution modes:
Local mode: To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
MapReduce mode: To run Pig in MapReduce mode, you need access to a Hadoop cluster and HDFS installation. MapReduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).

17 Loading Data and Storing Final Results
Loading data: Use the LOAD operator and the load/store functions to read data into Pig (PigStorage is the default load function).
Storing final results: Use the STORE operator and the load/store functions to write results to the file system (PigStorage is the default store function).
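As a small sketch of LOAD and STORE with an explicit PigStorage delimiter: the file names 'students.csv' and 'students_out' and the schema are assumed for illustration. When the USING clause is omitted, PigStorage is used with a tab as the field delimiter.

```pig
-- Read comma-separated text (hypothetical file) with an explicit delimiter
students = LOAD 'students.csv' USING PigStorage(',')
           AS (name:chararray, age:int, gpa:float);

-- Write the relation back out, pipe-delimited, to a hypothetical directory
STORE students INTO 'students_out' USING PigStorage('|');
```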

18 Debugging Pig Latin
Pig Latin provides operators that can help you debug your Pig Latin statements:
Use the DUMP operator to display results to your terminal screen.
Use the DESCRIBE operator to review the schema of a relation.
Use the EXPLAIN operator to view the logical, physical, or MapReduce execution plans to compute a relation.
Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
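A short sketch of the four debugging operators applied to one relation, for example at the grunt prompt. The 'students' file and its schema are borrowed from the construction examples later in the deck; the filter threshold is an arbitrary assumption.

```pig
A = LOAD 'students' AS (name:chararray, age:int, gpa:float);
B = FILTER A BY gpa > 3.0;

DESCRIBE B;    -- prints the schema of B
EXPLAIN B;     -- prints the logical, physical, and MapReduce plans for computing B
ILLUSTRATE B;  -- runs the statements on a small sample of the data and shows each step
DUMP B;        -- executes the plan and prints the tuples of B
```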

19 Pig Latin data types:
Basic data types: INT, LONG, FLOAT, DOUBLE, CHARARRAY, BYTEARRAY, BOOLEAN

20 Continued:
Complex data types: BAG, TUPLE, MAP
Cast syntax:
    {(data_type) | (tuple(data_type)) | (bag{tuple(data_type)}) | (map[])} field

21 Usage:
Cast operators enable you to cast or convert data from one type to another, as long as conversion is supported (see the table above). For example, suppose you have an integer field, myint, which you want to convert to a string. You can cast this field from int to chararray using (chararray) myint.
A field can be explicitly cast. Once cast, the field remains that type (it is not automatically cast back). In this example $0 is explicitly cast to int.
    B = FOREACH A GENERATE (int)$0 + 1;
Where possible, Pig performs implicit casts. In this example $0 is cast to int (regardless of underlying data) and $1 is cast to double.
    B = FOREACH A GENERATE $0 + 1, $
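The myint example mentioned above can be written out as a complete statement. This is a minimal sketch: the file name 'data' and the second field 'raw' are hypothetical. A field loaded without a declared type is a bytearray, which is why the explicit cast on raw is useful.

```pig
-- 'raw' has no declared type, so Pig treats it as bytearray
A = LOAD 'data' AS (myint:int, raw);

-- Explicit casts: int -> chararray, bytearray -> double
B = FOREACH A GENERATE (chararray)myint AS mystr, (double)raw AS rawnum;

DESCRIBE B;  -- the schema now shows mystr as chararray and rawnum as double
```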

22 Tuple construction
    A = load 'students' as (name:chararray, age:int, gpa:float);
    B = foreach A generate (name, age);
    store B into 'results';
Input (students):
    joe smith  20  3.5
    amy chen   22  3.2
    leo allen  18  2.1
Output (results):
    (joe smith,20)
    (amy chen,22)
    (leo allen,18)

23 Bag construction
    A = load 'students' as (name:chararray, age:int, gpa:float);
    B = foreach A generate {(name, age)}, {name, age};
    store B into 'results';
Input (students):
    joe smith  20  3.5
    amy chen   22  3.2
    leo allen  18  2.1
Output (results):
    {(joe smith,20)} {(joe smith),(20)}
    {(amy chen,22)} {(amy chen),(22)}
    {(leo allen,18)} {(leo allen),(18)}

24 Map construction
    A = load 'students' as (name:chararray, age:int, gpa:float);
    B = foreach A generate [name, gpa];
    store B into 'results';
Input (students):
    joe smith  20  3.5
    amy chen   22  3.2
    leo allen  18  2.1
Output (results):
    [joe smith#3.5]
    [amy chen#3.2]
    [leo allen#2.1]
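The three constructions on the preceding slides can also be written with Pig's built-in TOTUPLE, TOBAG, and TOMAP functions instead of the bracket syntax. The sketch below assumes the same 'students' input; the output directory name is hypothetical.

```pig
A = LOAD 'students' AS (name:chararray, age:int, gpa:float);

B = FOREACH A GENERATE
        TOTUPLE(name, age) AS t,   -- one tuple, like (name, age)
        TOBAG(name, age)   AS b,   -- a bag with one single-field tuple per argument
        TOMAP(name, gpa)   AS m;   -- a map from name to gpa, like [name, gpa]

STORE B INTO 'results_functions';
```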

25 Pig Latin: UDF
Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing.
Functions can be a part of almost every operator in Pig.
UDF names are case sensitive.

26 UDF: Types
Eval functions (EvalFunc). Ex: StringConcat (built-in), which generates the concatenation of the first two fields of a tuple.
Aggregate functions (EvalFunc & Algebraic). Ex: COUNT, AVG (both built-in)
Filter functions (FilterFunc). Ex: IsEmpty (built-in)
Load/store functions (LoadFunc/StoreFunc). Ex: PigStorage (built-in)
Note: URL for built in functions:
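A brief sketch of how each kind of built-in function appears in a script. CONCAT is used here as the eval-function example in place of StringConcat from the slide; the 'enrollments' file and its fields are hypothetical. Function names are case sensitive, so COUNT cannot be written as count.

```pig
A = LOAD 'students' USING PigStorage()          -- load function
    AS (name:chararray, age:int, gpa:float);

-- Eval function: CONCAT joins two chararray values
B = FOREACH A GENERATE CONCAT(name, '!') AS shout;

-- Aggregate functions: COUNT and AVG work on the bag produced by GROUP
G = GROUP A BY age;
S = FOREACH G GENERATE group AS age, COUNT(A) AS n, AVG(A.gpa) AS mean_gpa;

-- Filter function: IsEmpty tests whether a bag is empty
E  = LOAD 'enrollments' AS (name:chararray, course:chararray);  -- hypothetical input
CG = COGROUP A BY name, E BY name;
NoCourses = FILTER CG BY IsEmpty(E);  -- students with no enrollment rows
```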

27 How It Works
    A = LOAD 'myfile' AS (x, y, z);
    B = FILTER A BY x > 0;
    C = GROUP B BY x;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO 'output';
pig.jar: parses, checks, optimizes, plans execution, submits jar to Hadoop, monitors job progress
Execution Plan
Map: Filter, Count
Combine/Reduce: Sum

28 Project: Word count using Hadoop Pig
Preparing a text file:
It’s definitely a little more interesting if you can work with some data you know or at least have an interest in. I used sample data provided by Cloudera for Hadoop Pig.

29 Import the file into the Sandbox
Go to the File Browser tab and upload the .txt file. Take note of the default location it is loading to (/user/hue).
Write a Pig script to parse the data and dump to a file:
    -- script starts here
    a = load '/user/hue/word_count_text.txt';
    b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
    c = group b by word;
    d = foreach c generate COUNT(b), group;
    store d into '/user/hue/pig_wordcount';
    /* multi line comments */
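One possible follow-up, not part of the original project, is to read the stored counts back and list the most frequent words. The sketch assumes the word-count script above has already written its output to '/user/hue/pig_wordcount' as tab-separated (count, word) pairs, which is what the default PigStorage produces for that script. The relation names and the cutoff of 10 are arbitrary.

```pig
-- Read the (count, word) pairs written by the word-count script
counts = LOAD '/user/hue/pig_wordcount' AS (n:long, word:chararray);

-- Sort by descending count and keep the first ten rows
sorted = ORDER counts BY n DESC;
top10  = LIMIT sorted 10;

DUMP top10;
```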

30 RESULTS

31 References http://en.wikipedia.org/wiki/Pig_(programming_tool)

32 QUESTIONS?

33 Thank you !

