1
04 | Processing Big Data with Pig
Graeme Malcolm | Data Technology Specialist, Content Master
Pete Harris | Learning Product Planner, Microsoft
2
Module Overview
What is Pig?
Pig Latin
Common Pig Latin Operations
Pig Latin and Map/Reduce
Using Pig in PowerShell
3
What is Pig?
Pig Latin statements perform a series of transformations on data relations.
Relations are loaded using schema-on-read semantics, which project a table structure onto the data at runtime.
Run Pig Latin statements interactively in the Grunt shell, or save them in a script file and run them as a batch.
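The schema-on-read idea above can be illustrated outside Pig. The following is a minimal Python sketch, not Pig itself: the `load` helper, the sample rows, and the date format are all hypothetical. It shows a schema being applied only when the data is read, with a failed cast producing a null value, which is also how Pig treats a field that cannot be converted to its declared type.

```python
import io

def load(stream, delimiter, schema):
    """Yield one dict per line, applying the schema only as rows are read."""
    for line in stream:
        fields = line.rstrip("\n").split(delimiter)
        row = {}
        for (name, cast), raw in zip(schema, fields):
            try:
                row[name] = cast(raw)
            except ValueError:
                # The stored file is untouched; only this projection is null.
                row[name] = None
        yield row

# Hypothetical sample data: the file itself carries no type information.
raw_file = io.StringIO("2013-01-01,12\n2013-01-02,14\n2013-01-03,n/a\n")

# The schema lives with the query, not with the data.
schema = [("date", str), ("temp", int)]

readings = list(load(raw_file, ",", schema))
for row in readings:
    print(row)
```

The same file could be read again with a different schema, which is the practical benefit of projecting structure at runtime rather than at load time.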
4
Pig Latin
,12 ,14 ,16 ,9 ,12 ...
-- Load comma-delimited source data. Default data type is bytearray, but temp is a long int
Readings = LOAD '/weather/data.txt' USING PigStorage(',') AS (date, temp:long);
-- Group the tuples by date
GroupedReadings = GROUP Readings BY date;
-- Get the average temp value for each date grouping
GroupedAvgs = FOREACH GroupedReadings GENERATE group, AVG(Readings.temp) AS avgtemp;
-- Ungroup the dates with the average temp
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) AS date, avgtemp;
-- Sort the results by date
SortedResults = ORDER AvgWeather BY date ASC;
-- Save the results in the /weather/summary folder
STORE SortedResults INTO '/weather/summary';
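To make the data flow of the script above concrete, here is a rough Python analogue of its GROUP, FOREACH ... GENERATE AVG, and ORDER steps. It is only a sketch of what the relations contain at each stage, not how Pig executes them, and the dates in the sample tuples are hypothetical because the slide's date values are not shown.

```python
from collections import defaultdict

# Hypothetical (date, temp) tuples standing in for the Readings relation.
readings = [
    ("2013-01-02", 16),
    ("2013-01-01", 12),
    ("2013-01-01", 14),
    ("2013-01-02", 9),
]

# GROUP Readings BY date: one entry per date, holding a bag of temps.
grouped_readings = defaultdict(list)
for date, temp in readings:
    grouped_readings[date].append(temp)

# FOREACH GroupedReadings GENERATE group, AVG(Readings.temp) AS avgtemp
grouped_avgs = [
    (date, sum(temps) / len(temps)) for date, temps in grouped_readings.items()
]

# ORDER AvgWeather BY date ASC
sorted_results = sorted(grouped_avgs, key=lambda row: row[0])

for date, avgtemp in sorted_results:
    print(date, avgtemp)
```

Each intermediate name mirrors a relation in the Pig Latin script, which is the main point: every statement produces a new relation from the previous one.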
5
Common Pig Latin Operations
LOAD
FILTER
FOREACH … GENERATE
ORDER
JOIN
GROUP
FLATTEN
LIMIT
DUMP
STORE
6
Pig Latin and Map/Reduce
Pig generates Map/Reduce code from Pig Latin
Map/Reduce jobs are generated on:
DUMP
STORE
Readings = LOAD '/weather/data.txt' USING PigStorage(',') AS (date, temp:long);
GroupedReadings = GROUP Readings BY date;
GroupedAvgs = FOREACH GroupedReadings GENERATE group, AVG(Readings.temp) AS avgtemp;
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) AS date, avgtemp;
SortedResults = ORDER AvgWeather BY date ASC;
STORE SortedResults INTO '/weather/summary'; -- Map/Reduce code generated here
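The deferred execution described above can be sketched in a few lines of Python. This is an illustrative model only: the `Relation` class and its `transform` and `store` methods are hypothetical names, and real Pig compiles a plan into Map/Reduce jobs rather than calling Python functions. The sketch shows that defining transformations does no work, and that the recorded steps run only when output is requested.

```python
class Relation:
    """Records transformations and defers all work until store() is called."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def transform(self, name, fn):
        # Return a new relation with one more recorded step; nothing runs yet.
        return Relation(self.source, self.steps + ((name, fn),))

    def store(self, log):
        # Only here is the recorded plan actually executed, step by step.
        data = list(self.source)
        for name, fn in self.steps:
            log.append(name)
            data = fn(data)
        return data

executed = []  # names of steps that have actually run

# Hypothetical (date, temp) tuples.
readings = Relation([("d1", 12), ("d1", 14), ("d2", 9)])
filtered = readings.transform(
    "FILTER", lambda rows: [r for r in rows if r[1] > 10]
)
ordered = filtered.transform(
    "ORDER", lambda rows: sorted(rows, key=lambda r: r[1], reverse=True)
)

steps_run_before_store = list(executed)  # still empty: only a plan exists
result = ordered.store(executed)         # the analogue of STORE or DUMP

print(steps_run_before_store, executed, result)
```

Deferring work this way lets the engine see the whole pipeline before running anything, which is what allows a sequence of statements to be planned as a set of jobs instead of one job per statement.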
7
Demo: Using Pig
In this demonstration, you will see how to:
Use the Grunt Shell to Run Pig Latin Statements
8
Using Pig in PowerShell
$jobDef = New-AzureHDInsightPigJobDefinition -Query $PigLatin
$pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Use the New-AzureHDInsightPigJobDefinition cmdlet to define Pig jobs
Use -Query to run explicit Pig Latin statements
Use -File to reference a script file
Run the job with the Start-AzureHDInsightJob cmdlet
9
Demo: Using Pig in PowerShell
In this demonstration, you will see how to:
View a Pig Latin Script
Use PowerShell to Run a Pig Job
View Output from a Pig Job
10
Module Summary
Pig Latin is an extensive language, and it is easier to use than writing custom Map/Reduce classes.
Pig suits scenarios where data can be processed as a series of transformations.
You can run Pig Latin statements and scripts from PowerShell.