MSBIC Hadoop Series Processing Data with Pig

Bryan Smith

MSBIC Hadoop Series
http://msbic.sqlpass.org/
Learn the basics of Hadoop through a combination of demonstration and lecture. Session participants are invited to follow along leveraging emulation environments and Azure-based clusters, the setting up of which we will address in our first session.
March: Getting Started
April: Understanding the File System
May: Implementing MapReduce Jobs
June: Querying the Data with Hive
July: On Vacation
August: Processing the Data with Pig
September: OOF
October: Hadoop & MS BI
November: TBD
December: TBD

Today’s Session
Objectives:
Understand the basics of Pig
Demonstrate use of Pig with a sample data set

Hadoop Ecosystem
File System (HDFS, WASB, etc.)
Job Execution (MapReduce, Tez, etc.)
Scripting (Pig)
Query (Hive)
Metadata Services (HCatalog)
Management & Monitoring (Ambari, ZooKeeper)
Workflow & Scheduling (Oozie)
Non-Relational Database (HBase)
Data Integration (Flume, Sqoop)
What was described is a very high-level simplification of what takes place, but hopefully it illustrates the basics of how Hadoop works and gives you a basis for understanding Hadoop as a distributed storage and processing platform. In addition, what was described was simply the core of Hadoop. Hadoop is a collection of projects, each of which adds functionality to the Hadoop ecosystem. Some of those projects are shown here, but this list is by no means complete.

Pig
Pig is a data flow engine for Hadoop
Data flows are scripted in Pig Latin

UFO Sightings Data Set
Fields: DateObserved, DateReported, Location, Shape, Duration, Description
In this demo, we will process the ufo_awesome.tsv file to get a count of sightings by year and UFO type.

Data Flow
[Diagram] Read Data File (Date Observed, Date Reported, Location, Shape, Duration, Description) -> Derive Year Observed (Year Observed, Shape) -> Count by Year Observed & Shape (Year Observed, Shape, Observations) -> Export Data to File

Data Flow
[Diagram] Read Data File -> Derive Year Observed -> Aggregate Around Year Observed & Shape -> Export Data to File
    ufo = load '/demo/ufo/in/ufo_awesome.tsv' as (dateobs:chararray, daterpt:chararray, location:chararray, shape:chararray, duration:chararray, description:chararray);
    ufo2 = foreach ufo generate SUBSTRING(dateobs,0,4) as yearobs, TRIM(shape) as shape;
    ufo3 = group ufo2 by (yearobs, shape);
    ufo4 = foreach ufo3 generate group, COUNT(ufo2) as sightings;
    ufo5 = foreach ufo4 generate group.yearobs as yearobs, group.shape as shape, sightings;
    dump ufo5;

Demo Script: Basic Data Flow
    ufo = load '/demo/ufo/in/ufo_awesome.tsv' as (dateobs:chararray, daterpt:chararray, location:chararray, shape:chararray, duration:chararray, description:chararray);
    ufo2 = foreach ufo generate SUBSTRING(dateobs,0,4) as yearobs, TRIM(shape) as shape;
    ufo3 = group ufo2 by (yearobs, shape);
    ufo4 = foreach ufo3 generate group, COUNT(ufo2) as sightings;
    ufo5 = foreach ufo4 generate group.yearobs as yearobs, group.shape as shape, sightings;
    dump ufo5;
Load assumes tab-delimited input by default; fields are loaded as bytearray (binary) if no type is provided
Function names are case sensitive
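
The two notes on the demo script can be made explicit in the script itself. The following is a minimal sketch, not the presenter's script: it names the default loader, folds the last two foreach steps into one, and stores the result instead of dumping it. The output folder name is hypothetical.

```pig
-- Same flow as the demo script, with the loader and delimiter spelled out.
-- PigStorage with a tab delimiter is what load uses when no "using" clause is given.
ufo = load '/demo/ufo/in/ufo_awesome.tsv' using PigStorage('\t')
      as (dateobs:chararray, daterpt:chararray, location:chararray,
          shape:chararray, duration:chararray, description:chararray);

-- Built-in function names are case sensitive: SUBSTRING and TRIM, not substring or trim.
ufo2 = foreach ufo generate SUBSTRING(dateobs, 0, 4) as yearobs, TRIM(shape) as shape;

ufo3 = group ufo2 by (yearobs, shape);

-- Pull the key fields out of the "group" tuple and count the bag in one step.
ufo4 = foreach ufo3 generate group.yearobs as yearobs, group.shape as shape,
                             COUNT(ufo2) as sightings;

-- Hypothetical output folder (an assumption); it must not already exist when the script runs.
store ufo4 into '/demo/ufo/out/sightings_by_year_shape' using PigStorage('\t');
```

Storing rather than dumping writes the result back to the file system as tab-delimited part files, which suits larger result sets than the console does.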

Statements
Data Input/Output: Load, Store, Dump
Projection: Foreach
Limit: Filter, Sample
Extension: Stream, Mapreduce
Schema/Workflow: Describe, Explain, Illustrate
Group, Cogroup, Order, Distinct, Join, Union, Cross
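
A few of these statements can be seen together in a short sketch. This is an illustrative script built on the demo data set, not part of the original deck; the relation names (ufo_f, shapes, uniq, sorted, top10) are made up for the example.

```pig
ufo = load '/demo/ufo/in/ufo_awesome.tsv'
      as (dateobs:chararray, daterpt:chararray, location:chararray,
          shape:chararray, duration:chararray, description:chararray);

-- Foreach: project and derive fields.
ufo2 = foreach ufo generate SUBSTRING(dateobs, 0, 4) as yearobs, TRIM(shape) as shape;

-- Filter: keep only rows where a shape was recorded.
ufo_f = filter ufo2 by shape is not null and shape != '';

-- Foreach + Distinct: the set of shapes that appear in the data.
shapes = foreach ufo_f generate shape;
uniq = distinct shapes;

-- Order + Limit: first ten shapes alphabetically.
sorted = order uniq by shape;
top10 = limit sorted 10;

-- Describe prints the schema of a relation; Dump prints its rows to the console.
describe top10;
dump top10;
```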

Bags & Tuples
Tuple: a data record with fields of various types
Example: (yearobs, shape)
Bag: a collection of tuples
Example: {(yearobs, shape), (yearobs, shape), (yearobs, shape)}
A field within a tuple can be a simple (scalar) data type or a complex type such as a bag

Group Statement
    ufo3 = group ufo2 by (yearobs, shape);
    ufo4 = foreach ufo3 generate group, COUNT(ufo2) as sightings;
A single tuple within the ufo3 relation has the form ( (“group” tuple), { “ufo2” bag } ), for example:
    ( (2001, egg), { (2001, egg), (2001, egg), (2001, egg) } )
    ( (2010, rectangle), { (2010, rectangle) } )
The group statement creates a bag of tuples for each distinct key. The key is referenced as “group”; the bag is referenced using the name of the relation from which its tuples were derived (here, ufo2).
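
To inspect the grouped structure and flatten it in one step, a sketch along the following lines may help. It is an alternative to the two-step projection used in the demo script, not the presenter's code, and the relation name counts is made up for the example.

```pig
ufo = load '/demo/ufo/in/ufo_awesome.tsv'
      as (dateobs:chararray, daterpt:chararray, location:chararray,
          shape:chararray, duration:chararray, description:chararray);
ufo2 = foreach ufo generate SUBSTRING(dateobs, 0, 4) as yearobs, TRIM(shape) as shape;

-- Each row of ufo3 holds the key tuple ("group") and a bag named after ufo2.
ufo3 = group ufo2 by (yearobs, shape);

-- Describe shows the nested schema: the group tuple plus the ufo2 bag.
describe ufo3;

-- FLATTEN un-nests the key tuple into top-level fields, so no second
-- foreach is needed to reach group.yearobs and group.shape.
counts = foreach ufo3 generate flatten(group) as (yearobs, shape),
                               COUNT(ufo2) as sightings;
dump counts;
```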

Pig vs. Hive
Hive: Query language; Accesses data in target folder; Struct, array & map types supported
Pig: Data flow (ETL) language; Accesses data in target folder or file; Tuple, map & bag types supported

Demo Script: Loading to HCatalog
    ufo = load '/demo/ufo/in/ufo_awesome.tsv' as (dateobs:chararray, daterpt:chararray, location:chararray, shape:chararray, duration:chararray, description:chararray);
    store ufo into 'ufo.sightings_pig' using org.apache.hcatalog.pig.HCatStorer();
NOTES
Launch Pig with the -useHCatalog option.
If on HDI Emulator, HCAT_HOME error resolved here
Be sure target table is empty
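
Once the table is populated, it can be read back through HCatalog as a quick check. The sketch below assumes the ufo.sightings_pig table exists and that the HCatalog build in use still exposes the org.apache.hcatalog.pig package used in the demo script; the relation names are made up for the example.

```pig
-- Launch Pig with the -useHCatalog option, as for the store above.
-- HCatLoader takes its schema from the table definition, so no "as" clause is needed.
sightings = load 'ufo.sightings_pig' using org.apache.hcatalog.pig.HCatLoader();

-- Confirm the column names and types HCatalog reports for the table.
describe sightings;

-- Look at a handful of rows rather than dumping the whole table.
sample_rows = limit sightings 10;
dump sample_rows;
```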

Resources


For Next Session
Topic: Integrating with MS BI
Requested Action(s):
Come with a working HDInsight Emulator or HDInsight Cluster
Load sample data sets into HDFS on the Emulator
Define HCatalog tables on sample data sets
