Big Data Analytics Training
ATI
Course Contents
Environment setup required: download the Pig tar.gz from the Apache Pig downloads page.
Pig & SQOOP
Session 1: Pig
Why another language or interface?
Introduction to Pig
What is Pig Latin?
Pig Architecture Components
Pig Configuration
Data Types
Common Query Algorithms
User Defined Functions
Pig Use Cases
Session 2: SQOOP
Introduction to SQOOP
Install and Configure SQOOP
Generate Code
SQOOP Import & Export
Hands On
Why Another Language?
MapReduce requires knowledge of, or expertise in, Java.
Common algorithms such as JOIN and FILTER need to be rewritten for every job.
Hive is helpful for analyzing structured data with SQL-style queries.
Pig makes Hadoop available to non-Java programmers.
What is Pig Latin?
A high-level language that abstracts the complexity of the Hadoop system from users.
A data-flow language, not a procedural language.
Provides common operators such as JOIN, GROUP, and SORT.
Can call existing user code or libraries for complex, non-regular algorithms.
Operates on files in HDFS.
Developed by Yahoo! for internal use, later contributed to the community and made open source; Yahoo! runs most of its jobs using Pig.
Pig Architecture
[Diagram: Pig scripts are entered through the Grunt shell (CLI), which produces MapReduce jobs that run on the Hadoop cluster.]
No need to install anything on the cluster: the Grunt shell converts Pig scripts into MapReduce programs and submits them to the cluster for execution.
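As a sketch of how this looks in practice (the script name is a placeholder), the same Pig script can be tried locally or submitted to the cluster:
pig -x local myscript.pig      # run against the local filesystem, no cluster needed
pig -x mapreduce myscript.pig  # compile to MapReduce jobs and submit to the cluster
pig                            # with no script argument, start the interactive Grunt shell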
Pig Configuration
Download and un-tar the Pig file under /home/hadoop:
tar -xzf pig-<version>.tar.gz
Configure the Pig paths:
export PIG_INSTALL=/home/hadoop/pig
export PATH=$PATH:$PIG_INSTALL/bin
Pig Configuration
Configure the Hadoop cluster details.
Using environment variables:
export PIG_HADOOP_VERSION="0.13.0"
export PIG_CLASSPATH=$HADOOP_INSTALL/conf
Or using the pig.properties file:
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021
Running Pig
pig
grunt> cust = LOAD 'retail.cust' AS (custid:chararray, firstname:chararray, lastname:chararray, age:long, profession:chararray);
grunt> lmt = LIMIT cust 10;
grunt> groupByProfession = GROUP cust BY profession;
grunt> countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);
grunt> DUMP countByProfession;
grunt> STORE countByProfession INTO 'output';
Pig Data Types – Scalar Types
INT – Signed 32-bit integer
LONG – Signed 64-bit integer
FLOAT – 32-bit floating point
DOUBLE – 64-bit floating point
CHARARRAY – Character array (string) in Unicode UTF-8
BYTEARRAY – Byte array (binary object)
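For illustration, a minimal sketch of a LOAD schema that uses these scalar types (the file and field names are hypothetical):
grunt> txns = LOAD 'retail.txns' AS (txnid:long, qty:int, discount:float, amount:double, product:chararray, receipt:bytearray);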
Pig Data Types – Complex
MAP – Associative array of key/value pairs
TUPLE – Ordered list of data. Ex: (1234, Kim Huston, 54)
BAG – Unordered collection of tuples. Ex: {(1234, Kim Huston, 54), (7634, Harry Slater, 41), (4355, Rod Stewart, 43, architect)}
Tuples in a bag are not required to have the same schema or the same number of fields, so bags can represent semi-structured or unstructured data.
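A minimal sketch of declaring and dereferencing the complex types (file and field names are assumptions; # looks up a map key and . projects a tuple field):
grunt> A = LOAD 'data' AS (info:map[], address:(street:chararray, city:chararray), pets:{t:(name:chararray)});
grunt> B = FOREACH A GENERATE info#'email', address.city;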
Pig – LOAD Data
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);
A = LOAD 'myfile.txt' USING PigStorage('\t');
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
Input file myfile.txt (tab-delimited):
1 2 3
4 2 1
8 3 4
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
Pig – STORE Data
Relation B:
(1,2,3)
(4,2,1)
(8,3,4)
STORE B INTO 'myoutput' USING PigStorage(',');
cat myoutput;
1,2,3
4,2,1
8,3,4
Pig – Grouping and Aggregation
GROUP alias BY field_alias [, field_alias ...]
Can group by multiple fields (see the sketch below).
The output has a key field named group.
grunt> groupByProfession = GROUP cust BY profession;
GENERATE controls the output:
grunt> countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);
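Grouping by multiple fields works the same way; the group key then becomes a tuple whose fields can be projected (a sketch reusing the cust relation from the earlier example):
grunt> byProfAge = GROUP cust BY (profession, age);
grunt> countByProfAge = FOREACH byProfAge GENERATE group.profession, group.age, COUNT(cust);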
Pig – Built-in functions
AVG, CONCAT, COUNT, DIFF, MAX, MIN, SIZE, SUM, TOKENIZE, IsEmpty
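As a sketch that combines several of these, the classic word count uses TOKENIZE and COUNT together with the FLATTEN operator (the input file name is assumed):
grunt> lines = LOAD 'input.txt' AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grouped = GROUP words BY word;
grunt> wordcount = FOREACH grouped GENERATE group, COUNT(words);
grunt> DUMP wordcount;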
Pig – Filtering and Ordering
FILTER selects or removes tuples based on a boolean expression:
grunt> teenagers = FILTER cust BY age < 20;
grunt> topProfessions = FILTER countByProfession BY count > 1000;
ORDER sorts tuples, ascending (ASC) or descending (DESC):
grunt> sorted = ORDER countByProfession BY count ASC;
Can sort by more than one field, as shown in the sketch below.
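A minimal sketch of a multi-field sort, reusing the cust relation (the field combination is illustrative):
grunt> sortedCust = ORDER cust BY profession ASC, age DESC;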
Pig – FOREACH and DISTINCT
FOREACH iterates through a relation:
grunt> countByProfession = FOREACH groupByProfession GENERATE group, COUNT(cust);
DISTINCT selects only distinct tuples, removing duplicate entries:
grunt> distinctProfession = DISTINCT groupByProfession;
Pig – GROUP
A = LOAD 'data' AS (f1:chararray, f2:int, f3:int);
DUMP A;
(r1,1,2)
(r2,2,1)
(r3,2,8)
(r4,4,4)
In this example, the tuples are grouped using an expression (f2*f3):
X = GROUP A BY f2*f3;
DUMP X;
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})
Pig – COGROUP
A = LOAD 'data1' AS (owner:chararray, pet:chararray);
DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)
B = LOAD 'data2' AS (friend1:chararray, friend2:chararray);
DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)
X = COGROUP A BY owner, B BY friend2;
DUMP X;
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
Pig – JOIN
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
Pig – UNION
DUMP A;
(1,2,3)
(4,2,1)
DUMP B;
(2,4)
(8,9)
(1,3)
X = UNION A, B;
DUMP X;
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)
Pig – SPLIT
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
Pig – SAMPLE
Creates a sample of a large data set.
Example: take 1% of the total data set:
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
X = SAMPLE A 0.01;
Pig – UDF
For logic that cannot be expressed in Pig.
Can be used for column transformation, filtering, ordering, and custom aggregation.
For example, you may want to write custom logic for interest or penalty calculation:
grunt> interest = FOREACH cust GENERATE custid, calculateInterest(custAcc);
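A hedged sketch of wiring up such a UDF; the jar, package, and class names are hypothetical, and the UDF itself would be written in Java by extending Pig's EvalFunc:
grunt> REGISTER myudfs.jar;
grunt> DEFINE calculateInterest com.example.pig.CalculateInterest();
grunt> interest = FOREACH cust GENERATE custid, calculateInterest(custAcc);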
Pig – Diagnostic Operators
DESCRIBE – Displays the schema of a relation.
EXPLAIN – Displays the execution plan used to compute a relation.
ILLUSTRATE – Displays, step by step, how data is transformed (starting with a LOAD) to arrive at the resulting relation; only a sample of the input data is used to simulate the execution.
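Applied to the relations built earlier in this deck, the diagnostic operators would be invoked like this:
grunt> DESCRIBE countByProfession;
grunt> EXPLAIN countByProfession;
grunt> ILLUSTRATE countByProfession;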
Pig – Writing Macros
DEFINE a macro:
DEFINE my_macro(A, sortkey) RETURNS C {
  B = FILTER $A BY my_filter(*);
  $C = ORDER B BY $sortkey;
}
X = my_macro(a, b);
STORE X INTO 'output';
IMPORT a macro into another macro or script:
IMPORT 'my_macro.pig';
Hadoop for the Production Environment
For development: standalone or pseudo-distributed mode.
For production: fully distributed mode.
Supported platform: Linux.
Can run on commodity hardware, but commodity does not mean cheap.
Can choose from a range of vendors.
Hadoop is designed to withstand failure, but too many failures can affect performance, so choose appropriate hardware.
Cluster Topology
High-end machines are recommended for the NameNode, as failure of this node can bring down the whole cluster.
Gigabit network connections are recommended for low-latency data transfer across the network.
[Diagram: Node 1 through Node n, each running a DataNode and a TaskTracker.]
Hadoop for the Production Environment
Several nodes per rack; single-rack or multi-rack configurations are possible.
For multiple racks, rack-to-node mapping needs to be configured:
so that Hadoop can schedule MapReduce tasks appropriately, and
so that replicas can be placed within the same rack where appropriate, avoiding transfers across racks.
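In Hadoop 1.x, the rack-to-node mapping is supplied through a topology script that maps each node's address to a rack ID; a sketch of the relevant property, in the same flat style as the earlier configuration examples (the script path is an assumption):
topology.script.file.name=/home/hadoop/conf/rack-topology.sh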
Cluster Configuration
Install Java and Hadoop on all nodes.
Create a single hadoop user across all nodes and configure passwordless SSH.
On the NameNode, configure:
masters – specifies which node will run the secondary NameNode
slaves – specifies which nodes will run the DataNode and TaskTracker
For the remaining node properties, create one master configuration and use tools like Chef and Puppet to configure all the nodes, or use NFS to store and share the configuration files.
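A sketch of the masters and slaves files under the Hadoop conf directory (hostnames are placeholders):
conf/masters (node that runs the secondary NameNode):
snn.example.com
conf/slaves (nodes that run the DataNode and TaskTracker daemons):
node1.example.com
node2.example.com
node3.example.com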