CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.

Slides:



Advertisements
Similar presentations
Introduction to Apache HIVE
Advertisements

Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh.
Hui Li Pig Tutorial Hui Li Some material adapted from slides by Adam Kawa the 3rd meeting of WHUG June 21, 2012.
Hadoop Pig By Ravikrishna Adepu.
© Hortonworks Inc Daniel Dai Thejas Nair Page 1 Making Pig Fly Optimizing Data Processing on Hadoop.
CS525: Special Topics in DBs Large-Scale Data Management HBase Spring 2013 WPI, Mohamed Eltabakh 1.
Working with pig Cloud computing lecture. Purpose  Get familiar with the pig environment  Advanced features  Walk though some examples.
High Level Language: Pig Latin Hui Li Judy Qiu Some material adapted from slides by Adam Kawa the 3 rd meeting of WHUG June 21, 2012.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture VII: 2014/04/21.
Pig Contributors Workshop Agenda Introductions What we are working on Usability Howl TLP Lunch Turing Completeness Workflow Fun (Bocci ball)
Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olsten, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Acknowledgement.
Hive - A Warehousing Solution Over a Map-Reduce Framework.
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Hive: A data warehouse on Hadoop
(Hadoop) Pig Dataflow Language B. Ramamurthy Based on Cloudera’s tutorials and Apache’s Pig Manual 6/27/2015.
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
A warehouse solution over map-reduce framework Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff.
Raghav Ayyamani. Copyright Ellis Horowitz, Why Another Data Warehousing System? Problem : Data, data and more data Several TBs of data everyday.
Hive – A Warehousing Solution Over a Map-Reduce Framework Presented by: Atul Bohara Feb 18, 2014.
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
HADOOP ADMIN: Session -2
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Pig Acknowledgement: Modified slides from Duke University 04/13/10 Cloud Computing Lecture.
Hive : A Petabyte Scale Data Warehouse Using Hadoop
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Cloud Computing Other High-level parallel processing languages Keke Chen.
NoSQL continued CMSC 461 Michael Wilson. MongoDB  MongoDB is another NoSQL solution  Provides a bit more structure than a solution like Accumulo  Data.
Big Data Analytics Training
Pig Latin CS 6800 Utah State University. Writing MapReduce Jobs Higher order functions Map applies a function to a list Example list [1, 2, 3, 4] Want.
Hadoop Basics -Venkat Cherukupalli. What is Hadoop? Open Source Distributed processing Large data sets across clusters Commodity, shared-nothing servers.
Introduction to Hadoop and HDFS
Distributed Systems Fall 2014 Zubair Amjad. Outline Motivation What is Sqoop? How Sqoop works? Sqoop Architecture Import Export Sqoop Connectors Sqoop.
Hive Facebook 2009.
Storage and Analysis of Tera-scale Data : 2 of Database Class 11/24/09
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
An Introduction to HDInsight June 27 th,
A NoSQL Database - Hive Dania Abed Rabbou.
Presented by Priagung Khusumanegara Prof. Kyungbaek Kim
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
Hive. What is Hive? Data warehousing layer on top of Hadoop – table abstractions SQL-like language (HiveQL) for “batch” data processing SQL is translated.
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
Apache Pig CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook.
What is Pig ???. Why Pig ??? MapReduce is difficult to program. It only has two phases. Put the logic at the phase. Too many lines of code even for simple.
MapReduce Compilers-Apache Pig
HIVE A Warehousing Solution Over a MapReduce Framework
Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.
Pig, Making Hadoop Easy Alan F. Gates Yahoo!.
Hadoop.
HADOOP ADMIN: Session -2
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
A Warehousing Solution Over a Map-Reduce Framework
Getting Data into Hadoop
Hive Mr. Sriram
Pig Latin - A Not-So-Foreign Language for Data Processing
Hadoop EcoSystem B.Ramamurthy.
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Pig Latin: A Not-So-Foreign Language for Data Processing
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Server & Tools Business
Slides borrowed from Adam Shook
Pig - Hive - HBase - Zookeeper
CSE 491/891 Lecture 21 (Pig).
CSE 491/891 Lecture 24 (Hive).
Charles Tappert Seidenberg School of CSIS, Pace University
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
(Hadoop) Pig Dataflow Language
(Hadoop) Pig Dataflow Language
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Pig Hive HBase Zookeeper
Presentation transcript:

CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1

Hadoop Ecosystem 2 We covered these Next week we cover more of these

Query Languages for Hadoop Java: Hadoop’s Native Language Pig: Query and Workflow Language Hive: SQL-Based Language HBase: Column-oriented Database for MapReduce 3

Java is Hadoop’s Native Language Hadoop itself is written in Java Provided Java APIs For mappers, reducers, combiners, partitioners Input and output formats Other languages, e.g., Pig or Hive, convert their queries to Java MapReduce code 4

Levels of Abstraction 5 More Hadoop visible Less Hadoop visible More map-reduce view More DB view

Java Example 6 map reduce Job conf.

Apache Pig 7

What is Pig A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. Compiles down to MapReduce jobs Developed by Yahoo! Open-source language 8

High-Level Language 9 raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query); clean1 = FILTER raw BY id > 20 AND id < 100; clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) as query; user_groups = GROUP clean2 BY (user, query); user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time); STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');

Pig Components High-level language (Pig Latin) Set of commands 10 Two execution modes Local: reads/write to local file system Mapreduce: connects to Hadoop cluster and reads/writes to HDFS Two Main Components Two modes Interactive mode Console Batch mode Submit a script

Why Pig?...Abstraction! Common design patterns as key words (joins, distinct, counts) Data flow analysis A script can map to multiple map-reduce jobs Avoids Java-level errors (not everyone can write java code) Can be interactive mode Issue commands and get results 11

Example I: More Details 12 raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query); clean1 = FILTER raw BY id > 20 AND id < 100; clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) as query; user_groups = GROUP clean2 BY (user, query); user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time); STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(','); Read file from HDFS The input format (text, tab delimited) Define run-time schema Filter the rows on predicates For each row, do some transformation Grouping of records Compute aggregation for each group Store the output in a file Text, Comma delimited

Pig: Language Features Keywords Load, Filter, Foreach Generate, Group By, Store, Join, Distinct, Order By, … Aggregations Count, Avg, Sum, Max, Min Schema Defines at query-time not when files are loaded UDFs Packages for common input/output formats 13

Example 2 A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int); B = group A by name parallel 10; C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, AVG(A.c2) as c2; D = filter C by c0 > 100 and c1 > 100 and c2 > 100; store D into '$out'; 14 Script can take arguments Data are “ctrl-A” delimited Define types of the columns Specify the need of 10 reduce tasks

Example 3: Re-partition Join register pigperf.jar; A = load ‘page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, timestamp, estimated_revenue); B = foreach A generate user, (double) estimated_revenue; alpha = load ’users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); beta = foreach alpha generate name, city; C = join beta by name, B by user parallel 40; D = group C by $0; E = foreach D generate group, SUM(C.estimated_revenue); store E into 'L3out'; 15 Register UDFs & custom inputformats Function the jar file to read the input file Group after the join (can reference columns by position) Join the two datasets (40 reducers) Load the second file

Example 4: Replicated Join register pigperf.jar; A = load ‘page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, timestamp, estimated_revenue); Big = foreach A generate user, (double) estimated_revenue; alpha = load ’users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); small = foreach alpha generate name, city; C = join Big by user, small by name using ‘replicated’; store C into ‘out'; 16 Map-only join (the small dataset is the second)

A = LOAD 'data' AS (f1:int,f2:int,f3:int); DUMP A; (1,2,3) (4,5,6) (7,8,9) SPLIT A INTO X IF f1 6); DUMP X; (1,2,3) (4,5,6) DUMP Y; (4,5,6) STORE x INTO 'x_out'; STORE y INTO 'y_out'; STORE z INTO 'z_out'; Example 5: Multiple Outputs 17 Split the records into sets Dump command to display the data Store multiple outputs

Run independent jobs in parallel 18 D1 = load 'data1' … D2 = load 'data2' … D3 = load 'data3' … C1 = join D1 by a, D2 by b C2 = join D1 by c, D3 by d C1 and C2 are two independent jobs that can run in parallel

Pig Latin: CoGroup Combination of join and group by Make use of Pig nested structure 19

Pig Latin vs. SQL Pig Latin is procedural (dataflow programming model) Step-by-step query style is much cleaner and easier to write SQL is declarative but not step-by-step style 20 SQL Pig Latin

Pig Latin vs. SQL In Pig Latin Lazy evaluation (data not processed prior to STORE command) Data can be stored at any point during the pipeline Schema and data types are lazily defined at run-time An execution plan can be explicitly defined Use optimizer hints Due to the lack of complex optimizers In SQL: Query plans are solely decided by the system Data cannot be stored in the middle Schema and data types are defined at the creation time 21

Pig Compilation 22

Logic Plan A=LOAD 'file1' AS (x, y, z); B=LOAD 'file2' AS (t, u, v); C=FILTER A by y > 0; D=JOIN C BY x, B BY u; E=GROUP D BY z; F=FOREACH E GENERATE group, COUNT(D); STORE F INTO 'output'; LOAD FILTER LOAD JOIN GROUP FOREACH STORE

Physical Plan 1:1 correspondence with the logical plan Except for: Join, Distinct, (Co)Group, Order Several optimizations are done automatically 24

Generation of Physical Plans 25

Java vs. Pig 26

Pig References Pig Tutorial Pig Latin Reference Manual 2 Pig Latin Reference Manual 2 PigMix Queries 27

Apache Pig 28

Hive 29

Apache Hive A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis Hive Provides ETL Structure Access to different storage (HDFS or HBase) Query execution via MapReduce Key Building Principles – SQL is a familiar language – Extensibility – Types, Functions, Formats, Scripts – Performance 30

Hive Components 31 High-level language (HiveQL) Set of commands Two execution modes Local: reads/write to local file system Mapreduce: connects to Hadoop cluster and reads/writes to HDFS Two Main Components Two modes Interactive mode Console Batch mode Submit a script

Hive deals with Structured Data Data Units Databases Tables Partitions Buckets (or clusters) 32

Hive DDL Commands 33 CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING); SHOW TABLES '.*s'; DESCRIBE sample; ALTER TABLE sample ADD COLUMNS (new_col INT); DROP TABLE sample; Schema is known at creation time (like DB schema) Partitioned tables have “sub-directories”, one for each partition

Hive DML LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample; LOAD DATA INPATH '/user/falvariz/hive/sample.txt’ INTO TABLE partitioned_sample PARTITION (ds=' '); 34 Load data from local file system Delete previous data from that table Load data from HDFSAugment to the existing data Must define a specific partition for partitioned tables

Hive Components 35 Hive CLI: Hive Client Interface MetaStore: For storing the schema information, data types, partitioning columns, etc… Hive QL: The query language, compiler, and executer

Data Model 3-Levels: Tables  Partitions  Buckets Table: maps to a HDFS directory Table R: Users all over the world Partition: maps to sub-directories under the table Partition R by country name It is the user’s responsibility to upload the right data to the right partition Bucket: maps to files under each partition Divide a partition into buckets based on a hash function on a certain column(s) 36

Data Model (Cont’d) 37

Query Examples I: Select & Filter SELECT foo FROM sample WHERE ds=' '; INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds=' '; INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample- out' SELECT * FROM sample; 38 Create HDFS dir for the output Create local dir for the output

Query Examples II: Aggregation & Grouping SELECT MAX (foo) FROM sample; SELECT ds, COUNT (*), SUM (foo) FROM sample GROUP BY ds; FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar; 39 Hive allows the From clause to come first !!! Store the results into a table

Query Examples III: Multi-Insertion FROM page_view_stg pvs INSERT OVERWRITE TABLE page_view PARTITION (dt=' ', country='US') SELECT pvs.viewTime, … WHERE pvs.country = 'US' INSERT OVERWRITE TABLE page_view PARTITION (dt=' ', country='CA') SELECT pvs.viewTime,... WHERE pvs.country = 'CA' INSERT OVERWRITE TABLE page_view PARTITION (dt=' ', country='UK') SELECT pvs.viewTime,... WHERE pvs.country = 'UK'; 40

Example IV: Joins 41 CREATE TABLE customer (id INT,name STRING,address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#'; CREATE TABLE order_cust (id INT,cus_id INT,prod_id INT,price INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; SELECT * FROM customer c JOIN order_cust o ON (c.id=o.cus_id); SELECT c.id, c.name, c.address, ce.exp FROM customer c JOIN (SELECT cus_id,sum(price) AS exp FROM order_cust GROUP BY cus_id ) ce ON (c.id=ce.cus_id);