CS525: Big Data Analytics - MapReduce Languages (Fall 2013, Elke A. Rundensteiner)

Languages for Hadoop
- Java: Hadoop's native language
- Pig (Yahoo): query and workflow language, for unstructured data
- Hive (Facebook): SQL-based language, for structured data / data warehousing

Java is Hadoop's Native Language
- Hadoop itself is written in Java
- Provides Java APIs for mappers, reducers, combiners, partitioners, and input/output formats
- Other languages, e.g., Pig or Hive, convert their queries to Java MapReduce code
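
To make the native API concrete, here is a minimal word-count mapper and reducer sketch against the standard org.apache.hadoop.mapreduce API; the class names are illustrative and the job/driver setup is omitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in each input line.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

// Reducer: sums the counts emitted for each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Even this simple job takes a few dozen lines of Java plus a driver class, which is exactly the verbosity that Pig and Hive aim to hide.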

Levels of Abstraction (diagram: the languages are placed on a spectrum ranging from a more MapReduce-like view to a more DB-like view)

Apache Pig

What is Pig?
- A high-level language and associated platform for expressing data analysis programs
- Compiles down to MapReduce jobs
- Developed by Yahoo!, but open source

Pig: High-Level Language

raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query);
clean1 = FILTER raw BY id > 20 AND id < 100;
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) AS query;
user_groups = GROUP clean2 BY (user, query);
user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);
STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');

Pig Components
Two main components:
- The high-level language (Pig Latin), a set of commands
- The execution environment, with two execution modes:
  Local: reads/writes to the local file system
  MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Two modes of use:
- Interactive mode: issue commands at a console
- Batch mode: submit a script

Why Pig?
- Common design patterns as keywords (joins, distinct, counts)
- Data flow analysis (a script can map to multiple MapReduce jobs)
- Avoids Java-level errors (for non-Java experts)
- Interactive mode (issue commands and get results)

Example

-- Read the file from HDFS; PigStorage('\t') gives the input format (text, tab delimited);
-- AS (...) defines the run-time schema
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query);
-- Filter the rows on predicates
clean1 = FILTER raw BY id > 20 AND id < 100;
-- For each row, do some transformation
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) AS query;
-- Group the records
user_groups = GROUP clean2 BY (user, query);
-- Compute an aggregation for each group
user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);
-- Store the output in a file (text, comma delimited)
STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');

Pig Language
- Keywords: LOAD, FILTER, FOREACH ... GENERATE, GROUP BY, STORE, JOIN, DISTINCT, ORDER BY, ...
- Aggregations: COUNT, AVG, SUM, MAX, MIN
- Schema: defined at query time, not when files are loaded
- Extension of logic: UDFs
- Data packages for common input/output formats
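
As an illustration of the UDF extension point, an eval function can be written in Java by extending Pig's EvalFunc base class; the class name and the normalization logic below are illustrative, but EvalFunc/Tuple is the standard Pig UDF interface:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A minimal Pig UDF: returns the trimmed, lower-cased form of a chararray field.
public class Normalize extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return input.get(0).toString().trim().toLowerCase();
  }
}

After REGISTER-ing the jar that contains it, the function can be called from Pig Latin by its (fully qualified) class name, just like a built-in function.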

A Parameterized Template

-- The script can take arguments: $widerow and $out are parameters
-- PigStorage('\u0001') means the data are "ctrl-A" delimited
-- The AS clause defines the types of the columns
A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int);
-- "parallel 10" specifies the need for 10 parallel (reduce) tasks
B = group A by name parallel 10;
C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, AVG(A.c2) as c2;
D = filter C by c0 > 100 and c1 > 100 and c2 > 100;
store D into '$out';

Run Independent Jobs in Parallel

D1 = load 'data1' …
D2 = load 'data2' …
D3 = load 'data3' …
C1 = join D1 by a, D2 by b
C2 = join D1 by c, D3 by d

C1 and C2 are two independent jobs that can run in parallel.

Pig Latin vs. SQL
- Pig Latin is a dataflow programming model (step-by-step)
- SQL is declarative (a set-based approach)
(The slide shows the same query written side by side in SQL and in Pig Latin.)

Pig Latin vs. SQL
In Pig Latin:
- An execution plan can be explicitly defined (user hints, but no clever optimizer)
- Data can be stored at any point during the pipeline
- Schema and data types are lazily defined at run-time
- Lazy evaluation (data are not processed prior to the STORE command)
In SQL:
- Query plans are decided by the system (powerful optimizer)
- Intermediate data are not stored (or at least are not user-accessible)
- Schema and data types are defined at creation time

Logical Plan

A = LOAD 'file1' AS (x, y, z);
B = LOAD 'file2' AS (t, u, v);
C = FILTER A BY y > 0;
D = JOIN C BY x, B BY u;
E = GROUP D BY z;
F = FOREACH E GENERATE group, COUNT(D);
STORE F INTO 'output';

(The slide shows the corresponding operator graph: the two LOADs and the FILTER feed into JOIN, followed by GROUP, FOREACH, and STORE.)

Physical Plan
- Mostly a 1:1 correspondence with the logical plan
- Some optimizations are available

Hive

Apache Hive (Facebook)
A data warehouse infrastructure built on top of Hadoop for providing data summarization, retrieval, and analysis.
Hive provides:
- Structure
- ETL
- Access to different storage (HDFS or HBase)
- Query execution via MapReduce
Key principles:
- SQL is a familiar language
- Extensibility: types, functions, formats, scripts
- Performance

Hive Components
Two main components:
- The high-level language (HiveQL), a set of commands
- The execution environment, with two execution modes:
  Local: reads/writes to the local file system
  MapReduce: connects to a Hadoop cluster and reads/writes to HDFS
Two modes of use:
- Interactive mode: issue commands at a console
- Batch mode: submit a script

Hive Data Model: Structured
Three levels: Tables → Partitions → Buckets
- Table: maps to an HDFS directory (e.g., table R: users all over the world)
- Partition: maps to sub-directories under the table (e.g., partition R by country name)
- Bucket: maps to files under each partition; a partition is divided into buckets based on a hash function
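
To make the bucketing idea concrete, here is a small illustrative Java sketch of hash-based bucket assignment; the class and method names are made up, and Hive's actual hash function differs in detail, but the principle of hash(bucketing column) modulo the number of buckets is the same:

// Illustrative only: assign a row to one of N bucket files by hashing its bucketing column.
public final class BucketAssigner {
  private final int numBuckets;

  public BucketAssigner(int numBuckets) {
    this.numBuckets = numBuckets;
  }

  // Rows with the same key always land in the same bucket of a partition.
  public int bucketFor(String bucketingColumnValue) {
    int hash = bucketingColumnValue.hashCode();
    return (hash & Integer.MAX_VALUE) % numBuckets;  // force a non-negative value before the modulo
  }

  public static void main(String[] args) {
    BucketAssigner assigner = new BucketAssigner(32);
    System.out.println(assigner.bucketFor("user_42"));  // prints a bucket id in [0, 32)
  }
}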

Hive DDL Commands

CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;

Notes:
- The schema is known at creation time (like a DB schema)
- Each table in Hive is an HDFS directory in Hadoop
- Partitioned tables have "sub-directories", one for each partition

HiveQL: Hive DML Commands

-- Load data from the local file system; OVERWRITE deletes the previous data in the table
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample;
-- Load data from HDFS and append it to the existing data;
-- a specific partition must be given for partitioned tables
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' INTO TABLE partitioned_sample PARTITION (ds=' ');

Query Examples

SELECT MAX(foo) FROM sample;

SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;

FROM sample s
INSERT OVERWRITE TABLE bar
SELECT s.bar, count(*)
WHERE s.foo > 0
GROUP BY s.bar;

SELECT * FROM customer c JOIN order_cust o ON (c.id = o.cus_id);

User-Defined Functions
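
The slide carries only the title, so as an illustrative sketch (not from the original deck): a simple Hive UDF can be written in Java by extending the classic UDF base class and providing an evaluate method. The class name and logic below are made up; org.apache.hadoop.hive.ql.exec.UDF is a standard Hive extension point:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A minimal Hive UDF: returns the lower-cased form of a string column.
public final class ToLowerUDF extends UDF {
  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    return new Text(input.toString().toLowerCase());
  }
}

After ADD JAR and CREATE TEMPORARY FUNCTION my_lower AS 'ToLowerUDF';, it can be used in HiveQL like any built-in function, e.g. SELECT my_lower(bar) FROM sample;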

Hadoop Streaming Utility
- Hadoop Streaming is a utility to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer
- Works with C, Python, Java, Ruby, C#, Perl, shell commands, ...
- The map and reduce steps can be written in different languages
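
The contract is simple: a streaming mapper or reducer reads lines from stdin and writes tab-separated key/value lines to stdout. As an illustrative sketch (class name and tokenization are made up), a word-count mapper written as a plain Java program would look like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// A streaming mapper: read input lines from stdin, emit "word<TAB>1" lines on stdout.
public class StreamingWordCountMapper {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      for (String token : line.trim().split("\\s+")) {
        if (!token.isEmpty()) {
          System.out.println(token + "\t1");  // key and value separated by a tab
        }
      }
    }
  }
}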

Summary: Languages
- Java: Hadoop's native language
- Pig (Yahoo): query/workflow language for unstructured data
- Hive (Facebook): SQL-based language for structured data