1
CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Languages Spring 2013 WPI, Mohamed Eltabakh 1
2
Hadoop Ecosystem 2 (diagram of the Hadoop ecosystem: the components covered so far and those to be covered next week)
3
Query Languages for Hadoop Java: Hadoop’s Native Language Pig: Query and Workflow Language Hive: SQL-Based Language HBase: Column-oriented Database for MapReduce 3
4
Java is Hadoop’s Native Language Hadoop itself is written in Java. It provides Java APIs for mappers, reducers, combiners, partitioners, and input and output formats. Other languages, e.g., Pig or Hive, convert their queries to Java MapReduce code. 4
5
Levels of Abstraction 5 (diagram: a spectrum from more Hadoop-visible, map-reduce-style views to less Hadoop-visible, DB-style views)
6
Java Example 6 (code slide: a map class, a reduce class, and the Job configuration)
7
Apache Pig 7
8
What is Pig? A platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs. Compiles down to MapReduce jobs. Developed by Yahoo! Open-source language. 8
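As a concrete illustration of such a data analysis program, here is the classic word-count job sketched in Pig Latin (the input path 'input.txt' and output path are placeholders, not from the slides):

```pig
-- Load raw text; each record is one line (assumed input path)
lines  = LOAD 'input.txt' AS (line:chararray);
-- Split each line into words and flatten the resulting bag of tokens
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together, then count each group
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';
```

Pig compiles this handful of lines into a complete MapReduce job: the GROUP becomes the shuffle, and the COUNT becomes the reduce-side aggregation.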
9
High-Level Language 9 raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query); clean1 = FILTER raw BY id > 20 AND id < 100; clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) as query; user_groups = GROUP clean2 BY (user, query); user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time); STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');
10
Pig Components 10 Two main components: the high-level language (Pig Latin) and a set of commands. Two execution modes: local (reads/writes the local file system) and MapReduce (connects to a Hadoop cluster and reads/writes HDFS). Two usage modes: interactive mode (console) and batch mode (submit a script).
11
Why Pig? Abstraction! Common design patterns as keywords (joins, distinct, counts). Data flow analysis: a single script can map to multiple map-reduce jobs. Avoids Java-level errors (not everyone can write Java code). Can be used interactively: issue commands and get results. 11
12
Example I: More Details 12 raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query); clean1 = FILTER raw BY id > 20 AND id < 100; clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) as query; user_groups = GROUP clean2 BY (user, query); user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time); STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(','); Step by step: read the file from HDFS with a tab-delimited text input format, defining the run-time schema; filter the rows on predicates; transform each row; group the records; compute aggregates for each group; store the output as comma-delimited text.
13
Pig: Language Features Keywords: Load, Filter, Foreach Generate, Group By, Store, Join, Distinct, Order By, … Aggregations: Count, Avg, Sum, Max, Min. Schema: defined at query time, not when files are loaded. UDFs. Packages for common input/output formats. 13
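A sketch of how these keywords compose in one script (the file names and schemas are hypothetical):

```pig
users  = LOAD 'users.txt'  USING PigStorage(',') AS (name:chararray, city:chararray);
orders = LOAD 'orders.txt' USING PigStorage(',') AS (name:chararray, amount:int);
-- JOIN on the shared key
joined = JOIN users BY name, orders BY name;
-- After a join, ambiguous columns are referenced with the :: prefix
citcol = FOREACH joined GENERATE users::city AS city;
cities = DISTINCT citcol;
-- GROUP plus an aggregate, then ORDER BY on the result
byname = GROUP orders BY name;
totals = FOREACH byname GENERATE group AS name, SUM(orders.amount) AS total;
ranked = ORDER totals BY total DESC;
```

Note that the schemas above are declared in the LOAD statements at query time, consistent with the point that Pig does not fix a schema when files are written.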
14
Example 2 A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int); B = group A by name parallel 10; C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, AVG(A.c2) as c2; D = filter C by c0 > 100 and c1 > 100 and c2 > 100; store D into '$out'; 14 The script can take arguments ($widerow, $out); the data are ctrl-A ('\u0001') delimited; the column types are declared in the schema; "parallel 10" requests 10 reduce tasks.
15
Example 3: Re-partition Join register pigperf.jar; A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, timestamp, estimated_revenue); B = foreach A generate user, (double) estimated_revenue; alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); beta = foreach alpha generate name, city; C = join beta by name, B by user parallel 40; D = group C by $0; E = foreach D generate group, SUM(C.estimated_revenue); store E into 'L3out'; 15 Register UDFs and custom input formats; the custom load function from the registered jar reads the input file; load the second file; join the two datasets (40 reducers); group after the join (columns can be referenced by position).
16
Example 4: Replicated Join register pigperf.jar; A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, timestamp, estimated_revenue); Big = foreach A generate user, (double) estimated_revenue; alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); small = foreach alpha generate name, city; C = join Big by user, small by name using 'replicated'; store C into 'out'; 16 Map-only join: the small dataset must be the second input and is replicated to every mapper.
17
A = LOAD 'data' AS (f1:int,f2:int,f3:int); DUMP A; (1,2,3) (4,5,6) (7,8,9) SPLIT A INTO X IF f1 < 7, Y IF f2 == 5, Z IF (f3 < 6 OR f3 > 6); DUMP X; (1,2,3) (4,5,6) DUMP Y; (4,5,6) STORE X INTO 'x_out'; STORE Y INTO 'y_out'; STORE Z INTO 'z_out'; Example 5: Multiple Outputs 17 SPLIT partitions the records into sets; the DUMP command displays the data; STORE writes multiple outputs.
18
Run independent jobs in parallel 18 D1 = load 'data1' … D2 = load 'data2' … D3 = load 'data3' … C1 = join D1 by a, D2 by b C2 = join D1 by c, D3 by d C1 and C2 are two independent jobs that can run in parallel
19
Pig Latin: CoGroup Combination of join and group by. Makes use of Pig's nested structure. 19
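A minimal COGROUP sketch (the datasets here are hypothetical). Unlike JOIN, which flattens matching records into cross-product tuples, COGROUP keeps one nested bag per input:

```pig
owners = LOAD 'owners.txt' AS (owner:chararray, pet:chararray);
visits = LOAD 'visits.txt' AS (owner:chararray, date:chararray);
-- One output tuple per key: (owner, {matching owners tuples}, {matching visits tuples})
grouped = COGROUP owners BY owner, visits BY owner;
```

A plain JOIN can be seen as a COGROUP followed by flattening the two bags, and a GROUP is just a COGROUP over a single relation.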
20
Pig Latin vs. SQL Pig Latin is procedural (a dataflow programming model); its step-by-step query style is much cleaner and easier to write. SQL is declarative and does not follow a step-by-step style. 20
21
Pig Latin vs. SQL In Pig Latin: lazy evaluation (data are not processed prior to the STORE command); data can be stored at any point in the pipeline; schema and data types are lazily defined at run-time; an execution plan can be explicitly defined using optimizer hints (due to the lack of a complex optimizer). In SQL: query plans are decided solely by the system; data cannot be stored in the middle of a query; schema and data types are defined at creation time. 21
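A sketch of the first two points, assuming a hypothetical 'log' input. No statement executes until a STORE (or DUMP) needs its result, and intermediate relations can be materialized anywhere in the pipeline:

```pig
raw   = LOAD 'log' AS (user:chararray, query:chararray);
clean = FILTER raw BY query IS NOT NULL;   -- nothing has executed yet
STORE clean INTO 'clean_out';              -- materializes the intermediate result
grp    = GROUP clean BY user;
counts = FOREACH grp GENERATE group, COUNT(clean);
STORE counts INTO 'counts_out';            -- triggers the rest of the pipeline
```

In SQL the analogue would require explicitly creating and populating a temporary table between the two queries.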
22
Pig Compilation 22
23
Logical Plan A=LOAD 'file1' AS (x, y, z); B=LOAD 'file2' AS (t, u, v); C=FILTER A by y > 0; D=JOIN C BY x, B BY u; E=GROUP D BY z; F=FOREACH E GENERATE group, COUNT(D); STORE F INTO 'output'; (plan DAG: the two LOADs feed FILTER and JOIN, followed by GROUP, FOREACH, and STORE)
24
Physical Plan 1:1 correspondence with the logical plan Except for: Join, Distinct, (Co)Group, Order Several optimizations are done automatically 24
25
Generation of Physical Plans 25
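To inspect how Pig turns a script into its logical, physical, and MapReduce plans, the EXPLAIN command can be used on any alias (a sketch; the input file is hypothetical):

```pig
A = LOAD 'file1' AS (x:int, y:int);
B = FILTER A BY y > 0;
-- Prints the logical, physical, and MapReduce plans for B without running the job
EXPLAIN B;
```

This is the practical way to see the 1:1 logical-to-physical correspondence described above, and to check how many MapReduce jobs a script will generate.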
26
Java vs. Pig 26
27
Pig References Pig Tutorial http://pig.apache.org/docs/r0.7.0/tutorial.html Pig Latin Reference Manual 1 http://pig.apache.org/docs/r0.7.0/piglatin_ref1.html Pig Latin Reference Manual 2 http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html PigMix Queries https://cwiki.apache.org/PIG/pigmix.html 27
29
Hive 29
30
Apache Hive A data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. Hive provides: ETL; structure; access to different storage systems (HDFS or HBase); query execution via MapReduce. Key building principles: SQL is a familiar language; extensibility (types, functions, formats, scripts); performance. 30
31
Hive Components 31 Two main components: the high-level language (HiveQL) and a set of commands. Two execution modes: local (reads/writes the local file system) and MapReduce (connects to a Hadoop cluster and reads/writes HDFS). Two usage modes: interactive mode (console) and batch mode (submit a script).
32
Hive Deals with Structured Data Data units: databases, tables, partitions, buckets (or clusters). 32
33
Hive DDL Commands 33 CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING); SHOW TABLES '.*s'; DESCRIBE sample; ALTER TABLE sample ADD COLUMNS (new_col INT); DROP TABLE sample; Schema is known at creation time (like DB schema) Partitioned tables have “sub-directories”, one for each partition
34
Hive DML LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample; LOAD DATA INPATH '/user/falvariz/hive/sample.txt' INTO TABLE partitioned_sample PARTITION (ds='2012-02-24'); 34 The first statement loads data from the local file system and deletes the table's previous data (OVERWRITE); the second loads data from HDFS and appends to the existing data. A specific partition must be named for partitioned tables.
35
Hive Components 35 Hive CLI: the Hive client interface. MetaStore: stores the schema information, data types, partitioning columns, etc. HiveQL: the query language, compiler, and executor.
36
Data Model 3 levels: tables, partitions, buckets. Table: maps to an HDFS directory; e.g., a table R of users all over the world. Partition: maps to sub-directories under the table; e.g., partition R by country name (it is the user's responsibility to load the right data into the right partition). Bucket: maps to files under each partition; a partition is divided into buckets by a hash function on certain column(s). 36
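All three levels can be declared together in the table DDL; a hedged sketch (the table and column names here are illustrative, not from the slides):

```sql
-- Table -> HDFS directory; partitions -> sub-directories; buckets -> files
CREATE TABLE page_views (view_time INT, user_id BIGINT, url STRING)
PARTITIONED BY (country STRING)           -- one sub-directory per country value
CLUSTERED BY (user_id) INTO 32 BUCKETS    -- hash(user_id) mod 32 selects the file
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```

Partition columns (here country) are not stored in the data files themselves; they are encoded in the directory names, which is why loads into a partitioned table must name a partition.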
37
Data Model (Cont’d) 37
38
Query Examples I: Select & Filter SELECT foo FROM sample WHERE ds='2012-02-24'; INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24'; INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample; 38 The second query creates an HDFS directory for the output; the third creates a local directory for the output.
39
Query Examples II: Aggregation & Grouping SELECT MAX (foo) FROM sample; SELECT ds, COUNT (*), SUM (foo) FROM sample GROUP BY ds; FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar; 39 Note that Hive allows the FROM clause to come first! The last query stores its results into a table.
40
Query Examples III: Multi-Insertion FROM page_view_stg pvs INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='US') SELECT pvs.viewTime, … WHERE pvs.country = 'US' INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='CA') SELECT pvs.viewTime,... WHERE pvs.country = 'CA' INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='UK') SELECT pvs.viewTime,... WHERE pvs.country = 'UK'; 40
41
Example IV: Joins 41 CREATE TABLE customer (id INT,name STRING,address STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#'; CREATE TABLE order_cust (id INT,cus_id INT,prod_id INT,price INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; SELECT * FROM customer c JOIN order_cust o ON (c.id=o.cus_id); SELECT c.id, c.name, c.address, ce.exp FROM customer c JOIN (SELECT cus_id,sum(price) AS exp FROM order_cust GROUP BY cus_id ) ce ON (c.id=ce.cus_id);