Raghav Ayyamani. Copyright Ellis Horowitz, 2011 - 20122 Why Another Data Warehousing System? Problem : Data, data and more data Several TBs of data everyday.

Slides:



Advertisements
Similar presentations
Introduction to Apache HIVE
Advertisements

Hive Index Yongqiang He Software Engineer Facebook Data Infrastructure Team.
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.
Hive - A Warehousing Solution Over a Map-Reduce Framework.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
O’Reilly – Hadoop: The Definitive Guide Ch.5 Developing a MapReduce Application 2 July 2010 Taewhi Lee.
Reynold Xin Shark: Hive (SQL) on Spark. Stage 0: Map-Shuffle-Reduce Mapper(row) { fields = row.split("\t") emit(fields[0], fields[1]); } Reducer(key,
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Clydesdale: Structured Data Processing on MapReduce Jackie.
Introduction to Hive Liyin Tang
Hive: A data warehouse on Hadoop
Chapter 7 Managing Data Sources. ASP.NET 2.0, Third Edition2.
HIVE Data Warehousing & Analytics on Hadoop Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team.
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
A warehouse solution over map-reduce framework Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff.
Hive – A Warehousing Solution Over a Map-Reduce Framework Presented by: Atul Bohara Feb 18, 2014.
SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL.
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
HADOOP ADMIN: Session -2
Processing Data using Amazon Elastic MapReduce and Apache Hive Team Members Frank Paladino Aravind Yeluripiti.
DLRL Cluster Matt Bollinger, Joseph Pontani, Adam Lech Client: Sunshin Lee CS4624 Capstone Project March 3, 2014 Virginia Tech, Blacksburg, VA.
Analytics Map Reduce Query Insight Hive Pig Hadoop SQL Map Reduce Business Intelligence Predictive Operational Interactive Visualization Exploratory.
Hive : A Petabyte Scale Data Warehouse Using Hadoop
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Cloud Computing Other High-level parallel processing languages Keke Chen.
Whirlwind tour of Hadoop Inspired by Google's GFS Clusters from systems Batch Processing High Throughput Partition-able problems Fault Tolerance.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Penwell Debug Intel Confidential BRIEF OVERVIEW OF HIVE Jonathan Brauer ESE 380L Feb
Distributed Systems Fall 2014 Zubair Amjad. Outline Motivation What is Sqoop? How Sqoop works? Sqoop Architecture Import Export Sqoop Connectors Sqoop.
Hive Facebook 2009.
QMapper for Smart Grid: Migrating SQL-based Application to Hive Yue Wang, Yingzhong Xu, Yue Liu, Jian Chen and Songlin Hu SIGMOD’15, May 31–June 4, 2015.
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
An Introduction to HDInsight June 27 th,
Hive – A Warehousing Solution Over a MapReduce Framework Bingbing Liu
A NoSQL Database - Hive Dania Abed Rabbou.
Hive – SQL on top of Hadoop
SLIDE 1IS 257 – Fall 2014 NewSQL and VoltDB University of California, Berkeley School of Information IS 257: Database Management.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Hive. What is Hive? Data warehousing layer on top of Hadoop – table abstractions SQL-like language (HiveQL) for “batch” data processing SQL is translated.
IBM Research ® © 2007 IBM Corporation A Brief Overview of Hadoop Eco-System.
Impala. Impala: Goals General-purpose SQL query engine for Hadoop High performance – C++ implementation – runtime code generation (using LLVM) – direct.
Basics of JDBC Session 14.
CP476 Internet Computing Perl CGI and MySql 1 Relational Databases –A database is a collection of data organized to allow relatively easy access for retrievals,
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
HIVE Data Warehousing & Analytics on Hadoop Facebook Data Team.
Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.
Apache Hive CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook.
1 Copyright © 2008, Oracle. All rights reserved. Repository Basics.
HIVE – A PETABYTE SCALE DATA WAREHOUSE USING HADOOP -Abhilash Veeragouni Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning.
Image taken from: slideshare
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
HIVE A Warehousing Solution Over a MapReduce Framework
Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.
Hadoop.
Hive - A Warehousing Solution Over a Map-Reduce Framework
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
A Warehousing Solution Over a Map-Reduce Framework
Hive Mr. Sriram
SQOOP.
Hadoop EcoSystem B.Ramamurthy.
Rekha Singhal, Amol Khanapurkar, TCS Mumbai.
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Server & Tools Business
Pig - Hive - HBase - Zookeeper
Data Warehousing in the age of Big Data (1)
CSE 491/891 Lecture 24 (Hive).
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
Pig Hive HBase Zookeeper
Presentation transcript:

Raghav Ayyamani

Copyright Ellis Horowitz, Why Another Data Warehousing System? Problem : Data, data and more data Several TBs of data everyday The Hadoop Experiment: Uses Hadoop File System (HDFS) Scalable/Available Problem Lacked Expressiveness Map-Reduce hard to program Solution : HIVE

Copyright Ellis Horowitz, What is HIVE? A system for managing and querying unstructured data as if it were structured Uses Map-Reduce for execution HDFS for Storage Key Building Principles SQL as a familiar data warehousing tool Extensibility (Pluggable map/reduce scripts in the language of your choice, Rich and User Defined Data Types, User Defined Functions) Interoperability (Extensible Framework to support different file and data formats) Performance

Copyright Ellis Horowitz, Type System Primitive types – Integers:TINYINT, SMALLINT, INT, BIGINT. – Boolean: BOOLEAN. – Floating point numbers: FLOAT, DOUBLE. – String: STRING. Complex types – Structs: {a INT; b INT}. – Maps: M['group']. – Arrays: ['a', 'b', 'c'], A[1] returns 'b'.

Copyright Ellis Horowitz, Data Model- Tables Tables Analogous to tables in relational DBs. Each table has corresponding directory in HDFS. Example Page view table name – pvs HDFS directory /wh/pvs Example: CREATE TABLE t1(ds string, ctry float, li list<map<string, struct >);

Copyright Ellis Horowitz, Data Model - Partitions Partitions Analogous to dense indexes on partition columns Nested sub-directories in HDFS for each combination of partition column values. Allows users to efficiently retrieve rows Example Partition columns: ds, ctry HDFS for ds= , ctry=US /wh/pvs/ds= /ctry=US HDFS for ds= , ctry=IN /wh/pvs/ds= /ctry=IN

Copyright Ellis Horowitz, Hive Query Language –Contd. Partitioning – Creating partitions CREATE TABLE test_part(ds string, hr int) PARTITIONED BY (ds string, hr int); INSERT OVERWRITE TABLE test_part PARTITION(ds=' ', hr=12) SELECT * FROM t; ALTER TABLE test_part ADD PARTITION(ds=' ', hr=11);

Copyright Ellis Horowitz, Partitioning - Contd.. SELECT * FROM test_part WHERE ds=' '; will only scan all the files within the /user/hive/warehouse/test_part/ds= directory SELECT * FROM test_part WHERE ds=' ' AND hr=11; will only scan all the files within the /user/hive/warehouse/test_part/ds= /hr=11 directory.

Copyright Ellis Horowitz, Data Model Buckets Split data based on hash of a column – mainly for parallelism Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of a table. Example Bucket column: user into 32 buckets HDFS file for user hash 0 /wh/pvs/ds= /cntr=US/part HDFS file for user hash bucket 20 /wh/pvs/ds= /cntr=US/part-00020

Copyright Ellis Horowitz, Data Model External Tables Point to existing data directories in HDFS Can create table and partitions Data is assumed to be in Hive-compatible format Dropping external table drops only the metadata Example: create external table CREATE EXTERNAL TABLE test_extern(c1 string, c2 int) LOCATION '/user/mytables/mydata';

Copyright Ellis Horowitz, Serialization/Deserialization Generic (De)Serialzation Interface SerDe Uses LazySerDe Flexibile Interface to translate unstructured data into structured data Designed to read data separated by different delimiter characters The SerDes are located in 'hive_contrib.jar';

Copyright Ellis Horowitz, Hive File Formats Hive lets users store different file formats Helps in performance improvements SQL Example: CREATE TABLE dest1(key INT, value STRING) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat' 12

Copyright Ellis Horowitz, System Architecture and Components

Driver (Compiler, Optimizer, Executor) Driver (Compiler, Optimizer, Executor) System Architecture and Components Metastore The component that store the system catalog and meta data about tables, columns, partitions etc. Stored on a traditional RDBMS Command Line Interface Web Interface Web Interface Thrift Server Metastore JDBC ODBC

System Architecture and Components Driver The component that manages the lifecycle of a HiveQL statement as it moves through Hive. The driver also maintains a session handle and any session statistics. Command Line Interface Web Interface Web Interface Thrift Server Metastore JDBC ODBC Driver (Compiler, Optimizer, Executor) Driver (Compiler, Optimizer, Executor)

System Architecture and Components Query Compiler The component that compiles HiveQL into a directed acyclic graph of map/reduce tasks. Command Line Interface Web Interface Web Interface Thrift Server Metastore JDBC ODBC Driver (Compiler, Optimizer, Executor) Driver (Compiler, Optimizer, Executor)

System Architecture and Components Optimizer consists of a chain of transformations such that the operator DAG resulting from one transformation is passed as input to the next transformation Performs tasks like Column Pruning, Partition Pruning, Repartitioning of Data Command Line Interface Web Interface Web Interface Thrift Server Metastore JDBC ODBC Driver (Compiler, Optimizer, Executor) Driver (Compiler, Optimizer, Executor)

System Architecture and Components Execution Engine The component that executes the tasks produced by the compiler in proper dependency order. The execution engine interacts with the underlying Hadoop instance. Command Line Interface Web Interface Web Interface Thrift Server Metastore JDBC ODBC Driver (Compiler, Optimizer, Executor) Driver (Compiler, Optimizer, Executor)

Driver (Compiler, Optimizer, Executor) Driver (Compiler, Optimizer, Executor) System Architecture and Components HiveServer The component that provides a trift interface and a JDBC/ODBC server and provides a way of integrating Hive with other applications. Command Line Interface Web Interface Web Interface Metastore JDBC ODBC Thrift Server

Driver (Compiler, Optimizer, Executor) Driver (Compiler, Optimizer, Executor) System Architecture and Components Client Components Client component like Command Line Interface(CLI), the web UI and JDBC/ODBC driver. Command Line Interface Web Interface Web Interface Thrift Server Metastore JDBC ODBC

Copyright Ellis Horowitz, Hive Query Language Basic SQL From clause sub-query ANSI JOIN (equi-join only) Multi-Table insert Multi group-by Sampling Objects Traversal Extensibility Pluggable Map-reduce scripts using TRANSFORM

Copyright Ellis Horowitz, Hive Query Language JOIN SELECT t1.a1 as c1, t2.b1 as c2 FROM t1 JOIN t2 ON (t1.a2 = t2.b2); INSERTION INSERT OVERWRITE TABLE t1 SELECT * FROM t2;

Copyright Ellis Horowitz, Hive Query Language –Contd. Insertion INSERT OVERWRITE TABLE sample1 '/tmp/hdfs_out' SELECT * FROM sample WHERE ds=' '; INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds=' '; INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;

Copyright Ellis Horowitz, Hive Query Language –Contd. Map Reduce FROM (MAP doctext USING 'python wc_mapper.py' AS (word, cnt) FROM docs CLUSTER BY word ) REDUCE word, cnt USING 'python wc_reduce.py'; FROM (FROM session_table SELECT sessionid, tstamp, data DISTRIBUTE BY sessionid SORT BY tstamp ) REDUCE sessionid, tstamp, data USING 'session_reducer.sh';

Copyright Ellis Horowitz, Hive Query Language Example of multi-table insert query and its optimization FROM (SELECT a.status, b.school, b.gender FROM status_updates a JOIN profiles b ON (a.userid = b.userid AND a.ds=' ' )) subq1 INSERT OVERWRITE TABLE gender_summary PARTITION(ds=' ') SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender INSERT OVERWRITE TABLE school_summary PARTITION(ds=' ') SELECT subq1.school, COUNT(1) GROUP BY subq1.school

Hive Query Language

Copyright Ellis Horowitz, Related Work Scope Pig HadoopDB 27

Copyright Ellis Horowitz, Conclusion Pros Good explanation of Hive and HiveQL with proper examples Architecture is well explained Usage of Hive is properly given Cons Accepts only a subset of SQL queries Performance comparisons with other systems would have been more appreciable