A warehouse solution over map-reduce framework Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff.

Presentation transcript:

HIVE - A warehouse solution over Map Reduce Framework
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy
Dony Ang
8/13/2015

overview
• background
• what is Hive
• Hive DB
• Hive architecture
• Hive datatypes
• hiveQL
• hive components
• execution flows
• compiler in details
• pros and cons
• conclusion

background
• The size of the datasets collected and analyzed for business intelligence is growing rapidly, making traditional warehousing more expensive
• Hadoop is a popular open-source map-reduce implementation, used as an alternative to store and process extremely large data sets on commodity hardware
• However, map-reduce itself is very low-level and requires developers to write custom code.

General Ecosystem of DW
[Figure: data warehouse ecosystem, with labels Hadoop M / R, Reporting / BI layer, SQL, ETL, SQL]

what is hive ?
• Open-source DW solution built on top of Hadoop
• Supports a SQL-like declarative language called HiveQL, which is compiled into map-reduce jobs executed on Hadoop
• Also supports custom map-reduce scripts that can be plugged into queries.
• Includes a system catalog, the Hive Metastore, for query optimization and data exploration

Hive Database
• Data Model
  Tables
  ○ Analogous to tables in a relational database
  ○ Each table has a corresponding HDFS dir
  ○ Data is serialized and stored in files within the dir
  ○ Supports external tables on data stored in HDFS, NFS or a local directory.
  Partitions
  ○ Each table can have 1 or more partitions (1-level), which determine the distribution of data within subdirectories of the table directory.

HIVE Database cont.
  e.g.: Table T is stored under /wh/T and is partitioned on columns ds + ctry
  For ds= ctry=US
  the data is stored within dir /wh/T/ds= /ctry=US
  Buckets
  ○ Data in each partition is divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory.
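
The partition and bucket layout described above can be sketched in a few lines of Python. This is a minimal illustration, not Hive's implementation: the function names, the date value 2015-01-01, the bucketing column userid, the bucket count of 32, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib


def partition_dir(table_dir, partition_spec):
    """Build the directory for one partition: <table_dir>/<col>=<val>/..."""
    parts = [f"{col}={val}" for col, val in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_index(value, num_buckets):
    """Pick a bucket from a hash of the bucketing column's value.

    CRC32 is a stand-in hash for illustration; it is not Hive's hash function.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical row for table T: ds and ctry are the partition columns,
# userid is an assumed bucketing column, and the date is a placeholder.
directory = partition_dir("/wh/T", [("ds", "2015-01-01"), ("ctry", "US")])
bucket = bucket_index(12345, num_buckets=32)
print(directory)  # /wh/T/ds=2015-01-01/ctry=US
print(0 <= bucket < 32)  # True
```

The partition columns decide which subdirectory a row lands in, while the hash of the bucketing column decides which file inside that subdirectory holds it.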

HIVE datatype
• Supports primitive column types
  ○ Integer
  ○ Floating point
  ○ Strings
  ○ Date
  ○ Boolean
• As well as nestable collections such as array or map
• Users can also define their own types programmatically

hiveQL
• Supports a SQL-like query language called HiveQL, with select, join, aggregate, union all and sub-queries in the FROM clause
• Supports DDL statements such as CREATE TABLE with serialization format, partitioning and bucketing columns
• Commands to load data from external sources and INSERT into HIVE tables:
    LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates PARTITION (ds=' ')
• Does NOT support UPDATE and DELETE

hiveQL cont.
• Supports multi-table INSERT
    FROM (SELECT a.status, b.school, b.gender
          FROM status_updates a JOIN profiles b
          ON (a.userid = b.userid and a.ds=' ')) subq1
    INSERT OVERWRITE TABLE gender_summary PARTITION (ds=' ')
      SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
    INSERT OVERWRITE TABLE school_summary PARTITION (ds=' ')
      SELECT subq1.school, COUNT(1) GROUP BY subq1.school
• Also supports user-defined column transformation (UDF) and aggregation (UDAF) functions written in Java
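
The usual motivation for a multi-table INSERT is that several aggregations can share one scan of the same input. The Python sketch below mimics that idea for the gender and school counts in the query above. The sample rows and the function name multi_insert are hypothetical, and the code only models the counting logic, not how Hive executes the query.

```python
from collections import Counter

# Hypothetical joined rows (status, school, gender), standing in for subq1.
subq1 = [
    ("hello", "MIT", "F"),
    ("busy", "MIT", "M"),
    ("lunch", "WPI", "F"),
]


def multi_insert(rows):
    """Scan the rows once and feed two GROUP BY ... COUNT(1) aggregations."""
    gender_summary = Counter()
    school_summary = Counter()
    for _status, school, gender in rows:
        gender_summary[gender] += 1
        school_summary[school] += 1
    return dict(gender_summary), dict(school_summary)


gender_summary, school_summary = multi_insert(subq1)
print(gender_summary)  # {'F': 2, 'M': 1}
print(school_summary)  # {'MIT': 2, 'WPI': 1}
```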

HIVE Architecture

HIVE Components
• External Interfaces
  ○ User interfaces, both CLI and Web UI, and APIs like JDBC and ODBC.
• Hive Thrift Server
  ○ Simple client API to execute HiveQL statements
• Metastore
  ○ System catalog
• Driver
  ○ Manages the lifecycle of a HiveQL statement through compilation, optimization and execution.

Execution Flow

Compiler in details
• When the driver invokes the compiler with a HiveQL string, the compiler converts the string into a plan.
• The plan can be
  ○ A metadata operation for a DDL statement
  ○ An HDFS operation for a LOAD statement
  ○ For inserts / queries, a DAG (Directed Acyclic Graph) of map-reduce jobs.
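
The three plan kinds listed above can be summarized with a small dispatch function. This is only a sketch: the real compiler parses the whole statement, whereas this hypothetical plan_kind function looks at the first keyword, and the DDL keyword set is an assumed, incomplete sample.

```python
# Assumed, incomplete sample of DDL keywords for illustration.
DDL_KEYWORDS = {"CREATE", "DROP", "ALTER", "SHOW", "DESCRIBE"}


def plan_kind(statement):
    """Return which of the three plan kinds a statement would map to."""
    words = statement.strip().split(None, 1)
    first_word = words[0].upper() if words else ""
    if first_word in DDL_KEYWORDS:
        return "metadata operation"
    if first_word == "LOAD":
        return "HDFS operation"
    # A multi-table INSERT starts with FROM, so FROM counts as a query here.
    if first_word in {"INSERT", "SELECT", "FROM"}:
        return "DAG of map-reduce jobs"
    raise ValueError(f"unrecognized statement: {statement!r}")


print(plan_kind("CREATE TABLE t (a INT)"))  # metadata operation
print(plan_kind("LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates"))  # HDFS operation
print(plan_kind("SELECT COUNT(1) FROM t"))  # DAG of map-reduce jobs
```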

Compiler cont.
• Parser
  ○ Transforms the query into a parse tree representation
• Semantic Analyzer
  ○ Transforms the parse tree into a block-based internal query representation
  ○ Retrieves schema information of the input tables from the metastore, verifies the column names, expands SELECT * and does type-checking, including implicit type conversions

Compiler cont.
• Physical Plan Generator
  ○ Converts the logical plan into a physical plan consisting of a DAG of map-reduce jobs

Compiler cont.
• Logical Plan Generator
  ○ Converts the internal query representation into a logical plan consisting of a tree of logical operators.
• Optimizer
  ○ Performs multiple passes over the logical plan and rewrites it in several ways
  ○ Combines multiple joins which share the join key into a single multi-way JOIN, and hence a single map-reduce job.
  ○ Prunes columns early and pushes predicates closer to the table scan operator to minimize data transfer.
  ○ Prunes partitions that the query does not need
  ○ For sampling queries, prunes unneeded buckets.
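
Two of the optimizer rewrites above, join merging and partition pruning, can be illustrated with a toy model. The sketch below is not Hive's optimizer: the function names are hypothetical, joins are reduced to their key names, only adjacent joins with the same key are merged (an assumption of this example), and partitions are plain dictionaries with placeholder values.

```python
def merge_joins(join_keys):
    """Group adjacent joins that share a join key into one multi-way join.

    join_keys lists the key of each two-way join in query order. Each
    (key, count) pair in the result stands for one multi-way join, which
    corresponds to one map-reduce job in this toy model.
    """
    groups = []
    for key in join_keys:
        if groups and groups[-1][0] == key:
            groups[-1][1] += 1
        else:
            groups.append([key, 1])
    return [(key, count) for key, count in groups]


def prune_partitions(partitions, predicate):
    """Keep only the partitions whose column values satisfy the predicate."""
    return [p for p in partitions if predicate(p)]


# Three two-way joins, two of them on userid, collapse into two jobs.
print(merge_joins(["userid", "userid", "pageid"]))  # [('userid', 2), ('pageid', 1)]

# Hypothetical partitions of a table partitioned on ds and ctry.
partitions = [
    {"ds": "day1", "ctry": "US"},
    {"ds": "day1", "ctry": "CA"},
    {"ds": "day2", "ctry": "US"},
]
print(prune_partitions(partitions, lambda p: p["ctry"] == "US"))
# [{'ds': 'day1', 'ctry': 'US'}, {'ds': 'day2', 'ctry': 'US'}]
```

Pruning happens on partition metadata alone, so the data files of the skipped partitions are never read.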

“Plumbing” of HIVE compiler
• Hive SQL String: SQLs from client
• PARSER: Convert into parse tree representation
• SEMANTIC ANALYZER: Convert into block-based internal query representation

Plumbing cont.
• Logical Plan Generator: Convert internal query representation into a logical plan
• OPTIMIZER: Rewrite plans into more optimized plans
• Physical Plan Generator: Convert into physical plans (map-reduce jobs)
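
The compiler "plumbing" is a linear pipeline in which each stage consumes the previous stage's output. The sketch below only records the order of stages and the form each one produces, following the stage descriptions in the compiler slides. The names STAGES, run_stage and compile_query are hypothetical, and no actual parsing or planning is performed.

```python
# Stage name and the form of its output, in pipeline order.
STAGES = [
    ("parser", "parse tree"),
    ("semantic analyzer", "block-based internal query representation"),
    ("logical plan generator", "logical plan"),
    ("optimizer", "optimized logical plan"),
    ("physical plan generator", "physical plan (map-reduce jobs)"),
]


def run_stage(name, output_form, representation):
    """Placeholder stage: relabel the representation instead of transforming it."""
    return {
        "form": output_form,
        "produced_by": name,
        "source": representation["source"],
    }


def compile_query(query):
    """Thread a query string through every stage and record each form."""
    representation = {"form": "HiveQL string", "produced_by": "client", "source": query}
    history = [representation["form"]]
    for name, output_form in STAGES:
        representation = run_stage(name, output_form, representation)
        history.append(representation["form"])
    return representation, history


final, history = compile_query("SELECT COUNT(1) FROM status_updates")
print(final["produced_by"])  # physical plan generator
print(len(history))  # 6
```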

Pros
• HIVE is a great supplement to Hadoop: it bridges the gap between the low-level interface that Hadoop requires and industry-standard SQL, which is more commonplace.
• Supports external tables, which make it easy to access data without ingesting it into HDFS.
• Supports ODBC/JDBC, which enables connectivity with many commercial Business Intelligence and/or ETL tools.
• Has an intelligent optimizer (naïve rule-based) that rewrites logical plans into more efficient plans.
• Supports table-level partitioning to speed up query times.
• A great design decision: a traditional RDBMS keeps the metadata (Metastore), which is more optimal and proven for random access.

Cons
• HiveQL is not 100% ANSI-compliant SQL.
• No support for UPDATE & DELETE
• No support for singleton INSERT
• Only 1 level of partitioning is available.
• The rule-based optimizer doesn’t take available resources into account when generating logical and physical plans.
• No Access Control Language supported
• No full support for subqueries (correlated subqueries).

Conclusion
With the increasing popularity of Hadoop as the data platform of choice for many organizations, HIVE becomes a ‘must-have supplement’ that provides greater usability and connectivity within the organization by introducing high-level language support in the form of HiveQL.

Example of Query Plans

Comparable work
• Apache Pig
  ○ Similar approach to HIVE, with a high-level language that is compiled into a sequence of map-reduce programs.
  ○ The language is Pig's own (Pig Latin) and it is NOT a SQL-like language.
  ○ Pig queries tend to be slower than their HIVE or plain Hadoop equivalents.

Q & A