Hive – A Warehousing Solution Over a Map-Reduce Framework Presented by: Atul Bohara Feb 18, 2014.

Slides:



Advertisements
Similar presentations
Shark:SQL and Rich Analytics at Scale
Advertisements

CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.
Software and Services Group “Project Panthera”: Better Analytics with SQL, MapReduce and HBase Jason Dai Principal Engineer Intel SSG (Software and Services.
Hive - A Warehousing Solution Over a Map-Reduce Framework.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
Introduction to Hive Liyin Tang
Hive: A data warehouse on Hadoop
ETM Hadoop. ETM IDC estimate put the size of the “digital universe” at zettabytes in forecasting a tenfold growth by 2011 to.
HIVE Data Warehousing & Analytics on Hadoop Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team.
UT DALLAS Erik Jonsson School of Engineering & Computer Science FEARLESS engineering Secure Data Storage and Retrieval in the Cloud Bhavani Thuraisingham,
CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.
A warehouse solution over map-reduce framework Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff.
Raghav Ayyamani. Copyright Ellis Horowitz, Why Another Data Warehousing System? Problem : Data, data and more data Several TBs of data everyday.
SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL.
Hive: A data warehouse on Hadoop Based on Facebook Team’s paperon Facebook Team’s paper 8/18/20151.
HADOOP ADMIN: Session -2
DLRL Cluster Matt Bollinger, Joseph Pontani, Adam Lech Client: Sunshin Lee CS4624 Capstone Project March 3, 2014 Virginia Tech, Blacksburg, VA.
Analytics Map Reduce Query Insight Hive Pig Hadoop SQL Map Reduce Business Intelligence Predictive Operational Interactive Visualization Exploratory.
Facebook (stylized facebook) is a Social Networking System and website launched in February 2004, operated and privately owned by Facebook, Inc. As.
Pig: Making Hadoop Easy Wednesday, June 10, 2009 Santa Clara Marriott.
Hive : A Petabyte Scale Data Warehouse Using Hadoop
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Cloud Computing Other High-level parallel processing languages Keke Chen.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Penwell Debug Intel Confidential BRIEF OVERVIEW OF HIVE Jonathan Brauer ESE 380L Feb
Introduction to Hadoop and HDFS
Hive Facebook 2009.
Storage and Analysis of Tera-scale Data : 2 of Database Class 11/24/09
MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.
An Introduction to HDInsight June 27 th,
A NoSQL Database - Hive Dania Abed Rabbou.
Hive – SQL on top of Hadoop
SLIDE 1IS 257 – Fall 2014 NewSQL and VoltDB University of California, Berkeley School of Information IS 257: Database Management.
Indexing HDFS Data in PDW: Splitting the data from the index VLDB2014 WSIC、Microsoft Calvin
Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.
Introduction to Hbase. Agenda  What is Hbase  About RDBMS  Overview of Hbase  Why Hbase instead of RDBMS  Architecture of Hbase  Hbase interface.
Hung-chih Yang 1, Ali Dasdan 1 Ruey-Lung Hsiao 2, D. Stott Parker 2
Hive. What is Hive? Data warehousing layer on top of Hadoop – table abstractions SQL-like language (HiveQL) for “batch” data processing SQL is translated.
IBM Research ® © 2007 IBM Corporation A Brief Overview of Hadoop Eco-System.
Nov 2006 Google released the paper on BigTable.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Impala. Impala: Goals General-purpose SQL query engine for Hadoop High performance – C++ implementation – runtime code generation (using LLVM) – direct.
BACS 287 Big Data & NoSQL 2016 by Jones & Bartlett Learning LLC.
Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.
Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.
Moscow, November 16th, 2011 The Hadoop Ecosystem Kai Voigt, Cloudera Inc.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
Hive Big data for CSci 4707 students! Eric Atherton and Henry Hoang.
Microsoft Ignite /28/2017 6:07 PM
HIVE – A PETABYTE SCALE DATA WAREHOUSE USING HADOOP -Abhilash Veeragouni Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning.
Image taken from: slideshare
HIVE A Warehousing Solution Over a MapReduce Framework
Hadoop.
Hive - A Warehousing Solution Over a Map-Reduce Framework
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
A Warehousing Solution Over a Map-Reduce Framework
Hive Mr. Sriram
SQOOP.
Hadoop EcoSystem B.Ramamurthy.
Rekha Singhal, Amol Khanapurkar, TCS Mumbai.
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Server & Tools Business
Introduction to Apache
Pig - Hive - HBase - Zookeeper
Data Warehousing in the age of Big Data (1)
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Pig Hive HBase Zookeeper
Presentation transcript:

Hive – A Warehousing Solution Over a Map-Reduce Framework Presented by: Atul Bohara Feb 18, 2014

Outline  Cloud Programming  Background and Motivation  Hive  Introduction  Database Design  Hive Query Language  Architecture  Analysis  Conclusion and Discussion

Cloud Programming  Applications built for cloud computing -Handle large data sets -Distributed and parallel processing  Hadoop, Hive, Pig, HBase, Cassandra, Storm, Spark are some of the most popular systems

Background and Motivation  Data, data and more data -FB warehouse grows half PB per day*  Traditional warehousing solutions Insufficient and Expensive  Hadoop (map-reduce and HDFS) -Scale out, available, fault-tolerant  Problem: map-reduce programming model is low level -Hard to maintain and reuse -Limited expressiveness, no schema  Solution: Hive *source:

What is Hive? Open-source data warehousing solution built on top of Hadoop.

Conceptual Data Flow Hive RDBMS or File System Web servers Frequency of status updates based on gender Frequency of status updates based on school Top 10 most popular status memes per school Users Data Engineer/ Analyst Hadoop

Example: StatusMeme  Creating Table CREATE TABLE status_updates (userid int, status string, ds string) PARTITIONED BY (ds string) CLUSTERED BY (userid) SORTED BY (ds) INTO 32 BUCKETS STORED AS SEQUENCEFILE  profiles(userid int, school string, gender int)

Hive Data Model status_ updates status_ updates ds=‘ ’ ds=‘ ’ ds=‘ ’ HDFS Directory Sub-directories Files Part Part-00001

StatusMeme (contd.)  Loading Data LOAD DATA LOCAL INPAT ‘/logs/status_updates’ INTO TABLE status_updates PARTITION (ds=‘ ’)

StatusMeme (contd.)  Multi-table Insert and Join FROM (SELECT a.status, b.school, b.gender FROM status_updates a JOIN profiles b ON (a.userid = b.userid and a.ds=‘ ’) ) subq1 INSERT OVERWRITE TABLE gender_summary PARTITION(ds=‘ ’) SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender INSERT OVERWRITE TABLE school_summary PARTITION (ds=‘ ’) SELECT subq1.school, COUNT(1) GROUP BY subq1.school Frequency of status updates based on gender? Frequency of status updates based on school?

StatusMeme (contd.)  HiveQL Map-Reduce Constructs REDUCE subq2.school, subq2.meme, subq2.cnt USING ‘top10.py’ AS (school, meme, cnt) FROM (SELECT subq1.school, subq1.meme, COUNT(1) AS cnt FROM ( MAP b.school, a.status USING ‘meme_extractor.py’ AS (school, meme) FROM status_updates a JOIN profiles b ON (a.userid = b.userid) ) subq1 GROUP BY subq1.school, subq1.meme DISTRIBUTE BY school, meme SORY BY school, meme, cnt desc ) subq2; Top 10 most popular status memes per school?

Hive Query Language  SQL-like query language called HiveQL -DDL: Create, Drop, Alter (tables/partitions/columns) -DML: Load and insert Data, select, project, join, aggregate, union all -Primitive and complex data types -Operators and functions -Support for variety of file formats

Hive Architecture

 External Interfaces -User interfaces like CLI, web UI -APIs like JDBC and ODBC -Other interfaces like Microsoft HDInsight

Hive Architecture  Thrift Server -Exposes a simple client API to execute HiveQL statements -Thrift is a framework for cross-language services -Delegates client requests for metadata to metastore or hadoop file system

Hive Architecture  Metastore -Stores metadata and statistics for databases, tables, partitions -A relational database or file system optimized for random accesses and updates -Consistency between metadata and data should be maintained explicitly

Hive Architecture  Driver -Manages life cycle of HiveQL statement: compilation, optimization and execution -Uses a session handle to keep track of a statement

Hive Architecture  Compiler -Translate query statements into a ‘Plan’ Metadata operations (DDL statements) HDFS operations (LOAD statement) DAG of map-reduce jobs (insert/query)

Hive Architecture  Optimizer -Combines multiple joins -Adds repartition operator (ReduceSinkOperator) -Minimize data transfers and pruning -User hinted optimizations

Hive Architecture  Execution Engine -Runs DAG of map-reduce jobs on Hadoop in proper dependency order

Query Plan generated by the compiler for StatusMeme Example

Analysis  Hive performance benchmark Experiment -To compare performance of Hadoop, Hive and Pig -Executed 4 queries (2 select, 1 aggregation and 1 join query) -Data set: -Grep table, 2 columns, 50 GB data -Rankings table, 3 columns, 3.3 GB data -Uservisits table, 9 columns, 60 GB data Source: Hive performance benchmark:

Analysis  Hive performance benchmark results Source: Hive performance benchmark:

Conclusion  Data warehousing solution on top of Hadoop -Usability (HiveQL, intuitive and precise code) -Extensibility (custom map-reduce scripts, User defined data types and functions) -Interoperability (support for variety of file and data formats) -Performance (optimizations in data scan, data pruning)  Hive use cases -Reporting (summarization, ad hoc queries) -Data mining and machine learning -Business intelligence  Hive Adopters -Facebook, Netflix, Yahoo, Taobao, cnet, hi5 and many more…

Discussion  Hive is not suitable for OLTP -No real time queries and row level updates -Latency of minutes for small queries  Map-reduce vs Hive (Pig) Map-reduce -complex and sophisticated data processing -Full control to minimize map-reduce jobs -Better in performance than Hive most of the times -Writing Java code for map-reduce is difficult and time consuming Hive -More suited for ad-hoc queries -Intuitive and easy to write SQL like queries

Discussion

 Hive vs Pig -Hive is more natural to database developers and Pig is convenient for script/perl developers -Hive gives better performance than Pig -Pig is more suited for ETL and data movement, while Hive is better for ad hoc queries

Discussion “Hive is evolving as a parallel SQL DBMS which happens to use Hadoop as its storage and execution architecture”  Increasing HiveQL Capabilities -Insert, update and delete with full ACID support -REGEX support in selecting files for External Table -Support for ‘dual’ table, information_schema -Improving ‘Explain’  Optimization Techniques -Cost-based optimization  Performance Improvements -Adoption of RCFile for data placement, and most recently the ORC  Integration with other Frameworks -Support with HBase, Cassandra

References  [1] Hive paper  [2] Hive wiki  [3] Hive performance benchmarks  [4] Hive Facebook presentation sig_2010_v21.pdf  [5] Hive vs Pig: Yahoo developers  [6] Hive Apache JIRA page

Backup Slides  Extra Material

HiveQL Capabilities  Simple and partition based queries  Joins  Aggregations  Multi-table or File inserts  Dynamic-partition inserts  Local file insert  Sampling  Union all  Array operations  Custom map-reduce scripts

Hive Data Model  External Tables -Point to existing data directories in HDFS, NFS or local directories -Used for data filtering  Serialization/De-serialization (SerDe) -Tables are serialized before storing into directories -Built-in formats using compression and lazy de- serialization -User defined (de)serialize methods are supported, called SerDe’s

Compilation Process  Parser -Converts query string to parse tree  Semantic Analyzer -Converts parse tree to block-based internal query representation -Verification, expansion of select * and type checking  Logical Plan Generator -Converts internal query to logical plan  Optimizer -Combines multiple joins -Adds repartition operator (ReducedSinkOperator) -Minimize data transfers and pruning -User hinted optimizations  Physical Plan Generator -Converts logical plan into physical plan (DAG of map-reduce jobs)