Hive: A data warehouse on Hadoop

Presentation transcript:

Hive: A data warehouse on Hadoop
Based on the Facebook team's paper
4/16/2017


Motivation
Yahoo developed Pig to make it easier to deploy applications on Hadoop; its focus was mainly on unstructured data.
At the same time, Facebook began deploying warehouse solutions on Hadoop, which resulted in Hive.
The volume of data collected and analyzed in industry for business intelligence (BI) is growing rapidly, making traditional warehousing solutions prohibitively expensive.

Hadoop MR
MapReduce (MR) is very low level and requires developers to write custom programs.
Hive supports queries expressed in HiveQL, a SQL-like language; these queries are compiled into MR jobs that are executed on Hadoop.
Hive also allows custom MR scripts to be plugged into queries.
It includes a Metastore that holds schemas and statistics, which are useful for data exploration, query optimization, and query compilation.
At Facebook, the Hive warehouse contains tens of thousands of tables, stores over 700 TB of data, and is used for reporting and ad-hoc analyses by 200 Facebook users.

Hive architecture (from the paper)

Data model
Hive structures data into well-understood database concepts: tables, rows, columns, and partitions.
Primitive types: integers, floats, doubles, and strings.
Complex types:
  associative arrays: map<key-type, value-type>
  lists: list<element-type>
  structs: struct<field-name: field-type…>
The SerDe (serialization and deserialization) API is used to move data in and out of tables.
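To make the SerDe idea concrete, here is a minimal Python sketch of a toy text serializer and deserializer for one row that mixes a primitive, a list, and a map. It is not Hive's actual (Java) SerDe interface. The row shape and the function names `serialize` and `deserialize` are assumptions made for illustration; the control-character delimiters mirror the defaults Hive uses for delimited text files.

```python
# Toy text SerDe for a row shaped like:
#   (id int, tags list<string>, props map<string,int>)
# Delimiters mirror Hive's default text layout: \x01 between fields,
# \x02 between collection items, \x03 between a map key and its value.
FIELD, ITEM, KV = "\x01", "\x02", "\x03"


def serialize(row_id, tags, props):
    """Turn the three typed values into one delimited text line."""
    tags_text = ITEM.join(tags)
    props_text = ITEM.join(f"{k}{KV}{v}" for k, v in props.items())
    return FIELD.join([str(row_id), tags_text, props_text])


def deserialize(line):
    """Parse a delimited line back into (int, list, dict)."""
    id_text, tags_text, props_text = line.split(FIELD)
    tags = tags_text.split(ITEM) if tags_text else []
    props = {}
    if props_text:
        for pair in props_text.split(ITEM):
            key, value = pair.split(KV)
            props[key] = int(value)
    return int(id_text), tags, props


row = (7, ["a", "b"], {"clicks": 3, "views": 10})
line = serialize(*row)
round_trip = deserialize(line)
```

The point of the sketch is the round trip: whatever the storage format, a SerDe must map typed columns to bytes on the way in and back to typed columns on the way out.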

Query Language (HiveQL)
A subset of SQL
Metadata queries
Join predicates are limited to equality conditions
No inserts into existing tables (to preserve the WORM, write once read many, property)
An entire table can be overwritten

Wordcount in Hive
FROM (
  MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
  FROM docs
  CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
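The slide does not show the two scripts the query calls. The sketch below is one hypothetical way `wc_mapper.py` and `wc_reduce.py` could work, assuming Hive streams rows to the script as tab-separated lines and reads tab-separated lines back. The function names `map_lines` and `reduce_lines` and the sample documents are invented for illustration, and a local `sorted` call stands in for `CLUSTER BY word`.

```python
def map_lines(lines):
    """Mapper logic: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer logic: sum counts per word.

    Relies on equal words arriving next to each other, which is what
    CLUSTER BY word provides inside each reducer.
    """
    current, total = None, 0
    for line in lines:
        word, cnt = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(cnt)
    if current is not None:
        yield f"{current}\t{total}"


# Local simulation with a single reducer: sorting the mapper output
# stands in for the shuffle that CLUSTER BY triggers.
docs = ["hive on hadoop", "hive on spark"]
counts = list(reduce_lines(sorted(map_lines(docs))))
```

In a real deployment each function would sit in its own script, reading from `sys.stdin` and printing to standard output, since that is how the `MAP ... USING` and `REDUCE ... USING` clauses exchange rows with external programs.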

Session/timestamp example
FROM (
  FROM session_table
  SELECT sessionid, tstamp, data
  DISTRIBUTE BY sessionid SORT BY tstamp
) a
REDUCE sessionid, tstamp, data USING 'session_reducer.sh';
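The following Python sketch simulates what `DISTRIBUTE BY sessionid SORT BY tstamp` guarantees to the reducer script. Everything in it is an assumption for illustration: the sample rows, the use of `zlib.crc32` as a stand-in for Hive's partitioning hash, and the per-session duration computed in place of whatever `session_reducer.sh` actually does.

```python
import zlib
from collections import defaultdict


def distribute_and_sort(rows, num_reducers):
    """Mimic DISTRIBUTE BY sessionid SORT BY tstamp.

    rows are (sessionid, tstamp, data) tuples. All rows of one session
    land on the same reducer; within a reducer rows are ordered by tstamp.
    """
    partitions = defaultdict(list)
    for row in rows:
        sessionid = row[0]
        # crc32 is only a deterministic stand-in for the real hash.
        reducer = zlib.crc32(sessionid.encode()) % num_reducers
        partitions[reducer].append(row)
    # SORT BY orders rows inside each reducer, not across reducers.
    return {r: sorted(p, key=lambda row: row[1]) for r, p in partitions.items()}


def session_durations(sorted_rows):
    """Hypothetical reducer logic: last tstamp minus first tstamp per session."""
    first, last = {}, {}
    for sessionid, tstamp, _data in sorted_rows:
        first.setdefault(sessionid, tstamp)
        last[sessionid] = tstamp
    return {s: last[s] - first[s] for s in first}


rows = [
    ("s1", 30, "click"),
    ("s2", 12, "view"),
    ("s1", 10, "view"),
    ("s2", 45, "click"),
    ("s1", 20, "scroll"),
]
partitions = distribute_and_sort(rows, num_reducers=2)
durations = {}
for part in partitions.values():
    durations.update(session_durations(part))
```

Note that the sort key is the timestamp alone, so rows from different sessions can interleave inside one reducer; the reducer logic therefore tracks each session separately rather than assuming its rows are contiguous.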

Data Storage
Tables are logical data units; table metadata associates the data in a table with HDFS directories.
HDFS namespace: table (HDFS directory), partition (HDFS subdirectory of the table directory), bucket (file within the partition or table directory).
Example: /user/hive/warehouse/test_table is an HDFS directory.
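The mapping from tables and partitions to HDFS paths can be sketched in a few lines of Python. The warehouse root is the one shown on the slide; the helper names, the partition columns `ds` and `ctry`, and their values are hypothetical. Partition directories follow Hive's `column=value` naming, and the bucket helper only shows the "hash modulo number of buckets" idea without reproducing Hive's own hash function.

```python
import posixpath

WAREHOUSE_ROOT = "/user/hive/warehouse"  # directory shown on the slide


def table_dir(table):
    """HDFS directory that holds all data of a table."""
    return posixpath.join(WAREHOUSE_ROOT, table)


def partition_dir(table, partition_spec):
    """HDFS subdirectory for one partition.

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one 'column=value' directory level.
    """
    levels = [f"{col}={val}" for col, val in partition_spec]
    return posixpath.join(table_dir(table), *levels)


def bucket_index(hash_value, num_buckets):
    """Bucket that a row falls into, given the hash of its bucketing column."""
    return hash_value % num_buckets


path = partition_dir("test_table", [("ds", "2009-01-01"), ("ctry", "US")])
```

Because the partition values are encoded in the path, a query that filters on a partition column only needs to read the matching subdirectories.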


Architecture
Metastore: stores the system catalog.
Driver: manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics.
Query compiler: compiles HiveQL into a directed acyclic graph (DAG) of map/reduce tasks.
Execution engine: executes the tasks in proper dependency order and interacts with Hadoop.
HiveServer: provides a Thrift interface and JDBC/ODBC for integrating other applications.
Client components: CLI, web interface, and JDBC/ODBC interface.
Extensibility interfaces: SerDe, User Defined Functions (UDFs), and User Defined Aggregate Functions (UDAFs).
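Running a DAG of tasks "in proper dependency order" amounts to a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to show the idea; the task names and the shape of the plan are invented for illustration and are not taken from Hive's compiler output.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return one valid run order for a task DAG.

    dependencies maps each task to the set of tasks it must wait for.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical plan: two scans feed a join, which feeds an aggregation.
plan = {
    "scan_docs": set(),
    "scan_users": set(),
    "join": {"scan_docs", "scan_users"},
    "aggregate": {"join"},
}
order = execution_order(plan)
```

The two scans have no dependency on each other, so an engine is free to run them in either order or in parallel; only the join and the aggregation are constrained.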

Sample Query Plan

Hive Usage in Facebook
Hive and Hadoop are used extensively at Facebook for many kinds of operations.
700 TB of data occupies 2.1 PB after replication (700 TB × 3 replicas = 2,100 TB).
Exercise: think of other application models that can leverage Hadoop MR.