Introduction to Hive Liyin Tang

Outline
  - Motivation
  - Overview
  - Data Model / Metadata
  - Architecture
  - Performance
  - Cons and Pros
  - Application
  - Related Work
Introduction to Hive, 6/11/2015

Motivation
  [Diagram: Web Servers, Scribe MidTier, Scribe Writers, Realtime Hadoop Cluster, Hadoop Hive Warehouse, Oracle RAC, MySQL]

Motivation
  - Limitations of MapReduce (MR)
    - Every job has to be expressed in the map/reduce model
    - Code is not reusable
    - Error prone
    - For complex jobs:
      - Multiple stages of map/reduce functions
      - It is like asking a developer to hand-write the physical execution plan for a database query

Overview
  - Intuition
    - Make unstructured data look like tables, regardless of how it is actually laid out
    - SQL-based queries can be run directly against these tables
    - A specific execution plan is generated for each query
  - What is Hive?
    - A data warehousing system that stores structured data on the Hadoop file system
    - Provides an easy way to query this data by executing Hadoop MapReduce plans

Data Model
  - Tables
    - Basic column types (int, float, boolean)
    - Complex types: List / Map (associative array)
  - Partitions
  - Buckets
  - CREATE TABLE sales (id INT, items ARRAY) PARTITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
  - SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32);
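As a rough illustration of how CLUSTERED BY (id) INTO 32 BUCKETS relates to TABLESAMPLE (BUCKET 1 OUT OF 32), the Python sketch below assigns rows to buckets by taking the id modulo the bucket count and then reads a single bucket. The hash rule (plain modulo on an integer id), the 1-based bucket numbering, and the helper names are assumptions made for illustration; this is not Hive's actual implementation.

```python
NUM_BUCKETS = 32  # matches "INTO 32 BUCKETS" in the slide


def bucket_of(row_id, num_buckets=NUM_BUCKETS):
    """Return the 0-based bucket index for an id (assumed rule: id mod bucket count)."""
    return row_id % num_buckets


def cluster_into_buckets(rows, num_buckets=NUM_BUCKETS):
    """Distribute rows into buckets by their id, as CLUSTERED BY (id) would."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row["id"], num_buckets)].append(row)
    return buckets


def table_sample(buckets, bucket_number):
    """Mimic TABLESAMPLE (BUCKET n OUT OF N); n is assumed to be 1-based."""
    return buckets[bucket_number - 1]


# Hypothetical data: 100 sales rows with ids 0..99.
rows = [{"id": i, "items": []} for i in range(100)]
buckets = cluster_into_buckets(rows)
sample_ids = [row["id"] for row in table_sample(buckets, 1)]
print(sample_ids)  # only rows from one bucket are read
```

The point of the sketch is that sampling one bucket touches roughly 1/32 of the rows without scanning the rest.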

Metadata
  - Database namespace
  - Table definitions
    - Schema info, physical location in HDFS
  - Partition data
  - ORM framework
    - All metadata is stored in Derby by default
    - Any database with a JDBC driver can be configured instead

Architecture
  [Diagram] Components shown:
  - Web UI + Hive CLI + JDBC/ODBC: Browse, Query, DDL
  - MetaStore, Thrift API
  - Hive QL: Parser, Planner, Optimizer, Execution
  - SerDe: CSV, Thrift, Regex
  - UDF/UDAF: substr, sum, average
  - FileFormats: TextFile, SequenceFile, RCFile
  - User-defined Map-reduce Scripts
  - HDFS, Map Reduce
  Source: facebook-hive-and-hdfs

Performance
  - GROUP BY operation
    - Efficient execution plans based on:
      - Data skew
        - How evenly the data is distributed across the physical nodes
        - Bottleneck vs. load balance
      - Partial aggregation
        - Group rows with the same GROUP BY value as early as possible
        - In-memory hash table in the mapper
        - Happens earlier than the combiner
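The partial-aggregation idea can be sketched in a few lines of Python: each mapper keeps an in-memory hash table keyed by the GROUP BY value and emits one partial sum per key instead of one record per row. The function names and the sample data are hypothetical, and the sketch ignores memory limits, which a real system would have to handle (for example by flushing the table when it grows too large).

```python
from collections import defaultdict


def map_with_partial_aggregation(rows, key_field, value_field):
    """Mapper: sum values per key in an in-memory hash table, then emit one pair per key."""
    partial = defaultdict(int)
    for row in rows:
        partial[row[key_field]] += row[value_field]
    return list(partial.items())


def reduce_sum(mapper_outputs):
    """Reducer: merge the partial sums from all mappers into final totals."""
    totals = defaultdict(int)
    for pairs in mapper_outputs:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)


# Two hypothetical input splits, one per mapper (5 rows in total).
split_1 = [{"k": "a", "v": 1}, {"k": "a", "v": 2}, {"k": "b", "v": 5}]
split_2 = [{"k": "a", "v": 4}, {"k": "b", "v": 1}]

mapper_outputs = [map_with_partial_aggregation(s, "k", "v") for s in (split_1, split_2)]
shuffled_records = sum(len(pairs) for pairs in mapper_outputs)
totals = reduce_sum(mapper_outputs)
print(shuffled_records, totals)
```

Here five input rows shrink to four shuffled records; with many rows per key the reduction in shuffle traffic is much larger.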

Performance
  - JOIN operation
    - Traditional map-reduce join
    - Early map-side join
      - Very efficient for joining a small table with a large table
      - Keep the smaller table's data in memory first
      - Join with one chunk of the larger table's data at a time
      - Trades space complexity for time complexity
Introduction to Hive, 7/20/2010
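A minimal Python sketch of the map-side join described above: the small table is loaded into an in-memory hash table once, and each chunk of the large table is streamed past it, so no shuffle or reduce phase is needed. The table contents, field names, and the function name are hypothetical, and the sketch performs an inner join only.

```python
def map_side_join(small_table, large_table_chunks, key):
    """Inner-join a small in-memory table against a large table streamed chunk by chunk."""
    # Build the hash table from the small table (this is the extra space being spent).
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    # Stream the large table; each chunk stands in for one mapper's input split.
    for chunk in large_table_chunks:
        for big_row in chunk:
            for small_row in lookup.get(big_row[key], []):
                yield {**small_row, **big_row}


# Hypothetical data: a small users table and a large clicks table in two chunks.
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
click_chunks = [
    [{"uid": 1, "url": "/a"}, {"uid": 3, "url": "/b"}],
    [{"uid": 2, "url": "/c"}, {"uid": 1, "url": "/d"}],
]

joined = list(map_side_join(users, click_chunks, "uid"))
for row in joined:
    print(row)
```

The click with uid 3 has no matching user and is dropped, as expected for an inner join.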

Performance
  - SerDe (serializer/deserializer)
    - Describes how to load data from a file into a representation that makes it look like a table
    - Lazy loading
      - Create a field object only when it is needed
      - Reduces the overhead of creating unnecessary objects in Hive
      - Creating objects is expensive in Java
      - Increases performance
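The lazy-loading idea can be illustrated with a small Python class that keeps the raw line and converts a field into a typed object only on first access. The class name, the schema format, and the counter are assumptions for illustration; Hive's own lazy SerDe is written in Java and differs in detail.

```python
class LazyRow:
    """Hold the raw text of a row and build typed field objects only on demand."""

    def __init__(self, raw_line, schema, delimiter="\t"):
        self._raw = raw_line
        self._schema = schema  # list of (column name, converter) pairs
        self._delimiter = delimiter
        self._index = {name: i for i, (name, _) in enumerate(schema)}
        self._cache = {}
        self.objects_created = 0  # counts conversions, to make the laziness visible

    def get(self, name):
        """Return the typed value of a column, converting it on first access only."""
        if name not in self._cache:
            position = self._index[name]
            text = self._raw.split(self._delimiter)[position]
            convert = self._schema[position][1]
            self._cache[name] = convert(text)
            self.objects_created += 1
        return self._cache[name]


# Hypothetical three-column row.
row = LazyRow("42\thello\t3.5", [("id", int), ("word", str), ("score", float)])
created_before = row.objects_created   # nothing converted yet
first_id = row.get("id")               # converts only the "id" field
second_id = row.get("id")              # served from the cache
print(created_before, first_id, row.objects_created)
```

A query that never touches a column, such as a plain row count, would leave `objects_created` at zero for every row.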

Hive – Performance
  - Query A: SELECT count(1) FROM t;
  - Query B: SELECT concat(concat(concat(a,b),c),d) FROM t;
  - Query C: SELECT * FROM t;
  - Map-side time only (incl. GzipCodec for compression/decompression)

  Date    SVN Revision  Major Changes                Query A  Query B  Query C
  2/22/                 Before Lazy Deserialization  83 sec   98 sec   183 sec
  2/23/                 Lazy Deserialization         40 sec   66 sec   185 sec
  3/6/                  Map-side Aggregation         22 sec   67 sec   182 sec
  4/29/                 Object Reuse                 21 sec   49 sec   130 sec
  6/3/                  Map-side Join *              21 sec   48 sec   132 sec
  8/5/                  Lazy Binary Format *         21 sec   48 sec   132 sec

  * These two features need to be tested with other queries.
  Source: facebook-hive-and-hdfs

Pros
  - An easy way to process large-scale data
  - Supports SQL-based queries
  - Provides more user-defined interfaces for extension
  - Programmability
  - Efficient execution plans for performance
  - Interoperability with other database tools

Cons
  - No easy way to append data
    - Files in HDFS are immutable
  - Future work
    - Views / variables
    - More operators
      - IN / EXISTS semantics
    - More future work is listed on the mailing list

Application
  - Log processing
    - Daily reports
    - User activity measurement
  - Data/text mining
    - Machine learning (training data)
  - Business intelligence
    - Advertising delivery
    - Spam detection

Related Work
  - Parallel databases: Gamma, Bubba, Volcano
  - Google: Sawzall
  - Yahoo: Pig
  - IBM: JAQL
  - Microsoft: DryadLINQ, SCOPE

Reference
  [1] A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB '09.
  [2] Hadoop 2009: development-at-facebook-hive-and-hdfs
  [3] Cloudera: ve
  [4] Facebook Data Team: warehousing-analytics-on-hadoop-presentation

Q & A Thank you

Back up

Hive Components
  - Shell interface: like the MySQL shell
  - Driver
    - Session handles, fetch, execution
  - Compiler
    - Parse, plan, optimize
  - Execution engine
    - DAG of stages
    - Run map or reduce
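To make the "DAG of stages" point concrete, the Python sketch below runs a small, hypothetical stage graph in dependency order using the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The stage names and the `run_stages` helper are invented for illustration and do not reflect Hive's actual plan format or scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_stages(dependencies, run_stage):
    """Run every stage after all of the stages it depends on; return the order used."""
    executed = []
    for stage in TopologicalSorter(dependencies).static_order():
        run_stage(stage)
        executed.append(stage)
    return executed


# Hypothetical plan: two independent map/reduce stages feed a join stage,
# whose output is then fetched. Each key maps to the set of stages it depends on.
stage_dependencies = {
    "scan-and-aggregate-A": set(),
    "scan-and-aggregate-B": set(),
    "join-A-B": {"scan-and-aggregate-A", "scan-and-aggregate-B"},
    "fetch-results": {"join-A-B"},
}

log = []
order = run_stages(stage_dependencies, lambda stage: log.append("ran " + stage))
print(order)
```

The two scan stages have no dependency on each other, so either may run first; only the constraint that the join follows both, and the fetch follows the join, is guaranteed.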

Motivation
  - MapReduce motivation
    - Data processing: > 1 TB
    - Massively parallel
    - Locality
    - Fault tolerant

Hive Usage
  hive> show tables;
  hive> create table SHAKESPEARE (freq INT, word STRING) row format delimited fields terminated by '\t' stored as textfile;
  hive> load data inpath "shakespeare_freq" into table shakespeare;

Hive Usage
  hive> load data inpath "shakespeare_freq" into table shakespeare;
  hive> select * from shakespeare where freq > 100 sort by freq asc limit 10;

Hive Facebook
  - Statistics per day:
    - 4 TB of compressed new data added per day
    - 135 TB of compressed data scanned per day
    - Hive jobs on per day
  - Hive simplifies Hadoop:
    - ~200 people/month run jobs on Hadoop/Hive
    - Analysts (non-engineers) use Hadoop through Hive
    - 95% of jobs are Hive jobs
  Source: at-facebook-hive-and-hdfs