Introduction to Hive Liyin Tang

1 Introduction to Hive Liyin Tang

2 Outline
  - Motivation
  - Overview
  - Data Model / Metadata
  - Architecture
  - Performance
  - Cons and Pros
  - Application
  - Related Work
Introduction to Hive 6/11/2015

3 Motivation
  [Figure: data-flow diagram. Labeled components: Web Servers, Scribe MidTier, Scribe Writers, Realtime Hadoop Cluster, Hadoop Hive Warehouse, Oracle RAC, MySQL]

4 Motivation
  - Limitations of MapReduce
    - Have to use the M/R model
    - Not reusable
    - Error prone
    - For complex jobs:
      - Multiple stages of Map/Reduce functions
      - Like asking developers to hand-write the physical execution plan in a database

5 Overview
  - Intuition
    - Make unstructured data look like tables, regardless of how it is actually laid out
    - SQL-based queries can be issued directly against these tables
    - Generate a specific execution plan for each query
  - What is Hive
    - A data warehousing system that stores structured data on the Hadoop file system
    - Provides an easy way to query this data by executing Hadoop MapReduce plans

6 Data Model
  - Tables
    - Basic type columns (int, float, boolean)
    - Complex types: List / Map (associative array)
  - Partitions
  - Buckets
  - CREATE TABLE sales (id INT, items ARRAY) PARTITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
  - SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
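
The bucketing and sampling idea on this slide can be sketched in a few lines of Python. This is only an illustration under a stated assumption: rows are assigned to buckets by taking the integer id modulo the bucket count, which may not match the hash Hive actually applies. The names bucket_of, tablesample, and the sales rows are hypothetical.

```python
NUM_BUCKETS = 32  # matches INTO 32 BUCKETS in the slide's DDL


def bucket_of(row_id, num_buckets=NUM_BUCKETS):
    """Return a 1-based bucket number for an integer id.

    Assumption: the bucket is simply id modulo the bucket count.
    """
    return row_id % num_buckets + 1


def tablesample(rows, bucket, out_of=NUM_BUCKETS):
    """Keep only the rows whose id falls into the requested bucket."""
    return [row for row in rows if bucket_of(row["id"], out_of) == bucket]


# Hypothetical sales rows with ids 0..99.
sales = [{"id": i} for i in range(100)]
sampled_ids = [row["id"] for row in tablesample(sales, bucket=1)]
print(sampled_ids)  # [0, 32, 64, 96]
```

Because the rows are already grouped by bucket on disk, a sample like this only needs to read one bucket out of 32 rather than scan the whole table.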

7 Metadata
  - Database namespace
  - Table definitions
    - Schema info, physical location in HDFS
  - Partition data
  - ORM framework
    - Metadata is stored in Derby by default
    - Any database with a JDBC driver can be configured instead

8 Architecture
  [Figure: Hive architecture diagram. Labeled components: Web UI + Hive CLI + JDBC/ODBC (Browse, Query, DDL); MetaStore; Thrift API; Hive QL (Parser, Planner, Optimizer, Execution); SerDe (CSV, Thrift, Regex); UDF/UDAF (substr, sum, average); FileFormats (TextFile, SequenceFile, RCFile); User-defined Map-reduce Scripts; Map Reduce; HDFS]
  Source: facebook-hive-and-hdfs

9 Performance
  - GROUP BY operation
    - Efficient execution plans based on:
      - Data skew:
        - How evenly the data is distributed across the physical nodes
        - Bottleneck vs. load balance
      - Partial aggregation:
        - Group rows with the same GROUP BY value as early as possible
        - In-memory hash table in the mapper
        - Happens earlier than the combiner
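
A minimal sketch of the partial aggregation idea follows, assuming a SUM aggregate grouped by one key. It is not Hive's implementation. Each simulated mapper keeps an in-memory hash table and emits one subtotal per key instead of one record per input row. The function names and the two input splits are hypothetical.

```python
from collections import defaultdict


def map_with_partial_aggregation(rows):
    """Simulated mapper: sum values per key in an in-memory hash table."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    # One (key, subtotal) pair per distinct key leaves the mapper.
    return list(partial.items())


def reduce_sum(mapper_outputs):
    """Simulated reducer: combine the subtotals from all mappers."""
    totals = defaultdict(int)
    for output in mapper_outputs:
        for key, subtotal in output:
            totals[key] += subtotal
    return dict(totals)


# Two hypothetical input splits of (group key, value) rows.
split_1 = [("a", 1), ("b", 2), ("a", 3)]
split_2 = [("b", 4), ("a", 5)]

mapper_outputs = [map_with_partial_aggregation(s) for s in (split_1, split_2)]
totals = reduce_sum(mapper_outputs)
print(totals)  # {'a': 9, 'b': 6}
```

In this toy run the mappers emit four records instead of five. On skewed data, where a few keys dominate, the reduction in shuffled records is much larger.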

10 Performance
  - JOIN operation
    - Traditional Map-Reduce join
    - Early map-side join
      - Very efficient for joining a small table with a large table
      - Keep the smaller table's data in memory first
      - Join with one chunk of the larger table's data at a time
      - Trades space for time
Introduction to Hive 7/20/2010
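
The map-side join described above can be sketched as a hash join in which the small table is loaded into memory and the large table is streamed chunk by chunk. This is an illustrative sketch, not Hive's code. The users and clicks tables and the function name are hypothetical.

```python
def map_side_join(small_table, large_table_chunks, key):
    """Join each chunk of the large table against an in-memory small table."""
    # Build the in-memory lookup once: join key -> matching small-table rows.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    # Stream the large table one chunk at a time; no reduce phase is needed.
    for chunk in large_table_chunks:
        for row in chunk:
            for match in lookup.get(row[key], []):
                yield {**match, **row}


# Hypothetical data: a small users table and a large clicks table in chunks.
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
click_chunks = [
    [{"uid": 1, "url": "/a"}, {"uid": 3, "url": "/b"}],
    [{"uid": 2, "url": "/c"}, {"uid": 1, "url": "/d"}],
]

joined = list(map_side_join(users, click_chunks, "uid"))
print([(r["name"], r["url"]) for r in joined])
# [('ann', '/a'), ('bob', '/c'), ('ann', '/d')]
```

The memory cost is the size of the small table, which is the space-for-time trade-off the slide mentions.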

11 Performance
  - SerDe
    - Describes how to load data from a file into a representation that makes it look like a table
    - Lazy loading
      - Create a field object only when it is needed
      - Reduces the overhead of creating unnecessary objects in Hive
      - Object creation is expensive in Java
      - Improves performance
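
Lazy deserialization can be illustrated with a small row wrapper that keeps the raw text and parses a field only when it is first requested. This is a sketch of the general technique, not Hive's SerDe API. The LazyRow class, its parsed_count counter, and the sample line are hypothetical.

```python
class LazyRow:
    """Holds a raw delimited line and parses fields only on first access."""

    def __init__(self, raw_line, names, types, delimiter="\t"):
        self._raw = raw_line
        self._names = names
        self._types = types
        self._delimiter = delimiter
        self._fields = None      # raw text pieces, split on first access
        self._cache = {}         # column name -> already parsed value
        self.parsed_count = 0    # how many field objects were created

    def get(self, name):
        if name not in self._cache:
            if self._fields is None:
                self._fields = self._raw.split(self._delimiter)
            index = self._names.index(name)
            self._cache[name] = self._types[index](self._fields[index])
            self.parsed_count += 1
        return self._cache[name]


# A query that touches only "freq" never converts "word" or "score".
row = LazyRow("42\tthe\t3.5", ["freq", "word", "score"], [int, str, float])
freq = row.get("freq")
print(freq, row.parsed_count)  # 42 1
```

A query such as SELECT count(1) touches no columns at all, so under this scheme no field objects would be created for it.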

12 Hive – Performance
  - QueryA: SELECT count(1) FROM t;
  - QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
  - QueryC: SELECT * FROM t;
  - Map-side time only (incl. GzipCodec for compression/decompression)

  Date       SVN Revision  Major Changes                Query A  Query B  Query C
  2/22/2009  746906        Before Lazy Deserialization  83 sec   98 sec   183 sec
  2/23/2009  747293        Lazy Deserialization         40 sec   66 sec   185 sec
  3/6/2009   751166        Map-side Aggregation         22 sec   67 sec   182 sec
  4/29/2009  770074        Object Reuse                 21 sec   49 sec   130 sec
  6/3/2009   781633        Map-side Join *              21 sec   48 sec   132 sec
  8/5/2009   801497        Lazy Binary Format *         21 sec   48 sec   132 sec

  * These two features need to be tested with other queries.
  Source: facebook-hive-and-hdfs

13 Pros
  - An easy way to process large-scale data
  - Supports SQL-based queries
  - Provides user-defined interfaces for extension
  - Programmability
  - Efficient execution plans for performance
  - Interoperability with other database tools

14 Cons
  - No easy way to append data
    - Files in HDFS are immutable
  - Future work
    - Views / Variables
    - More operators
      - IN/EXISTS semantics
    - More future work is listed on the mailing list

15 Application
  - Log processing
  - Daily Report
  - User Activity Measurement
  - Data/Text mining
  - Machine learning (Training Data)
  - Business intelligence
  - Advertising Delivery
  - Spam Detection

16 Related Work
  - Parallel databases: Gamma, Bubba, Volcano
  - Google: Sawzall
  - Yahoo: Pig
  - IBM: JAQL
  - Microsoft: DryadLINQ, SCOPE

17 Reference
  [1] A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB '09, 2009.
  [2] Hadoop 2009: development-at-facebook-hive-and-hdfs
  [3] Cloudera: ve
  [4] Facebook Data Team: warehousing-analytics-on-hadoop-presentation

20 Hive Components
  - Shell interface: like the MySQL shell
  - Driver:
    - Session handles, fetch, execution
  - Compiler:
    - Parse, plan, optimize
  - Execution engine:
    - DAG of stages
    - Runs map or reduce
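
The "DAG of stages" idea in the execution engine can be sketched as running each stage only after the stages it depends on have finished. This is a simplified illustration using Python's standard graphlib module (Python 3.9 or later), not Hive's execution engine. The stage names and the dependency chain are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_plan(stages, depends_on):
    """Run every stage after its dependencies; return the execution order."""
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        stages[name]()
    return order


executed = []
stage_names = ["map_1", "reduce_1", "map_2", "reduce_2"]

# Each hypothetical stage just records that it ran.
stages = {name: (lambda n=name: executed.append(n)) for name in stage_names}

# Stage -> set of stages that must finish first (a simple two-job chain).
depends_on = {
    "map_1": set(),
    "reduce_1": {"map_1"},
    "map_2": {"reduce_1"},
    "reduce_2": {"map_2"},
}

order = run_plan(stages, depends_on)
print(order)  # ['map_1', 'reduce_1', 'map_2', 'reduce_2']
```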

21 Motivation
  - MapReduce motivation
    - Data processing: > 1 TB
    - Massively parallel
    - Locality
    - Fault tolerant

22 Hive Usage
  hive> show tables;
  hive> create table SHAKESPEARE (freq INT, word STRING) row format delimited fields terminated by '\t' stored as textfile;
  hive> load data inpath "shakespeare_freq" into table shakespeare;

23 Hive Usage
  hive> select * from shakespeare where freq > 100 sort by freq asc limit 10;

24 Hive Usage @ Facebook
  - Statistics per day:
    - 4 TB of compressed new data added per day
    - 135 TB of compressed data scanned per day
    - 7500+ Hive jobs per day
  - Hive simplifies Hadoop:
    - ~200 people/month run jobs on Hadoop/Hive
    - Analysts (non-engineers) use Hadoop through Hive
    - 95% of jobs are Hive jobs
  Source: at-facebook-hive-and-hdfs

