Download presentation
Presentation is loading. Please wait.
1
Introduction to Hive Liyin Tang liyintan@usc.edu
2
Outline Motivation Overview Data Model / Metadata Architecture Performance Cons and Pros Application Related Work 2 Introduction to Hive6/11/2015
3
Motivation 3 Introduction to Hive6/11/2015 Web Servers Scribe Writers Realtime Hadoop Cluster Hadoop Hive WarehouseOracle RAC MySQL Scribe MidTier http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
4
Motivation Limitation of MR Have to use M/R model Not Reusable Error prone For complex jobs: Multiple stage of Map/Reduce functions Just like ask dev to write specify physical execution plan in the database 4 Introduction to Hive6/11/2015
5
Overview Intuitive Make the unstructured data looks like tables regardless how it really lay out SQL based query can be directly against these tables Generate specify execution plan for this query What’s Hive A data warehousing system to store structured data on Hadoop file system Provide an easy query these data by execution Hadoop MapReduce plans 5 Introduction to Hive6/11/2015
6
Data Model Tables Basic type columns (int, float, boolean) Complex type: List / Map ( associate array) Partitions Buckets CREATE TABLE sales( id INT, items ARRAY ) PARITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS; SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32) 6 Introduction to Hive6/11/2015
7
Metadata Database namespace Table definitions schema info, physical location In HDFS Partition data ORM Framework All the metadata can be stored in Derby by default Any database with JDBC can be configed 7 Introduction to Hive6/11/2015
8
8 HDFS Map Reduce Web UI + Hive CLI + JDBC/ODBC Browse, Query, DDL MetaStore Thrift API Hive QL Parser Planner Optimizer Execution SerDe CSV Thrift Regex UDF/UDAF substr sum average FileFormats TextFile SequenceFile RCFile User-defined Map-reduce Scripts Architecture http://www.slideshare.net/cloudera/hw09-hadoop-development-at- facebook-hive-and-hdfs
9
Performance GROUP BY operation Efficient execution plans based on: Data skew: how evenly distributed data across a number of physical nodes bottleneck VS load balance Partial aggregation: Group the data with the same group by value as soon as possible In memory hash-table for mapper Earlier than combiner 9 Introduction to Hive6/11/2015
10
Performance JOIN operation Traditional Map-Reduce Join Early Map-side Join very efficient for joining a small table with a large table Keep smaller table data in memory first Join with a chunk of larger table data each time Space complexity for time complexity 10 Introduction to Hive7/20/2010
11
Performance Ser/De Describe how to load the data from the file into a representation that make it looks like a table; Lazy load Create the field object when necessary Reduce the overhead to create unnecessary objects in Hive Java is expensive to create objects Increase performance 11 Introduction to Hive7/20/2010
12
Hive – Performance QueryA: SELECT count(1) FROM t; QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t; QueryC: SELECT * FROM t; map-side time only (incl. GzipCodec for comp/decompression) * These two features need to be tested with other queries. http://www.slideshare.net/cloudera/hw09-hadoop-development-at- facebook-hive-and-hdfs DateSVN RevisionMajor ChangesQuery AQuery BQuery C 2/22/2009746906Before Lazy Deserialization83 sec98 sec183 sec 2/23/2009747293Lazy Deserialization40 sec66 sec185 sec 3/6/2009751166Map-side Aggregation22 sec67 sec182 sec 4/29/2009770074Object Reuse21 sec49 sec130 sec 6/3/2009781633Map-side Join *21 sec48 sec132 sec 8/5/2009801497Lazy Binary Format *21 sec48 sec132 sec
13
Pros Pros A easy way to process large scale data Support SQL-based queries Provide more user defined interfaces to extend Programmability Efficient execution plans for performance Interoperability with other database tools 13 Introduction to Hive6/11/2015
14
Cons Cons No easy way to append data Files in HDFS are immutable Future work Views / Variables More operator In/Exists semantic More future work in the mail list 14 Introduction to Hive6/11/2015
15
Application Log processing Daily Report User Activity Measurement Data/Text mining Machine learning (Training Data) Business intelligence Advertising Delivery Spam Detection 15 Introduction to Hive7/20/2010
16
Related Work Parallel databases: Gamma, Bubba, Volcano Google: Sawzall Yahoo: Pig IBM: JAQL Microsoft: DradLINQ, SCOPE 16 Introduction to Hive7/20/2010
17
Reference [1] A.Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB09', 2009. [2] Hadoop 2009: http://www.slideshare.net/cloudera/hw09-hadoop- development-at-facebook-hive-and-hdfs http://www.slideshare.net/cloudera/hw09-hadoop- development-at-facebook-hive-and-hdfs [4] Facebook Data Team: http://www.slideshare.net/zshao/hive-data- warehousing-analytics-on-hadoop-presentation http://www.slideshare.net/zshao/hive-data- warehousing-analytics-on-hadoop-presentation [3] Cloudera: http://www.cloudera.com/videos/introduction_to_hi ve http://www.cloudera.com/videos/introduction_to_hi ve 17 Introduction to Hive7/20/2010
18
Q & A Thank you
19
Back up
20
Hive Components Shell Interface: Like the MySQL shell Driver: Session handles, fetch, exeucition Complier: Prarse,plan,optimzie Execution Engine: DAG stage Run map or reduce 20 Introduction to Hive7/20/2010
21
Motivation MapReduce Motivation Data processing: > 1 TB Massively parallel Locality Fault Tolerant 21 Introduction to Hive7/20/2010
22
Hive Usage hive> show tables; hive> create table SHAKESPEARE (freq INT,word STRING) row format delimited fields terminated by ‘\t’ stored as textfile hive> load data inpath “shakespeare_freq” into table shakespeare; 22 Introduction to Hive
23
Hive Usage hive> load data inpath “shakespeare_freq” into table shakespeare; hive> select * from shakespeare where freq>100 sort by freq asc limit 10; 23 Introduction to Hive
24
Hive Usage @ Facebook Statistics per day: 4 TB of compressed new data added per day 135TB of compressed data scanned per day 7500+ Hive jobs on per day Hive simplifies Hadoop: ~200 people/month run jobs on Hadoop/Hive Analysts (non-engineers) use Hadoop through Hive 95% of jobs are Hive Jobs http://www.slideshare.net/cloudera/hw09-hadoop-development- at-facebook-hive-and-hdfshttp://www.slideshare.net/cloudera/hw09-hadoop-development- at-facebook-hive-and-hdfs 24 Introduction to Hive7/20/2010
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.