Published by Gregory Pitts. Modified over 9 years ago.
1
Hive : A Petabyte Scale Data Warehouse Using Hadoop
Lecturer: Prof. Kyungbaek Kim
Presenter: Alvin Prayuda Juniarta Dwiyantoro
2
Contents
Background
Description
System Architecture
Data Types
Operations
Data Models
SerDe
Installation Guide
Practical Example
3
Background
Hadoop
Pros: Superior scalability, availability, and manageability. Efficiency scales with more hardware.
Cons: MapReduce is hard to program (most users know SQL). Data needs to be published in a well-known structure.
Hive is the solution.
4
Description
Hive is data warehouse software that facilitates querying and managing large datasets in distributed storage.
It provides access to files stored in HDFS and executes queries via MapReduce.
Hive uses a simple SQL-like query language so that users familiar with SQL can query the data.
Hive is not designed for OLTP (online transaction processing) and does not offer real-time queries.
Hive's strengths are scalability, extensibility, fault tolerance, and loose coupling with its input formats.
5
System Architecture
UI: The user interface for users to submit queries and other operations to the system. As of 2011 the system had a command line interface, and a web-based GUI was being developed.
Driver: The component which receives the queries. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
Compiler: The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.
Metastore: The component that stores all the structure information of the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored.
Execution Engine: The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components.
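To make the "DAG of stages" idea concrete, here is a minimal Python sketch of how an engine might order stages so that each one runs only after the stages it depends on. The stage names and the `run_plan` helper are hypothetical and are not taken from Hive; the ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
plan = {
    "scan-left": set(),
    "scan-right": set(),
    "join": {"scan-left", "scan-right"},
    "aggregate": {"join"},
}


def run_plan(plan):
    """Visit stages in an order that respects their dependencies.

    Returns the order in which the stages were visited.
    """
    executed = []
    for stage in TopologicalSorter(plan).static_order():
        # A real engine would submit the stage's job here and wait for it.
        executed.append(stage)
    return executed


order = run_plan(plan)
print(order)
```

Both scan stages come before the join, and the join comes before the aggregate; the relative order of the two independent scans is not fixed.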
6
Data Types
Data types supported by the current Hive release (v0.13.1):
Numeric types: tinyint, smallint, int, bigint, float, double, decimal (user-defined precision and scale)
Date/time: timestamp, date
String: string, varchar, char
Misc: boolean, binary
Complex: arrays, maps, structs, union
7
Operations
Three kinds of operations in Hive:
Data Definition Language (DDL) operations
Data Manipulation Language (DML) operations
Structured Query Language (SQL) operations
8
Operations
DDL Operations
Create/drop/alter database
Use database
Create/drop/truncate table
Alter table/partition/column
Create/drop/alter view
Create/drop/alter index
Create/drop function
Create/drop/grant/revoke roles and privileges
Show
Describe
Export/Import
9
Operations Example DDL Create table Alter table Drop table
10
Operations
DML Operations
Loading files into tables
Inserting data into Hive tables from queries
Writing data into the filesystem from queries
Inserting values into tables from SQL (v0.14)
Updating values in tables from SQL (v0.14)
Deleting values in tables from SQL (v0.14)
11
Operations Example DML Load data into table Write data into files
12
Operations
SQL Operations
Select and Filters
Group By
Insert Overwrite and Insert Into
Join
Multitable insert
Extensibility
Pluggable map-reduce scripts using TRANSFORM
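As a rough illustration of a pluggable script, the Python sketch below shows the kind of program that could sit behind a transform step. It relies on the fact that Hive streams rows to such a script as tab-separated lines and reads tab-separated lines back. The column layout (a user id and a Unix timestamp) and the function name `transform_lines` are assumptions made for this example only.

```python
from datetime import datetime, timezone


def transform_lines(lines):
    """Map each 'user_id<TAB>unix_time' row to 'user_id<TAB>iso_weekday'.

    The two-column layout is hypothetical; a real script must match the
    columns that the query passes to it.
    """
    for line in lines:
        user_id, unix_time = line.rstrip("\n").split("\t")
        # Interpret the timestamp in UTC so the result does not depend on
        # the local time zone. isoweekday(): Monday is 1, Sunday is 7.
        moment = datetime.fromtimestamp(int(unix_time), tz=timezone.utc)
        yield f"{user_id}\t{moment.isoweekday()}"


# Stand-in for the rows Hive would stream to the script on standard input.
sample_rows = ["7\t0\n", "8\t86400\n"]
for output_row in transform_lines(sample_rows):
    print(output_row)
```

Used as an actual transform script, the loop would iterate over `sys.stdin` instead of `sample_rows`, and the script would be attached to a query through Hive's `TRANSFORM ... USING` clause.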
13
Operations Example SQL Show data Insert data with select statement
Group by
14
Operations Join Multitable Insert
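The query shapes named on these slides (select with a filter, group by, join, and insert from a select) can be tried out without a Hive cluster. The sketch below runs them against an in-memory SQLite database purely as a stand-in: SQLite is not Hive, the table and column names are invented for illustration, and Hive-specific forms such as INSERT OVERWRITE and multitable insert have no direct equivalent here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, created only for this illustration.
conn.executescript("""
CREATE TABLE page_views (user_id INTEGER, page TEXT);
CREATE TABLE users (user_id INTEGER, country TEXT);
CREATE TABLE views_per_country (country TEXT, views INTEGER);
INSERT INTO page_views VALUES (1, 'home'), (1, 'about'), (2, 'home');
INSERT INTO users VALUES (1, 'KR'), (2, 'ID');
""")

# Select with a filter.
home_viewers = conn.execute(
    "SELECT user_id FROM page_views WHERE page = 'home' ORDER BY user_id"
).fetchall()

# Group by: number of views per page.
views_per_page = conn.execute(
    "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
).fetchall()

# Insert the result of a join plus aggregation into another table.
conn.execute("""
INSERT INTO views_per_country
SELECT u.country, COUNT(*)
FROM page_views pv JOIN users u ON pv.user_id = u.user_id
GROUP BY u.country
""")
views_per_country = conn.execute(
    "SELECT country, views FROM views_per_country ORDER BY country"
).fetchall()

print(home_viewers)
print(views_per_page)
print(views_per_country)
```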
15
Data Models
Data in Hive is organized into:
Table: A table model like in a relational database, stored in HDFS.
Partition: A partition of a table, stored in a sub-directory within the table’s directory. Partitions allow the system to prune the data to be inspected based on query predicates.
Example: a query that is interested in rows from T that satisfy the predicate T.ds = ' ' would only have to look at files in the <table location>/ds= / directory in HDFS.
Bucket: Data in each partition may in turn be divided into buckets based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of the data.
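The partition and bucket layout described above can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the table location `/warehouse/T`, the `ds` values, the helper names, and in particular the hash, where `zlib.crc32` is only a deterministic stand-in and not the hash function Hive uses.

```python
import zlib


def partition_dir(table_location, column, value):
    """Build the sub-directory that holds one partition of a table."""
    return f"{table_location}/{column}={value}"


def prune(partition_dirs, column, wanted):
    """Keep only the partition directories that can satisfy column = wanted."""
    suffix = f"/{column}={wanted}"
    return [d for d in partition_dirs if d.endswith(suffix)]


def bucket_for(value, num_buckets):
    """Assign a column value to a bucket by hashing it.

    crc32 is a stand-in hash chosen because it is deterministic across
    runs; it is NOT Hive's hash function.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical table location and partition values.
dirs = [partition_dir("/warehouse/T", "ds", day)
        for day in ("2014-01-01", "2014-01-02", "2014-01-03")]

# A predicate on ds means only one directory has to be read.
to_scan = prune(dirs, "ds", "2014-01-02")
print(to_scan)

# Rows with the same column value always land in the same bucket file.
print(bucket_for("user-42", 4))
```

Because every row with the same column value lands in the same bucket, reading a single bucket file gives a sample that is consistent with respect to that column.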
16
SerDe
SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization, and also interprets the results of serialization as individual fields for processing. A SerDe allows Hive to read data from a table and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.
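The two directions of a SerDe can be illustrated with a small Python analogy. This is not Hive's SerDe interface (which is written in Java); the class name, the pipe-delimited format, and the column names are all hypothetical and only show the idea of turning a stored record into named fields and back.

```python
class DelimitedSerDe:
    """Toy serializer/deserializer for single-line delimited records."""

    def __init__(self, columns, delimiter="|"):
        self.columns = list(columns)
        self.delimiter = delimiter

    def deserialize(self, raw):
        """Split a stored record into a dict of named fields."""
        values = raw.rstrip("\n").split(self.delimiter)
        if len(values) != len(self.columns):
            raise ValueError(
                f"expected {len(self.columns)} fields, got {len(values)}"
            )
        return dict(zip(self.columns, values))

    def serialize(self, row):
        """Join the named fields back into the stored record format."""
        return self.delimiter.join(str(row[name]) for name in self.columns)


serde = DelimitedSerDe(["user_id", "country"])
row = serde.deserialize("42|KR\n")
print(row)
print(serde.serialize(row))
```

Supporting a different custom format would mean swapping in another class with the same two methods, which mirrors the point that anyone can write a SerDe for their own data format.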
17
Projects Related to Hive
Shark: A fork of Apache Hive that uses Spark instead of MapReduce
Hivemall: A machine-learning library for Hive
Apache Sentry: A role-based authorization system for Hive