
1 HIVE – A PETABYTE SCALE DATA WAREHOUSE USING HADOOP – Abhilash Veeragouni. Paper by Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy

2 Outline ■Motivation and Origination of Hive ■Hive components ■Data Model and Type System ■Hive Query Language (HiveQL) ■Data storage, SerDe and File Formats ■System Architecture and Components ■Hive Usage ■Performance Benchmark results ■Conclusion and related work

3 Motivation ■Problem: Size of data sets in industry is growing rapidly!

4 Distribute the work over many machines! ■Hadoop was used to store and process large data-sets in companies like Facebook, Yahoo, etc.

5 What is Hadoop? ■Hadoop is an Apache open source framework that allows distributed processing of large data sets across clusters of commodity computers using simple programming models.

6 Map reduce in Hadoop

7 Hadoop was not easy! ■HDFS (Hadoop Distributed File System) is used to store the data and MapReduce (MR) is used to process it ■MR code is mostly written in Java and is very low level ■Difficult for business analysts to write scalable programs ■Hard to maintain and reuse ■Error-prone code

8 Hive Origination – Jan 2007 ■Intuition – Make the unstructured data look like tables – SQL-like queries can be written against these tables ■What is Hive? – Open source data warehousing solution built on top of Hadoop – Supports queries expressed in a SQL-like declarative language, HiveQL – Queries are compiled into map-reduce jobs that run on Hadoop – Extensible framework that supports different file and data formats

9 Hive Components (architecture overview) ■Interfaces: JDBC/ODBC and Thrift API for browse, query and DDL operations ■HiveQL: Parser, Planner, Optimizer, Execution ■Metastore ■Extensibility: user-defined MapReduce scripts, UDF/UDAF (e.g. substr, sum, average), SerDe (CSV, Thrift, Regex), file formats (TextFile, SequenceFile, RCFile) ■Runs on MapReduce and HDFS

10 Data Model and Type System ■Hive stores data in tables: – Analogous to tables in relational databases – Rows and columns – Supports primitive types – int, float, double, string – And complex types – maps (associative arrays), lists and structs – And nested structures of arbitrary complexity ■Example: CREATE TABLE t1(st string, fl float, li list<map<string, struct<p1:int, p2:int>>>); – Fields within structs are accessed with the '.' operator and elements of maps or lists with the '[]' operator, e.g. t1.li[0]['key'].p2

11 Data Model and Type System ■Tables are serialized and deserialized using Hive's default serializers and deserializers ■What if the data is produced by other programs, or is legacy data? ■Users can provide a jar that implements the SerDe Java interface to Hive ■Any arbitrary data format and encoding of types can be plugged into Hive by providing such a jar ■Example: add jar /jars/myformat.jar; CREATE TABLE t2 ROW FORMAT SERDE 'com.myformat.MySerDe';

12 Hive Query Language (HiveQL) ■Subset of SQL plus extensions ■Traditional SQL features are supported: from-clause subqueries; inner, left outer, right outer and full outer joins; Cartesian products; group bys and aggregations; union all; etc. ■Examples: – Only equality predicates are supported in a join predicate, using ANSI join syntax:
SELECT t1.a1 AS c1, t2.b1 AS c2
FROM t1 JOIN t2 ON (t1.a2 = t2.b2);
– Starting with Hive 0.13.0, implicit join notation is also supported:
SELECT *
FROM table1 t1, table2 t2, table3 t3
WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535';

13 Hive Query Language (HiveQL) ■HiveQL has extensions to support analyses expressed as map-reduce programs by users, in the programming language of their choice ■Example: – The canonical word count on a table of documents can be expressed as follows (a sketch of the two scripts is shown below):
FROM (
  MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
  FROM docs
  CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
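The contents of wc_mapper.py and wc_reduce.py are not given in the paper; the following is a minimal sketch, assuming Hive's usual streaming convention of tab-separated columns on stdin and stdout (names and details here are illustrative, not the authors' code).

#!/usr/bin/env python
# wc_mapper.py (hypothetical): one document text per input row on stdin;
# emit one "word<TAB>1" line per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

#!/usr/bin/env python
# wc_reduce.py (hypothetical): input arrives clustered by word (CLUSTER BY
# word), so all counts for one word are contiguous and can be summed in a
# single pass.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, cnt = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, total))
        current_word, total = word, 0
    total += int(cnt)
if current_word is not None:
    print("%s\t%d" % (current_word, total))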

14 More examples ■What if the data sent to the reducers needs to be sorted on a set of columns different from the ones used to distribute it between the mappers and reducers? ■Example – find all the actions in a session, sorted by time (a sketch of the reducer script follows):
FROM (
  FROM session_table
  SELECT sessionid, tstamp, data
  DISTRIBUTE BY sessionid SORT BY tstamp
) a
REDUCE sessionid, tstamp, data USING 'session_reducer.sh';
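session_reducer.sh is not shown in the paper; a hedged Python stand-in, assuming tab-separated sessionid, tstamp and data columns on stdin, could look like this. It relies on DISTRIBUTE BY sending every row of a session to the same reducer and SORT BY delivering the reducer input in time order.

#!/usr/bin/env python
# Hypothetical stand-in for session_reducer.sh: all rows of a sessionid
# reach this reducer, and the input stream is already sorted by tstamp,
# so appending rows in arrival order keeps each session in time order.
import sys
from collections import defaultdict

sessions = defaultdict(list)
for line in sys.stdin:
    sessionid, tstamp, data = line.rstrip("\n").split("\t", 2)
    sessions[sessionid].append((tstamp, data))

for sessionid, actions in sessions.items():
    for tstamp, data in actions:  # already in time order
        print("%s\t%s\t%s" % (sessionid, tstamp, data))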

15 More examples ■Hive also allows the order of the FROM and SELECT/MAP/REDUCE clauses to be interchanged within a given sub-query – very helpful for multi-table inserts, since the input data is scanned only once ■Example:
FROM t3
INSERT OVERWRITE TABLE t2
  SELECT t3.c2, count(1)
  WHERE t3.c1 <= 20
  GROUP BY t3.c2
INSERT OVERWRITE DIRECTORY '/output_dir'
  SELECT t3.c2, avg(t3.c1)
  WHERE t3.c1 > 20 AND t3.c1 <= 30
  GROUP BY t3.c2;

16 Data Storage ■Tables are logical units in Hive ■Table metadata associates the data in a table with HDFS directories ■Primary data units and their mappings: –Tables – stored in a directory in HDFS –Partitions – stored in sub-directories of the table's directory –Buckets – stored in a file within the partition's or table's directory ■Example – creating a partitioned table:
CREATE TABLE test_part(c1 string, c2 int) PARTITIONED BY (ds string, hr int);

17 Data storage – contd. ■The table's partitions are stored under the /test_part directory in HDFS, e.g. /user/hive/warehouse/test_part ■A partition exists for every distinct combination of ds and hr values supplied by the user ■A new partition can be created through an INSERT statement or through an ALTER statement that adds a partition to the table ■Example (the resulting directory layout is sketched below):
INSERT OVERWRITE TABLE test_part PARTITION(ds='2016-09-01', hr=10) SELECT * FROM t;
ALTER TABLE test_part ADD PARTITION(ds='2016-09-02', hr=11);
■The Hive compiler uses this directory structure to prune the partitions that have to be scanned when evaluating a query ■The values of the partition keys are stored with the metadata of the table
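For example, with the default warehouse location (an assumption here; the location is configurable), the two partitions created above would correspond to HDFS directories such as

/user/hive/warehouse/test_part/ds=2016-09-01/hr=10/
/user/hive/warehouse/test_part/ds=2016-09-02/hr=11/

and a query with the predicate ds = '2016-09-01' would only have to scan the first of them.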

18 Data storage - Buckets ■A bucket is a file within the leaf-level directory of a table or a partition ■The user can specify the number of buckets and the column on which to bucket the data when creating the table ■Example:
–CREATE TABLE sales (id INT, items ARRAY<STRUCT<id:INT, name:STRING>>) PARTITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
■This information is used to prune the data when the user runs a query on a sample of the data:
–SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32);
■Bucketing also makes map-side joins faster (see the note below)
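Continuing the example: each leaf directory of sales would then hold 32 bucket files, a row's bucket being chosen by hashing the id column modulo 32. TABLESAMPLE (BUCKET 1 OUT OF 32) therefore only has to read one of those files, and when two tables are bucketed on the join column, a mapper can join the corresponding bucket files directly, which is why map-side joins get faster. (The exact bucket file names, e.g. 000000_0, depend on the job that wrote the data.)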

19 Serialization/Deserialization (SerDe) ■Hive can take an implementation of the SerDe Java interface provided by the user and associate it with a table or partition ■LazySerDe (the default): –Lazily deserializes rows into internal objects –Assumes rows are delimited by a newline (ASCII code 10) –Columns within a row are delimited by ctrl-A (ASCII code 1) ■Other column and row delimiters can be chosen:
CREATE TABLE test_delimited(c1 string, c2 int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002' LINES TERMINATED BY '\012';

20 SerDe - Contd ■Delimiters can also be specified for the serialized keys and values of maps and lists, e.g.:
CREATE TABLE test_delimited2(c1 string, c2 list<map<string, int>>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
COLLECTION ITEMS TERMINATED BY '\003'
MAP KEYS TERMINATED BY '\004';
■More SerDe implementations are provided in hive_contrib.jar, e.g. RegexSerDe, which lets the user specify regular expressions for parsing columns out of rows

21 File Formats ■Hadoop files can be stored in different formats, e.g. TextInputFormat for text files, SequenceFileInputFormat for binary files, etc. ■Users can implement their own formats and associate them with a table ■The format can be specified when the table is created, and no restrictions are imposed by Hive ■Example:
CREATE TABLE dest1(key INT, value STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat';

22 System Architecture and components

23 System Architecture and components ■Metastore – The component that stores the system catalog and metadata about tables, columns, partitions, etc. – Stored in a traditional RDBMS ■(Architecture diagram: the CLI, web interface and Thrift server with JDBC/ODBC clients feed the Driver – compiler, optimizer, executor – which talks to the Metastore)

24 System Architecture and components ■Driver – The component that manages the lifecycle of a HiveQL statement as it moves through Hive – Also maintains a session handle and any session statistics

25 System Architecture and components ■Query Compiler – The component that compiles HiveQL into a directed acyclic graph (DAG) of map/reduce tasks

26 System Architecture and components ■Optimizer – Applies a chain of transformations; the operator DAG resulting from one transformation is passed as input to the next

27 System Architecture and components ■Execution Engine – The component that executes the tasks produced by the compiler in proper dependency order – Interacts with the underlying Hadoop instance

28 System Architecture and components ■HiveServer – The component that provides a Thrift interface and a JDBC/ODBC server – Provides a way of integrating Hive with other applications

29 System Architecture and components ■Client components – The Command Line Interface (CLI), the web UI and the JDBC/ODBC drivers

30 System Architecture and components ■Extensibility interfaces – SerDe and ObjectInspector interfaces – UDF (User Defined Function) and UDAF (User Defined Aggregate Function) interfaces that enable users to define custom functions

31 Query flow

32 ■A HiveQL statement is submitted via the CLI, the web UI or an external client using the Thrift, ODBC or JDBC interfaces ■The driver first passes the query to the compiler, where the typical parsing, type checking and semantic analysis are done using the metadata ■The compiler generates a logical plan ■The plan is then optimized through a rule-based optimizer to produce a DAG of map-reduce and HDFS tasks ■The execution engine then executes these tasks in the order of their dependencies using Hadoop

33 Metastore ■The system catalog for Hive: stores all the information about tables, their partitions, schemas, columns and types, table locations, SerDe information, etc. ■Can be queried or modified using a Thrift interface ■The information is stored in a traditional RDBMS ■Uses an open source ORM layer called DataNucleus to convert object representations into a relational schema and vice versa ■Scalability of the Metastore server is ensured by making sure no metadata calls are made from the mappers or reducers of a job ■Instead, XML plan files generated by the compiler contain all the information needed at run time

34 Query Compiler ■Parser: uses Antlr to generate an abstract syntax tree (AST) for the query ■Semantic Analyser: – The compiler fetches all the required information from the Metastore – Column names are verified, and type-checking and implicit type conversions are performed – Transforms the AST into an internal query representation, the Query Block (QB) tree ■Logical Plan Generator: – Converts the internal query representation into a logical plan – a tree of operators (an operator DAG) – Some operators are relational algebra operators such as 'filter' and 'join'; others are Hive-specific, e.g. the reduceSink operator, which occurs at the map-reduce boundary

35 Query Compiler - cntd. ■Optimizer: – Contains a chain of transformations that rewrite the plan for better performance – Walks the operator DAG and performs processing actions when certain rules or conditions are satisfied – Five main interfaces are involved in the walk: Node, GraphWalker, Dispatcher, Rule and Processor – A typical transformation involves: – Walking the DAG and, for every Node visited, checking whether a Rule is satisfied – If so, the Dispatcher (which does the rule matching) invokes the corresponding Processor for that rule (a rough sketch of the idea follows)
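The following is only a loose Python illustration of that walk-and-dispatch pattern; Hive's actual Node, GraphWalker, Dispatcher, Rule and Processor interfaces are Java and considerably richer than this sketch.

# Illustrative sketch of a rule-driven DAG walk (not Hive's Java API).
# Each rule pairs a predicate on the visited operator with a processor
# that rewrites the plan when the rule matches.

class Rule:
    def __init__(self, matches, processor):
        self.matches = matches        # fn(node) -> bool
        self.processor = processor    # fn(node) -> None (mutates the plan)

def dispatch(node, rules):
    # Dispatcher: find the first matching rule and run its processor.
    for rule in rules:
        if rule.matches(node):
            rule.processor(node)
            return

def walk(root, rules, visited=None):
    # GraphWalker: depth-first walk over the operator DAG.
    visited = visited if visited is not None else set()
    if id(root) in visited:
        return
    visited.add(id(root))
    dispatch(root, rules)
    for child in getattr(root, "children", []):
        walk(child, rules, visited)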

36 Transformation

37 Query Compiler - cntd. ■Typical Transformations:  Column pruning - only the columns that are needed in the query processing are actually projected out of the row  Predicate pushdown - Predicates are pushed down to the scan if possible so that rows can be filtered early in the processing  Partition pruning - Predicates on partitioned columns are used to prune out files of partitions that do not satisfy the predicate  Map side joins – Small tables in a join are replicated in all the mappers and joined with other tables. Eg: SELECT /*+ MAPJOIN(t2) */ t1.c1, t2.c1 FROM t1 JOIN t2 ON(t1.c2 = t2.c2);  Join reordering – Larger tables are streamed in the reducer and smaller tables are kept in memory

38 More optimizations ■Repartitioning of data to handle skew in GROUP BY processing: – Real-world datasets often have a power-law distribution on the GROUP BY columns – Most of the data might then get sent to a few reducers – Better way: use a two-stage map-reduce! – Stage one: data is randomly distributed to the reducers to compute partial aggregations – Stage two: the partial aggregations are distributed on the GROUP BY columns to the reducers of the second MR stage – Triggered in Hive by setting a parameter (a sketch of the idea follows):
set hive.groupby.skewindata=true;
SELECT t1.c1, sum(t1.c2) FROM t1 GROUP BY t1.c1;
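To make the two stages concrete, here is a small Python illustration of the technique (an illustration only, not Hive code) for summing values per key when one key dominates.

# Illustration of skew-resistant two-stage aggregation (not Hive code).
import random
from collections import defaultdict

rows = [("hot_key", 1)] * 1000 + [("rare_key", 1)] * 3   # skewed input

# Stage 1: spread rows over reducers at random and pre-aggregate, so no
# single reducer receives all rows for the hot key.
NUM_REDUCERS = 4
stage1 = [defaultdict(int) for _ in range(NUM_REDUCERS)]
for key, value in rows:
    stage1[random.randrange(NUM_REDUCERS)][key] += value

# Stage 2: partition the (much smaller) partial sums by the real GROUP BY
# key and merge them into final totals.
final = defaultdict(int)
for partials in stage1:
    for key, partial_sum in partials.items():
        final[key] += partial_sum

print(dict(final))   # {'hot_key': 1000, 'rare_key': 3}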

39 More optimizations ■Hash-based partial aggregations in the mappers: – Hive does hash-based partial aggregation within the mappers to reduce the amount of data sent to the reducers – This reduces the time spent sorting and merging data and gives a performance gain – Controlled by the parameter hive.map.aggr.hash.percentmemory, which specifies the fraction of mapper memory that can be used to hold the hash table, e.g. 0.5 – As soon as the hash table exceeds this limit, the partial aggregations are sent to the reducers (see the sketch below)
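A rough illustration (not Hive's implementation) of the map-side behaviour: partial sums live in a hash table that is flushed downstream once it grows past its budget. The entry-count limit here stands in for the memory fraction Hive actually tracks.

# Illustration of map-side hash aggregation with a size-bounded flush.
from collections import defaultdict

MAX_ENTRIES = 10000   # stand-in for the hive.map.aggr.hash.percentmemory budget

partials = defaultdict(int)

def emit_to_reducer(key, value):
    # Placeholder for handing a partial aggregate to the shuffle.
    print("%s\t%d" % (key, value))

def flush():
    # Send the current partial aggregates downstream and start a fresh table.
    for key, value in partials.items():
        emit_to_reducer(key, value)
    partials.clear()

def map_row(key, value):
    partials[key] += value
    if len(partials) >= MAX_ENTRIES:   # "memory" limit reached
        flush()

# The mapper calls map_row() for every input row and flush() once at the end.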

40 Query compiler – cntd. ■Physical Plan Generator: – The logical plan, after optimization, is split into multiple map/reduce and HDFS tasks ■Example – a multi-table insert query:
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION(ds='2009-03-20')
  SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION(ds='2009-03-20')
  SELECT subq1.school, COUNT(1) GROUP BY subq1.school

41 Query Plan

42 ■Nodes in the plan are physical operators – serializable objects ■Edges represent the flow of data between operators ■Example: the ReduceSinkOperator occurs at the map-reduce boundary; its descriptor contains the reduction keys, which are used as the reduction keys at that map-reduce boundary ■A few other operators: TableScanOperator, JoinOperator, SelectOperator, FileSinkOperator, FilterOperator, etc. ■The query plan is serialized and written to a file

43 Execution Engine ■The tasks are executed in the order of their dependencies ■A map/reduce task first serializes its part of the plan into a plan.xml file ■This file is added to the job cache for the task, and ExecMapper and ExecReducer instances are spawned using Hadoop ■Each of these classes executes the relevant parts of the DAG ■Final results are stored in a temporary location ■At the end of the entire query, the final data is either moved to the desired location or fetched from the temporary location

44 Hive Usage ■Types of applications:  Summarization / Log processing  Eg: Daily/Weekly aggregations of click counts, etc.  Ad hoc analysis  Reporting dashboards  Data/Text mining  Business intelligence  Advertising delivery  Spam detection

45 Benchmark results Fig 1. Hadoop, Hive and PIG query times with LZO compression of intermediate map output data

46 Benchmark results Fig 2. Hadoop, Hive and PIG query times without compression of intermediate map output data

47 Conclusion ■Pros – Easy way to process large-scale data – SQL-like queries, familiar to many business users – Programmability – Provides many user-defined interfaces for extension – Interoperability with other databases ■Cons – Supports only a subset of SQL – Files in HDFS are immutable – No easy way to append data

48 Related work ■Google: Sawzall ■Yahoo: Pig ■Parallel databases: Gamma, Bubba, Volcano ■IBM: JAQL ■Microsoft: Scope, DryadLINQ

49 References ■[1] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., and Murthy, R. 2010. Hive – a petabyte scale data warehouse using Hadoop. In Proceedings of the International Conference on Data Engineering. 996–1005. ■[2] https://cwiki.apache.org/confluence/display/Hive/Design ■[3] https://www.tutorialspoint.com/hadoop ■[4] http://www.slideshare.net/Skillspeed/big-data-analytics-using-hive ■[5] Hive wiki at http://www.apache.org/hadoop/hive ■[6] Hive Performance Benchmark. https://issues.apache.org/jira/secure/attachment/12413737/hive_benchmark_2009-07-12.pdf ■[7] https://www.safaribooksonline.com/library/view/programming-hive/9781449326944/ch01.html

50 Q & A ?

51 Thank you!

