Published by Eileen Foster. Modified over 9 years ago.

Penwell Debug Intel Confidential
BRIEF OVERVIEW OF HIVE
Jonathan Brauer
ESE 380L
Feb 2014

Overview
Hive is a massively parallel data warehousing environment.
Hive provides a SQL-like programming environment for Hadoop.
  – Hadoop is becoming common in "Big Data" houses.
Hadoop makes it relatively easy to implement MapReduce jobs quickly, but writing jobs often requires plug-ins or APIs.
  – Engineers who are familiar with SQL but not with MapReduce may be more productive with SQL.
Hive queries are executed as MapReduce operations.

Background on Hadoop
What is Hadoop?
  – An open source implementation of a MapReduce environment
  – A distributed filesystem for storing data: the Hadoop Distributed File System (HDFS)
  – Multiple copies of the data are kept
  – Very large files can be handled
  – Files are broken up into blocks, commonly 128 MB
MapReduce consists of a Map function and a Reduce function.
  – Map functions are applied to all data
  – Reduce functions collate the map output
  – An analogy in SQL: Map does a SELECT on rows, and the Reducer could SORT the output
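
The Map and Reduce roles described above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop's API; the sample records and the names map_fn, reduce_fn, and run_mapreduce are hypothetical. The map step reads each row and emits key/value pairs (much like a SELECT), and the reduce step collates everything emitted for one key.

```python
from collections import defaultdict

# Hypothetical input records: (user_id, page) pairs.
records = [(1, "home"), (2, "search"), (1, "cart"), (3, "home"), (2, "home")]

def map_fn(record):
    """Emit (key, value) pairs for one input record, like a SELECT on a row."""
    user_id, page = record
    yield page, 1

def reduce_fn(key, values):
    """Collate all values that the map step emitted for one key."""
    return key, sum(values)

def run_mapreduce(data, mapper, reducer):
    # Shuffle: group the mapper output by key.
    groups = defaultdict(list)
    for record in data:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce each key group; keys are visited in sorted order.
    return [reducer(key, groups[key]) for key in sorted(groups)]

page_counts = run_mapreduce(records, map_fn, reduce_fn)
print(page_counts)
```

In a real cluster the map calls run in parallel on different blocks of the file and the grouped keys are spread across several reducers; the sketch keeps everything in one process so the data flow is easy to follow.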

Advantages
Hive allows developers with a SQL background to ramp up rapidly and write Hive queries.
Open source Apache project.
Hive is compatible with other MapReduce operations in the same infrastructure: some groups can use Hive and others native MapReduce.
Can share tables with HBase.
Hive has built-in functions for reducing data, such as sampling:
  – Block sampling
  – Bucket sampling
  – Deterministic sampling
  – Non-deterministic sampling
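
As a rough illustration of the idea behind deterministic bucket sampling, the following plain-Python sketch keeps only the rows whose key lands in one chosen bucket. It is not Hive's implementation: the helper bucket_sample and the sample rows are hypothetical, and a simple modulo on an integer key stands in for whatever hash function Hive applies.

```python
def bucket_sample(rows, key, bucket, num_buckets):
    """Keep rows whose key falls into the chosen bucket (1-based bucket number)."""
    # Assumption: integer keys, with modulo standing in for a real hash function.
    return [row for row in rows if key(row) % num_buckets == bucket - 1]

# Hypothetical table of 20 rows.
rows = [{"userid": i, "text": f"row {i}"} for i in range(1, 21)]

# Take bucket 1 out of 4, i.e. roughly a quarter of the rows.
sample = bucket_sample(rows, key=lambda r: r["userid"], bucket=1, num_buckets=4)
sampled_ids = [r["userid"] for r in sample]
print(sampled_ids)
```

Because the choice depends only on the key, running the sample again returns the same rows. A non-deterministic sample would instead assign rows to buckets using a random value.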

Disadvantages
Not suited to real-time use unless the data is very small (in which case, why are you using Hadoop?).
Row updates are not generally allowed.
Hive queries can be very time consuming.
  – As with an RDBMS, some experience and knowledge of writing efficient queries is necessary in Hive
Hive features require extending and modifying SQL operations, and some SQL operations behave differently.
  – SORT BY vs. ORDER BY (local vs. global reducer behavior)
Large data sizes make some queries impossible to finish in a meaningful time because of the limits of individual system resources (an ORDER BY on all columns in a petabyte-scale search is a bad idea).
Queries are still I/O bound.
Hive optimizations are still ongoing.
Consider using Hadoop natively, HBase (fast, row edits), or Pig (transforms).
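
The SORT BY vs. ORDER BY distinction noted above (local vs. global reducer behavior) can be pictured with a small plain-Python sketch. It only simulates the behavior and is not Hive code; the two-reducer partitioning and the helper names sort_by and order_by are assumptions made for illustration.

```python
# Hypothetical rows that have already been partitioned across two reducers.
reducer_inputs = [[7, 2, 9], [4, 8, 1]]

def sort_by(partitions):
    """SORT BY analogue: each reducer sorts only its own rows."""
    return [sorted(part) for part in partitions]

def order_by(partitions):
    """ORDER BY analogue: all rows pass through one reducer for a total order."""
    return sorted(row for part in partitions for row in part)

local_result = sort_by(reducer_inputs)
global_result = order_by(reducer_inputs)

# Concatenating the per-reducer outputs does not give a globally sorted list.
concatenated = [row for part in local_result for row in part]
print(local_result, global_result, concatenated)
```

The global ordering is the expensive case: every row has to funnel through a single reducer, which is why an ORDER BY over a very large table can take so long.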

Example
SELECT a.userid, b.text
FROM users TABLESAMPLE(1 PERCENT) a
JOIN data b
  ON a.dat = '2012-03-15'
  AND b.dat = '2012-03-15'
  AND a.userid = b.id

Questions?