
Hive

What is Hive?
- Data warehousing layer on top of Hadoop – table abstractions
- SQL-like language (HiveQL) for "batch" data processing
- SQL is translated into one or a series of MapReduce executions
- Good for ad-hoc reporting queries on HDFS data – however, the generated MapReduce executions can be suboptimal

What it is not…
- Not a relational database
  – no transactions
  – no row updates
  – no indexes
- No interactive querying
- No auditing
- No accounts

Hive overview
(architecture diagram) The Hive Engine (compiler, optimizer, executor) looks up metadata (table definitions, data locations) in the metastore, reads data directly from HDFS (Hadoop Distributed File System), and runs data processing as MapReduce jobs through YARN (the cluster resource manager).

Metastore
- Contains table definitions and data locations
- Stored in an additional RDBMS
  – very often on one of the cluster machines
  – MySQL, PostgreSQL, Oracle, Derby…
- Used by many other Hadoop components
  – via the HCatalog service

Hive Table
- Definitions are stored in the Hive metastore (RDBMS)
- Data are stored in files on HDFS
- Table definition and data are not tightly coupled
- Data files can be in various formats
  – but the format has to be unique within a single table
- Table partitioning and bucketing are supported
- EXTERNAL tables (DMLs are not possible)
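As an illustration of this decoupling (a minimal sketch – the table name, columns and HDFS path are made up, not from the slides), an EXTERNAL table simply points Hive at files that already exist on HDFS:

  -- hypothetical table over pre-existing delimited files
  CREATE EXTERNAL TABLE web_logs (
    ts      STRING,
    user_id INT,
    url     STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
  STORED AS TEXTFILE
  LOCATION '/data/web_logs';   -- dropping the table leaves the files in place

Dropping an external table removes only the metadata; the data files stay on HDFS, which is why DMLs on such tables are restricted.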

Interacting with Hive
- Remotely
  – JDBC and ODBC drivers
  – Thrift API
- Locally
  – hive shell (deprecated)
  – beeline (via JDBC, HiveServer2 support)

Hive Architecture
(architecture diagram) Clients connect via JDBC, ODBC or the Thrift API to HiveServer2 (a Thrift service) running on a Hadoop cluster machine. The Hive Driver handles compilation, optimization and execution of the SQL. The Metastore Server (also a Thrift service) uses a DB driver to look up metadata in the RDBMS (table definitions, data locations). Execution requests go from the master node to YARN, which runs MapReduce tasks on the cluster nodes, where they read data from HDFS.

Operations
- Based on the SQL-92 specification
- DDL – CREATE TABLE, ALTER TABLE, DROP TABLE…
- DML – INSERT, INSERT OVERWRITE…
- SQL – SELECT… DISTINCT… JOIN… WHERE… GROUP BY… HAVING… ORDER BY… LIMIT
  – REGEXP supported
  – Subqueries only in the FROM clause
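A few hedged examples of what such statements look like in HiveQL (table and column names are invented for illustration):

  -- DDL
  CREATE TABLE visits (day STRING, hits INT);
  ALTER TABLE visits ADD COLUMNS (country STRING);

  -- DML (append-oriented: no single-row UPDATE or DELETE)
  INSERT OVERWRITE TABLE visits
  SELECT day, count(*), country FROM raw_logs GROUP BY day, country;

  -- Query
  SELECT country, sum(hits) AS total
  FROM visits
  WHERE day >= '2014-01-01'
  GROUP BY country
  HAVING sum(hits) > 100
  ORDER BY total DESC
  LIMIT 10;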

Data Types
- TINYINT – 1 byte
- SMALLINT – 2 bytes
- INT – 4 bytes
- BIGINT – 8 bytes
- BOOLEAN
- DOUBLE
- STRING
- STRUCT – named fields
- MAP – collection of key-value pairs
- ARRAY – ordered collection of elements of the same type
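A hedged sketch of a table using the complex types (all names are illustrative only):

  CREATE TABLE employees (
    name    STRING,
    salary  DOUBLE,
    skills  ARRAY<STRING>,                                 -- ordered collection
    phones  MAP<STRING, STRING>,                           -- key-value pairs
    address STRUCT<street:STRING, city:STRING, zip:INT>    -- named fields
  );

  -- array elements and map keys are accessed with [], struct fields with dot notation
  SELECT name, skills[0], phones['office'], address.city FROM employees;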

Other features
- Views
- Built-in functions
  – floor, rand, cast, case, if, concat, substr, etc.
  – listed by 'show functions'
- User-defined functions (UDFs)
  – have to be delivered in a jar
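For example (a minimal sketch reusing the hypothetical web_logs table from above), a view combined with built-in functions:

  CREATE VIEW daily_visits AS
  SELECT substr(ts, 1, 10) AS day,    -- built-in string function
         count(*)          AS hits
  FROM web_logs
  GROUP BY substr(ts, 1, 10);

  SELECT day, floor(hits / 1000) AS k_hits FROM daily_visits;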

MR hands on (1)
- The problem
  – Q: "What follows two rainy days in the Geneva region?"
  – A: "Monday"
- The goal
  – Prove whether the theory is true or false
- Solution
  – Let's take meteo data from GVA and build a histogram of bad-weather days per day of the week:

    Mon | Tue | Wed | Thu | Fri | Sat | Sun
    Bad weather days count:  ?  |  ?  |  ?  |  ?  |  ?  |  ?  |  ?

MR hands on (2)
- The source data (…)
  – Source: last 3 years of weather data taken at GVA airport
  – CSV format
- What is a bad weather day?
  – Weather anomalies between 8am and 10pm

  "Local time in Geneva (airport)";"T";"P0";"P";"U";"DD";"Ff";"ff10";"WW";"W'W'";"c";"VV";"Td";
  " :50";"18.0";"730.4";"767.3";"100";"variable wind direction";"2";"";"";"";"No Significant Clouds";"10.0 and more";"18.0";
  " :20";"18.0";"730.4";"767.3";"94";"variable wind direction";"1";"";"";"";"Few clouds (10-30%) 300 m, scattered clouds (40-50%) 3300 m";"10.0 and more";"17.0";
  " :50";"19.0";"730.5";"767.3";"88";"Wind blowing from the west";"2";"";"";"";"Few clouds (10-30%) 300 m, broken clouds (60-90%) 5400 m";"10.0 and more";"17.0";
  " :20";"19.0";"729.9";"766.6";"83";"Wind blowing from the south-east";"4";"";"";"";"Few clouds (10-30%) 300 m, scattered clouds (40-50%) 2400 m, overcast (100%) 4500 m";"10.0 and more";"16.0";
  " :50";"19.0";"729.9";"766.6";"94";"Wind blowing from the east-northeast";"5";"";"Light shower(s), rain";"";"Few clouds (10-30%) 1800 m, scattered clouds (40-50%) 2400 m, broken clouds (60-90%) 3000 m";"10.0 and more";"18.0";
  " :20";"20.0";"730.7";"767.3";"88";"Wind blowing from the north-west";"2";"";"Light shower(s), rain, in the vicinity thunderstorm";"";"Few clouds (10-30%) 1800 m, cumulonimbus clouds, broken clouds (60-90%) 2400 m";"10.0 and more";"18.0";
  " :50";"22.0";"730.2";"766.6";"73";"Wind blowing from the south";"7";"";"Thunderstorm";"";"Few clouds (10-30%) 1800 m, cumulonimbus clouds, scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3000 m";"10.0 and more";"17.0";
  " :20";"23.0";"729.6";"765.8";"78";"Wind blowing from the west-southwest";"4";"";"Light shower(s), rain, in the vicinity thunderstorm";"";"Few clouds (10-30%) 1740 m, cumulonimbus clouds, scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3000 m";"10.0 and more";"19.0";
  " :50";"23.0";"728.8";"765.0";"65";"variable wind direction";"2";"";"In the vicinity thunderstorm";"";"Scattered clouds (40-50%) 1950 m, cumulonimbus clouds, scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3300 m";"10.0 and more";"16.0";
  " :20";"23.0";"728.2";"764.3";"74";"Wind blowing from the west-northwest";"4";"";"Light thunderstorm, rain";"";"Scattered clouds (40-50%) 1950 m, cumulonimbus clouds, scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 3300 m";"10.0 and more";"18.0";
  " :50";"28.0";"728.0";"763.5";"45";"Wind blowing from the south-west";"5";"11";"Thunderstorm";"";"Scattered clouds (40-50%) 1950 m, cumulonimbus clouds, scattered clouds (40-50%) 2100 m, broken clouds (60-90%) 6300 m";"10.0 and more";"15.0";
  " :20";"28.0";"728.0";"763.5";"42";"Wind blowing from the north-northeast";"2";"";"In the vicinity thunderstorm";"";"Few clouds (10-30%) 1950 m, cumulonimbus clouds, broken clouds (60-90%) 6300 m";"10.0 and more";"14.0";
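A hedged sketch of how such a CSV could be exposed as a Hive table (column names are guesses based on the header row above; this is not necessarily the external.sql used in the hands-on):

  CREATE EXTERNAL TABLE geneva_weather (
    time STRING, t STRING, p0 STRING, p STRING, u STRING, dd STRING, ff STRING,
    ff10 STRING, ww STRING, ww2 STRING, c STRING, vv STRING, td STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
  LOCATION '/user/demo/gva_weather';   -- hypothetical HDFS path holding the CSV files

Note that with a plain delimited format the surrounding double quotes stay inside the values; that is why the hands-on also builds a "clean" table without the quote characters (geneva_clean).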

MR hands on (3)
Designing the MapReduce flow (data-flow diagram):
- Input data: the CSV rows (" :50";"18.0";…)
- Map (1st job): filter records with 8 < HH < 22 and a non-empty weather column (col9 != ""), emitting one record per remaining row keyed by the date
- Reduce (1st job): sum the values by key – intermediate output: a per-date count of bad-weather observations
- Map (2nd job): transform each date into a day of the week and emit it
- Reduce (2nd job): sum the values by key – final result: a count of bad-weather days per day of the week

SQL query for the problem

  select from_unixtime(unix_timestamp(a.day,'yyyy-MM-dd'),'u'), count(*)
  from (
    select from_unixtime(unix_timestamp(time,'dd.MM.yyyy HH:mm'),'yyyy-MM-dd') as day,
           count(*) as w
    from data
    where weather != ''
      and hour(from_unixtime(unix_timestamp(time,'dd.MM.yyyy HH:mm'))) between 8 and 22
    group by from_unixtime(unix_timestamp(time,'dd.MM.yyyy HH:mm'),'yyyy-MM-dd')
    having count(*) >= 1
  ) a
  group by from_unixtime(unix_timestamp(a.day,'yyyy-MM-dd'),'u');

Using Hive CLIs
- Starting the hive shell (deprecated):
  > hive
- Use beeline instead (supports the new HiveServer2):
  > beeline
  – Connection in remote mode:
    !connect jdbc:hive2://localhost:10000/default
  – Connection in embedded mode:
    !connect jdbc:hive2://

Useful Hive commands
- Get all databases: show databases;
- Set a default database: use <database>;
- Show tables in a database: show tables;
- Show a table definition: desc <table>;
- Explain a plan: EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] <query>;

Hands on Hive (1)
- All scripts are available with:
  wget …
  unzip hive.zip; cd hive
- To execute a script in beeline:
  !run <script>.sql
- Querying a table – query_.sql
- Creation of an external table (name=geneva) – external.sql
- Creation of an external table without the " characters (name=geneva_clean) – external_clean.sql
- Creation of a local table (name=weather) – standard.sql

Hands-on Hive (2)
- Creation of a partitioned table (name=weather_part) – partitioned.sql
- Creation of a partitioned and bucketed table (name=weather_buck) – bucketing.sql
- Creation of a table stored in the Parquet format (name=weather_parquet) – parquet.sql
- Creation of a compressed table (name=weather_compr) – compressed.sql
- Explain plan – explain.sql
- Table statistics – stats.sql
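As a rough illustration of what the partitioning and Parquet scripts might contain (a sketch under assumed column names, not the actual scripts):

  -- partitioned by year, so queries restricted to one year scan fewer files
  CREATE TABLE weather_part (time STRING, temp DOUBLE, weather STRING)
  PARTITIONED BY (year INT);

  INSERT OVERWRITE TABLE weather_part PARTITION (year=2013)
  SELECT time, temp, weather FROM weather WHERE time LIKE '%2013%';

  -- the same data stored in the columnar Parquet format
  CREATE TABLE weather_parquet STORED AS PARQUET AS SELECT * FROM weather;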

Hands on Hive (JDBC)
- Compile the code:
  javac HiveJdbcClient.java
- Run
  – set the classpath:
    source ./setHiveEnv.sh
  – execute:
    java -cp $CLASSPATH HiveJdbcClient

This talk does not include…
- SerDe
  – Serializer and Deserializer of data
  – predefined ones exist for common data formats
  – a custom SerDe can be written by the user
- Writing UDFs
- Querying HBase

Summary
- Provides a table abstraction layer over HDFS data
- SQL on Hadoop, translated into MapReduce jobs
- The Hive Query Language has some limitations
- For batch processing, not interactive use
- Can append data
  – but no row updates or deletions
- Uses an external metastore for keeping metadata
  – this store is used by most SQL-on-Hadoop technologies