Hive on steroid Project stinger.

Hive on steroid Project stinger

Who Am I? Olivier Renault Hortonworks Solution engineer for EMEA
Join Hortonworks EMEA in Jan 2013 Eucalyptus – Open source Cloud solution Red Hat – Solution engineer

What’s Hive ? Use HiveQL Hive translate SQL query into MapReduce job using De facto SQL interface in Hadoop Entry point for most BI tools ODBC HCatalog merge with Hive Metadata server Hive is able to query Pb of data

Hive: Strength Through Community
Dozens of Vendors integrate with Hive Teradata Microsoft Microstrategy Tableau Karmasphere Datameer Information Builders SAP Oracle Actuate QlikView SAS arcplan Pentaho Jaspersoft Tibco Talend Informatica … Loyal Open Source Community and Real Corporate Interest/Contributions Facebook Teradata SAP Intel Microsoft Huawei Yahoo … Open Source End Users Vendors Countless Enterprises Use Hive as the defacto SQL interface to Hadoop data

Problem : Hive was slow …
Hive is able to interact with visualization tools but you needed to be patient … February 2013, Hortonworks launch Stinger initiative. The aim is to improve Hive performance by 100x Bringing Hive in the interactive query world

= 100X + + Stinger Initiative Community initiative around Hive
Enables Hive to support interactive workloads Enhances Hive’s standard SQL interface for Hadoop Improves existing tools & preserves investments Execution Engine Tez Query Planner Hive File Format ORC file + + = 100X Buzz about low latency access in Hadoop

Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative A broad, community-based effort to drive the next generation of HIVE Stinger Project (announced February 2013) Delivered Hive 0.11 (HDP 1.3) Phase One Base Optimizations SQL Analytic Functions ORCFile, Modern File Format Goals: Speed Improve Hive query performance by 100X to allow for interactive query times (seconds) Delivered Hive 0.12 (HDP 2.0) Phase Two VARCHAR, DATE Types ORCFile predicate pushdown Advanced Optimizations Performance Boosts via YARN Scale The only SQL interface to Hadoop designed for queries that scale from TB to PB Phase Three Hive on Apache Tez Query Service Buffer Cache Cost Based Optimizer (Optiq) Vectorized Processing SQL Support broadest range of SQL semantics for analytic applications running against Hadoop Coming Soon …all IN Hadoop

Hive : Base optimization
New dags, analytics tools, ..

Hive Advanced Analytics
Add OVER clause to support windowing queries With standard arguments Ranking functions rank, ntile, row_number, dense_rank With analytics functions: cume_dist, first_value, lag, last_value, lead, percentile_cont, percentile_disc, percent_rank Add CUBE and ROLLUP Easily create summaries of your data Extend aggregation functions STDDEV, VAR

Hive Data Type Conformance
Extend Hive to support additional types from SQL Improves applications and interoperability between tools Specific additions Add fixed point NUMERIC and DECIMAL type (in progress) Add VARCHAR and CHAR types with limited field size Add DATETIME Add size ranges from 1 to 53 for FLOAT Add synonyms for compatibility BLOB for BINARY TEXT for STRING REAL for FLOAT

SQL: Enhancing SQL Semantics
SQL Compliance Hive 12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop Hive SQL Datatypes Hive SQL Semantics INT SELECT, INSERT TINYINT/SMALLINT/BIGINT GROUP BY, ORDER BY, SORT BY BOOLEAN JOIN on explicit join key FLOAT Inner, outer, cross and semi joins DOUBLE Sub-queries in FROM clause STRING ROLLUP and CUBE TIMESTAMP UNION BINARY Windowing Functions (OVER, RANK, etc) DECIMAL Custom Java UDFs ARRAY, MAP, STRUCT, UNION Standard Aggregation (SUM, AVG, etc.) DATE Advanced UDFs (ngram, Xpath, URL) VARCHAR Sub-queries in WHERE, HAVING CHAR Expanded JOIN Syntax SQL Compliant Security (GRANT, etc.) INSERT/UPDATE/DELETE (ACID) With Hive and Stinger we are focused on enabling the SQL ecosystem and to do that we’ve put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases. Available Hive 0.12 Roadmap

Example Benchmark Spec
The TPC-DS benchmark data+query set Query 27 big table(store_sales) joins lots of small tables A.K.A Star Schema Join What does Query 27 do? For all items sold in stores located in specified states during a given year, find the average quantity, average list price, average list sales price, average coupon amount for a given gender, marital status, education and customer demographic..

Query 27 - Star Schema Join
Derived from TPC-DS Query 27 SELECT col5, avg(col6) FROM store_sales_fact ssf join item_dim on (ssf.col1 = item_dim .col1) join date_dim on (ssf.col2 = date_dim.col2 join custdmgrphcs_dim on (ssf.col3 =custdmgrphcs_dim.col3) join store_dim on (ssf.col4 = store_dim.col4) GROUP BY col5 ORDER BY col5 LIMIT 100; 41 GB 58 MB 11MB 80MB 106 KB

New Query Planner

Query27 Execution Before Hive 11-Text Format
The intermediate output of each job is written to HDFS Query spawned 5 MR Jobs 179 total mappers got executed Query Response Time

Query27 Execution With Hive 11-Text Format
Job 1 of 1 – Each Mapper loads into memory the 4 small dimension tables and streams parts of the large fact table. Joins then occur in Mapper hence the name MapJoin Query spawned of 1 job with Hive 11 compared to 5 MR Jobs with Hive 10 Increase in performance with Hive 11 as query time went down from 21 minutes to about 4 minutes

Query27 Execution With Hive 11- RC Format
Conversion from Text to RC file format decreased size of dimension data set from 38 GB to 8.21 GB Smaller file equates to less IO causing the query time to decrease from 246 seconds to 136 seconds

Query27 Execution With Hive 11- ORC Format
ORC File type consolidates data more tighly than RCFile as the size of dataset decreased from 8.21 GB to 2.83 GB Smaller file equates to less IO causing the query time to decrease from 136 seconds to 104 seconds

Summary of Results File Type Number of MR Jobs Input Size Mappers Time
Text/Hive 10 5 43.1 GB 179 1260 Seconds Text/Hive 11 1 38 GB 151 246 seconds RC/Hive 11 8.21 GB 76 136 seconds ORC/Hive 11 2.83 GB 38 104 seconds RC/Hive 11/Partitioned/Bucketed 1.73 GB 19 ORC/Hive 11/Partitioned/Bucketed 687 MB 27 79.62

ORC file format Optimized RC File

ORCFile - Optimized Column Storage
Make a better columnar storage file Evolve based on Google Dremel format Decompose complex row types into primitive fields Better compression and projection Only read bytes from HDFS for the required columns. Store column level aggregates in the files Only need to read the file meta information for common queries Stored both for file and each section of a file Aggregates: min, max, sum, average, count Allows fast access by sorted columns Ability to add bloom filters for columns Enables quick checks for whether a value is present

ORCFile - File Layout

Interactive Query at Scale
Sustained Query Times Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale Smaller Footprint Better encoding with ORC in Apache Hive 0.12 reduces resource requirements for your cluster File Size Comparison Across Encoding Methods Dataset: TPC-DS Scale 500 Dataset 585 GB (Original Size) 505 GB (14% Smaller) 221 GB (62% Smaller) Impala 131 GB (78% Smaller) Hive 12 Larger Block Sizes Columnar format arranges columns adjacent within the file for compression & fast access Encoded with Text Encoded with RCFile Encoded with Parquet Encoded with ORCFile

A New Hadoop Data Processing Framework
Apache Tez A New Hadoop Data Processing Framework

Moving Hadoop Beyond MapReduce
Low level data-processing execution engine Built on YARN Enables pipelining of jobs Removes task and job launch times Does not write intermediate output to HDFS Much lighter disk and network usage New base of MapReduce, Hive, Pig, Cascading etc. Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline

FastQuery: Beyond Batch with YARN
Tez Generalizes Map-Reduce Simplified execution plans process data more efficiently Always-On Tez Service Low latency processing for all Hadoop data processing

Apache Tez as the new Primitive
Apache Tez as Base MapReduce as Base HDFS2 (redundant, reliable storage) YARN (cluster resource management) Tez (execution engine) HADOOP 2.0 Data Flow Pig SQL Hive Others (cascading) Batch MapReduce Real Time Stream Processing Storm Online Data HBase, Accumulo HADOOP 1.0 Pig (data flow) Hive (sql) Others (cascading) MapReduce (cluster resource management & data processing) HDFS (redundant, reliable storage)

Hive-on-MR vs. Hive-on-Tez
SELECT a.x, AVERAGE(b.y) AS avg FROM a JOIN b ON (a.id = b.id) GROUP BY a UNION SELECT x, AVERAGE(y) AS AVG FROM c GROUP BY x ORDER BY AVG; Tez avoids unneeded writes to HDFS Hive – MR Hive – Tez M M M M M M SELECT a.state, c.itemId SELECT a.state SELECT b.id SELECT b.id R R R R M M HDFS M M M M R JOIN (a, c) SELECT c.price R JOIN (a, c) R HDFS R HDFS JOIN(a, b) GROUP BY a.state COUNT(*) AVERAGE(c.price) M M JOIN(a, b) GROUP BY a.state COUNT(*) AVERAGE(c.price) R R

Speed: Interactive Query In Hadoop
Test Cluster: 200 GB Data (ORCFile) 20 Nodes, 24GB RAM each, 6x disk each Query 27: Pricing Analytics using Star Schema Join Query 82: Inventory Analytics Joining 2 Large Fact Tables 1400s 39s 7.2s TPC-DS Query 27 3200s 65s 14.9s TPC-DS Query 82 190x Improvement 200x Improvement Hive 10 Trunk (Phase 3) Hive 0.11 (Phase 1) All Results at Scale Factor 200 (Approximately 200GB Data)

Hortonworks …the Bull Elephant of Hadoop Innovation
There is NO second place Hortonworks …the Bull Elephant of Hadoop Innovation

hortonworks.com hortonworks.com/sandbox
Thank You hortonworks.com hortonworks.com/sandbox

Hive on steroid Project stinger.

Similar presentations

Presentation on theme: "Hive on steroid Project stinger."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Hive on steroid Project stinger.

Similar presentations

Presentation on theme: "Hive on steroid Project stinger."— Presentation transcript:

Similar presentations

About project

Feedback