Software and Services Group “Project Panthera”: Better Analytics with SQL, MapReduce and HBase Jason Dai Principal Engineer Intel SSG (Software and Services Group)

Presentation transcript:

Software and Services Group
“Project Panthera”: Better Analytics with SQL, MapReduce and HBase
Jason Dai, Principal Engineer, Intel SSG (Software and Services Group)

2 My Background and Bias
Years of development on parallel compilers
Lead architect of Intel network processor compiler
  – Auto-partitioning & parallelizing for many-core many-thread (128 HW year 2002) CPU
Currently Principal Engineer in Intel SSG
Leading the open source Hadoop engineering team
  – HiBench, HiTune, “Project Panthera”, etc.
[Image: Intel IXP2800]

3 Agenda
Overview of “Project Panthera”
Analytical SQL engine for MapReduce
Document store for better query processing on HBase
Summary

4 Project Panthera
Our open source efforts to enable better analytics capabilities on Hadoop/HBase
  Better integration with existing infrastructure using SQL
  Better query processing on HBase
  Efficiently utilizing new HW platform technologies
  Etc.

5 Current Work under Project Panthera
An analytical SQL engine for MapReduce
  Built on top of Hive
  Provide full SQL support for OLAP
A document store for better query processing on HBase
  A coprocessor application for HBase
  Provide document semantics & significantly speed up query processing

6 Agenda
Overview of “Project Panthera”
Analytical SQL engine for MapReduce
Document store for better query processing on HBase
Summary

7 Full SQL Support for Hadoop Needed
Full SQL support for OLAP
  Required in modern business application environment
    – Business users
    – Enterprise analytics applications
    – Third-party tools (such as query builders and BI applications)
Hive – THE Data Warehouse for Hadoop
  HiveQL: a SQL-like query language (subset of SQL with extensions)
    – Significantly lowers the barrier to MapReduce
  Still large gaps w.r.t. full analytic SQL support
    – Multiple-table SELECT statement, subquery in WHERE clauses, etc.

8 An analytical SQL engine for MapReduce
The anatomy of a query processing engine
[Diagram, labels as recovered from the slide]
  Generic engine: Query, Parser, AST (Abstract Syntax Tree), Semantic Analyzer (Optimizer), Execution Plan, Execution
  Hive: HiveQL query, Driver, Hive Parser, Hive-AST, Hive Semantic Analyzer, Hadoop MR
  Our SQL engine for MapReduce: SQL query, SQL Parser*, SQL-AST, SQL-AST Analyzer & Translator (Multi-Table SELECT, Subquery Unnesting, …), Hive-AST, Hive Semantic Analyzer (INTERSECT Support, MINUS Support, …), Hadoop MR
  * (Open Source)

9 Current Status
Enable complex SQL queries (not supported by Hive today), such as:
Subquery in WHERE clauses (using ALL, ANY, IN, EXISTS, SOME keywords)
    select * from t1 where t1.d > ALL (select z from t2 where t2.z != 9);
Correlated subquery (i.e., a subquery referring to a column of a table not in its FROM clause)
    select * from t1 where exists (select * from t2 where t1.b = t2.y);
Scalar subquery (i.e., a subquery that returns exactly one column value from one row)
    select a, b, c, d, e, (select z from t2 where t2.y = t1.b and z != 99) from t1;
Top-level subquery
    (select * from t1) union all (select * from t2) union all (select * from t3 order by 1);
Multiple-table SELECT statement
    select * from t1, t2 where t1.c > t2.z;
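The correlated EXISTS query on this slide can be evaluated without re-running the subquery for every outer row by treating it as a semi-join, which is the general idea behind subquery unnesting. The following is a minimal Java sketch of that equivalence on in-memory columns. It is not the engine's translator: the class and method names (`ExistsUnnesting`, `naiveExists`, `semiJoin`) and the sample data are hypothetical, and NULL handling is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExistsUnnesting {
    // Naive evaluation of: select b from t1 where exists (select * from t2 where t1.b = t2.y)
    // The inner loop plays the role of the subquery and runs once per outer row.
    static List<Integer> naiveExists(List<Integer> t1b, List<Integer> t2y) {
        List<Integer> out = new ArrayList<>();
        for (Integer b : t1b) {
            boolean found = false;
            for (Integer y : t2y) {
                if (b.equals(y)) {
                    found = true;
                    break;
                }
            }
            if (found) {
                out.add(b);
            }
        }
        return out;
    }

    // Unnested evaluation: collect t2.y once, then probe it for each t1 row (a left semi-join).
    // Each outer row appears at most once in the result, however many t2 rows match it.
    static List<Integer> semiJoin(List<Integer> t1b, List<Integer> t2y) {
        Set<Integer> keys = new HashSet<>(t2y);
        List<Integer> out = new ArrayList<>();
        for (Integer b : t1b) {
            if (keys.contains(b)) {
                out.add(b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> t1b = Arrays.asList(1, 2, 2, 5); // hypothetical values of t1.b
        List<Integer> t2y = Arrays.asList(2, 2, 3, 5); // hypothetical values of t2.y
        System.out.println(naiveExists(t1b, t2y));
        System.out.println(semiJoin(t1b, t2y));
    }
}
```

Both calls print `[2, 2, 5]`: the two plans return the same rows, but the semi-join reads t2 once instead of once per t1 row.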

10 Current Status
NIST SQL Test Suite Version
  A widely used SQL-92 conformance test suite
  Ported to run under both Hive and the SQL engine
    – SELECT statements only
    – Run against Hive/SQL engine and an RDBMS to verify the results

Query category                  Ported query#   Hive 0.9              SQL engine
                                from NIST       passed#   pass rate   passed#   pass rate
All queries                     …               …         …%          …         …%
Subquery related queries        87              0         0%          72        82.8%
Multiple-table select queries   31              0         0%          27        87.1%
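The pass rates in the table follow directly from the query counts. A small Java sketch of the arithmetic, with a hypothetical helper name (`passRate`) and rounding to one decimal place assumed:

```java
public class PassRate {
    // Pass rate as a percentage, rounded to one decimal place.
    static double passRate(int passed, int ported) {
        return Math.round(1000.0 * passed / ported) / 10.0;
    }

    public static void main(String[] args) {
        System.out.println(passRate(72, 87)); // subquery related queries, SQL engine
        System.out.println(passRate(27, 31)); // multiple-table select queries, SQL engine
        System.out.println(passRate(0, 87));  // subquery related queries, Hive 0.9
    }
}
```

This yields 82.8, 87.1 and 0.0, matching the recovered rows.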

11 The Path to Full SQL support for OLAP
A SQL compatible parser
  E.g., HIVE-3561
Multiple-table SELECT statement
  E.g., HIVE-3578
Full subquery support & optimizations
  E.g., subquery unnesting (HIVE-3577)
Complete SQL data type system
  E.g., DateTime types and functions (HIVE-1269)
See the umbrella JIRA HIVE-3472

12 Agenda
Overview of “Project Panthera”
Analytical SQL engine for MapReduce
Document store for better query processing on HBase
Summary

13 Query Processing on HBase
Hive (or SQL engine) over HBase
  Store data (Hive table) in HBase
  Query data using HiveQL or SQL
    – Series of MapReduce jobs scanning HBase
Motivations
  Stream new data into HBase in near real time
  Support high update rate workloads (to keep the warehouse always up to date)
  Allow very low latency, online data serving
  Etc.

14 Overheads of Query Processing on HBase
Space overhead
  Fully qualified, multi-dimensional map in HBase vs. relational table
Performance overhead
  Among many reasons
    – Highly concurrent read/write accesses in HBase vs. read-mostly analytical queries

HBase table (one key-value pair per cell):
  (r1, cf1:C1, ts)   v1
  (r1, cf1:C2, ts)   v2
  …
  (r1, cf1:Cn, ts)   vn
  (r2, cf1:C1, ts)   v(n+1)
  …

Relational (Hive) table:
  Row Key   C1       C2       …   Cn
  r1        v1       v2       …   vn
  r2        v(n+1)   v(n+2)   …   v(2n)
  …

2~3x space overhead (an 18-column table)
~6x performance overhead (full 18-column table scan)
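To see where the space overhead comes from, it helps to count bytes per cell. The sketch below assumes the commonly documented HBase KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type, plus the row key, family, qualifier and value themselves). The column sizes used in `main` are hypothetical, and compression, block encoding and file-level indexes are ignored, so the result is illustrative and is not meant to reproduce the measured 2~3x figure.

```java
public class KeyValueOverhead {
    // Fixed bytes per KeyValue under the assumed layout:
    // key length (4) + value length (4) + row length (2) + family length (1)
    // + timestamp (8) + type (1).
    static final int FIXED = 4 + 4 + 2 + 1 + 8 + 1;

    // Size of one cell: the row key, family and qualifier are repeated in every cell.
    static long cellBytes(int rowKeyLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED + rowKeyLen + familyLen + qualifierLen + valueLen;
    }

    // One logical row stored as one cell per column.
    static long rowAsCells(int columns, int rowKeyLen, int familyLen, int qualifierLen, int valueLen) {
        return columns * cellBytes(rowKeyLen, familyLen, qualifierLen, valueLen);
    }

    // The same row as a relational record: the key once, then the values.
    static long rowAsRecord(int columns, int rowKeyLen, int valueLen) {
        return rowKeyLen + (long) columns * valueLen;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: 18 columns, 8-byte row key, 2-byte family,
        // 3-byte qualifier, 8-byte values.
        long cells = rowAsCells(18, 8, 2, 3, 8);
        long record = rowAsRecord(18, 8, 8);
        System.out.println("as cells:  " + cells + " bytes");
        System.out.println("as record: " + record + " bytes");
    }
}
```

The per-cell key material is what a document-style layout tries to amortize: fewer cells per row means the row key, family and timestamp are repeated fewer times.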

15 A Document Store on HBase
DOT (Document Oriented Table) on HBase
  Each row contains a collection of documents (as well as row key)
  Each document contains a collection of fields
  A document is mapped to an HBase column and serialized using Avro, PB, etc.
Mapping relational table to DOT
  Each column mapped to a field
  Schema stored just once
  Read overheads amortized across different fields in a document
Implemented as an HBase Coprocessor Application
[Figure: relational table (Row Key, C1, C2, …, Cn) mapped to a DOT]
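The idea of packing several fields into one cell, with the schema kept once rather than per value, can be sketched in plain Java. The slide names Avro and PB as serializers; to stay self-contained this example swaps in a simple presence-flag-plus-string encoding from `java.io`, so it shows the layout idea only. The class and method names (`DocumentCell`, `encode`, `readField`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocumentCell {
    // Encode the fields of one document in schema order. Field names are not written:
    // the schema is stored once (e.g. in table metadata), not once per value.
    static byte[] encode(List<String> schema, Map<String, String> fields) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (String name : schema) {
                String value = fields.get(name);
                out.writeBoolean(value != null); // presence flag; absent fields read back as null
                if (value != null) {
                    out.writeUTF(value);
                }
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read one field back by walking the document in schema order.
    static String readField(List<String> schema, byte[] doc, String wanted) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(doc));
            for (String name : schema) {
                String value = in.readBoolean() ? in.readUTF() : null;
                if (name.equals(wanted)) {
                    return value;
                }
            }
            throw new IllegalArgumentException("unknown field: " + wanted);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("c1", "c2", "c3");
        Map<String, String> row = new HashMap<>();
        row.put("c1", "a");
        row.put("c3", "x");
        byte[] doc = encode(schema, row); // one cell value instead of one cell per column
        System.out.println(doc.length + " bytes, c3 = " + readField(schema, doc, "c3"));
    }
}
```

Reading a single field still decodes the fields before it, which is one reason the grouping of columns into documents matters for scan performance.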

16 Working with DOT
Hive/SQL queries on DOT
  Similar to running Hive with HBase today
    – Create a DOT in HBase
    – Create external Hive table with the DOT
        Use “doc.field” in place of “column qualifier” when specifying “hbase.columns.mapping”
    – Transparent to DML queries
        No changes to the query or the HBase storage handler

    CREATE EXTERNAL TABLE table_dot (key INT, C1 STRING, C2 STRING, C3 DOUBLE)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:d.c1,f:d.c2,f:d.c3")
    TBLPROPERTIES ("hbase.table.name" = "table_dot");
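Each entry of the mapping string in the DDL has the shape `family:doc.field`, with `:key` standing for the row key. The following Java sketch splits such a string into its parts to make the convention concrete. It is an illustration only, not the storage handler's or the coprocessor's parsing code, and the class name `DotColumnMapping` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class DotColumnMapping {
    // Split a mapping such as ":key,f:d.c1,f:d.c2,f:d.c3" into
    // {family, document, field} triples. The ":key" entry is skipped.
    static List<String[]> parse(String mapping) {
        List<String[]> out = new ArrayList<>();
        for (String raw : mapping.split(",")) {
            String entry = raw.trim();
            if (entry.equals(":key")) {
                continue; // row key, not a document field
            }
            int colon = entry.indexOf(':');
            int dot = entry.indexOf('.', colon + 1);
            if (colon < 1 || dot < 0) {
                throw new IllegalArgumentException("expected family:doc.field, got: " + entry);
            }
            out.add(new String[] {
                entry.substring(0, colon),       // column family, e.g. "f"
                entry.substring(colon + 1, dot), // document, e.g. "d"
                entry.substring(dot + 1)         // field, e.g. "c1"
            });
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] e : parse(":key,f:d.c1,f:d.c2,f:d.c3")) {
            System.out.println("family=" + e[0] + " document=" + e[1] + " field=" + e[2]);
        }
    }
}
```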

17 Working with DOT
Create a DOT in HBase
  Required to specify the schema and serializer (e.g., Avro) for each document
    – Stored in table metadata by the preCreateTable coprocessor
  I.e., the table schema is fixed and predetermined at table creation time
    – OK for Hive/SQL queries

    HTableDescriptor desc = new HTableDescriptor("t1");
    // Specify a DOT table
    desc.setValue("hbase.dot.enable", "true");
    desc.setValue("hbase.dot.type", "ANALYTICAL");
    …
    HColumnDescriptor cf2 = new HColumnDescriptor(Bytes.toBytes("cf2"));
    // Specify the contained document
    cf2.setValue("hbase.dot.columnfamily.doc.element", "d3");
    // Schema of document d3 (an Avro record with three bytes fields)
    String doc3Schema = "{\n"
        + "  \"name\": \"d3\",\n"
        + "  \"type\": \"record\",\n"
        + "  \"fields\": [\n"
        + "    {\"name\": \"f1\", \"type\": \"bytes\"},\n"
        + "    {\"name\": \"f2\", \"type\": \"bytes\"},\n"
        + "    {\"name\": \"f3\", \"type\": \"bytes\"} ]\n"
        + "}";
    // Specify the schema for d3
    cf2.setValue("hbase.dot.columnfamily.doc.schema.d3", doc3Schema);
    desc.addFamily(cf2);
    admin.createTable(desc);

18 Working with DOT
Data access for DOT
  Transparent to the user
    – Just specify “doc.field” in place of “column qualifier”
    – Mapping between “document”, “field” & “column qualifier” handled by coprocessors automatically
  Additional check for Put/Delete today
    – All fields in a document expected to be updated together; otherwise:
        Warning for Put (missing field set to NULL value)
        Error for Delete
    – OK for Hive queries

    Scan scan = new Scan();
    // Address fields as "doc.field" where a column qualifier would normally go
    scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("d1.f1"))
        .addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("d3.f1"));
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("cf1"), Bytes.toBytes("d1.f1"),
        CompareFilter.CompareOp.EQUAL, new SubstringComparator("row1_fd1"));
    scan.setFilter(filter);
    HTable table = new HTable(conf, "t1");
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
        System.out.println(result);
    }
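The Put/Delete rule on this slide (a partial Put is accepted with a warning and the missing fields become NULL; a partial Delete is an error) can be sketched as a small validation step. This is a hypothetical illustration in plain Java, not the DOT coprocessor's code: the class and method names and the use of `IllegalArgumentException` are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DocumentUpdateCheck {
    // Fields of the document schema that a mutation does not mention.
    static Set<String> missing(Set<String> schemaFields, Set<String> touched) {
        Set<String> rest = new TreeSet<>(schemaFields);
        rest.removeAll(touched);
        return rest;
    }

    // Put: a partial update is accepted with a warning; absent fields are stored as null.
    static Map<String, String> checkPut(Set<String> schemaFields, Map<String, String> put) {
        Map<String, String> full = new HashMap<>(put);
        for (String field : missing(schemaFields, put.keySet())) {
            System.err.println("WARNING: field " + field + " not set; storing NULL");
            full.put(field, null);
        }
        return full;
    }

    // Delete: removing only some fields of a document is rejected.
    static void checkDelete(Set<String> schemaFields, Set<String> deleted) {
        Set<String> rest = missing(schemaFields, deleted);
        if (!rest.isEmpty()) {
            throw new IllegalArgumentException("partial delete; untouched fields: " + rest);
        }
    }

    public static void main(String[] args) {
        Set<String> schema = new TreeSet<>(Arrays.asList("f1", "f2", "f3"));
        Map<String, String> put = new HashMap<>();
        put.put("f1", "a");
        System.out.println(checkPut(schema, put).keySet()); // all three fields are present
        checkDelete(schema, schema);                        // full delete passes
    }
}
```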

19 Some Results
Benchmarks
  Create an 18-column table in Hive (on HBase) and load ~567 million rows
Table storage
  1.7~3x space reduction w/ DOT
Data loading
  ~1.9x speedup for bulk load w/ DOT
  3~4x speedup for insert w/ DOT

20 Some Results
Benchmarks
  Select various numbers of columns from the table
    select count (col1, col2, …, coln) from table
SELECT performance: up to 2x speedup w/ DOT

21 Summary
“Project Panthera”
  Our open source efforts to enable better analytics capabilities on Hadoop/HBase
An analytical SQL engine for MapReduce
  – Provide full SQL support for OLAP
      Complex subquery, multiple-table SELECT, etc.
  – Umbrella JIRA HIVE-3472
A document store for better query processing on HBase
  – Provide document semantics & significantly speed up query processing
      Up to 3x storage reduction, up to 2x performance speedup
  – Umbrella JIRA HBASE

22 Thank You!
This slide deck and other related information will be available at
Any questions?
