SQL on Hadoop CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

ALL OF THESE... but HAWQ specifically, because that's what I know.

Problem! “MapReduce is great, but all of my data dudes don’t know Java.” Well, Pig and Hive exist... they are kind of SQL. “But Pig and Hive are slow and they aren’t really SQL... how can I efficiently use all of the SQL scripts that I have today?” Well, that's why all these companies are building SQL-on-Hadoop engines... like HAWQ.

SQL Engines for Hadoop Massively Parallel Processing (MPP) frameworks to run SQL queries against data stored in HDFS. Not MapReduce, but they still bring the code to the data. SQL for big data sets, but not stupid-huge ones; stupid-huge ones should still use MapReduce.

Current SQL Landscape Apache Drill (MapR), Cloudera Impala, Facebook Presto, Hive Stinger (Hortonworks), Pivotal HAWQ, Shark – Hive on Spark (Berkeley)

Why? Ability to execute complex multi-staged queries in-memory against structured data. Available SQL-based machine learning libraries can be ported to work on the system. A well-known and common query language to express data-crunching algorithms. Not all queries need to run for hours on end and be super fault-tolerant.

Okay, tell me more... Many visualization and ETL tools speak SQL, and need to do some hacked version for HiveQL. Can now connect these tools and legacy applications to “big data” stored in HDFS. You can start leveraging Hadoop with what you know and begin to explore other Hadoop ecosystem projects. [Your Excuse Here]

SQL on Hadoop Built for analytics! – OLAP vs OLTP. Large I/O queries against append-only tables. Write-once, read-many, much like MapReduce. Intent is to retrieve results and run deep analytics in ~20 minutes. Anything longer, and you may want to consider using MapReduce.

Architectures Architectures are all very similar: Master, Query Planner, Query Executor, HDFS.

Playing Nicely In the past it was just a DataNode and a TaskTracker... Now we have – DataNode – NodeManager – HBase RegionServer – SQL-on-Hadoop Query Executor – Storm Supervisor – ??? All of which need memory and CPU time

HAWQ Overview Greenplum database re-platformed on Hadoop/HDFS. HAWQ provides all major features found in the Greenplum database – SQL completeness: SQL-2003 extensions – Cost-Based Query Optimizer – UDFs – Row or Column-Oriented Table Storage – Parallel Loading and Unloading – Distributions – Multi-level Partitioning – High-speed data redistribution – Views – External Tables – Compression – Resource Management – Security – Authentication

Basic HAWQ Architecture

HAWQ Master Located on a separate node from the NameNode. Does not contain any user data. Contains the Global System Catalog. Authenticates client connections, processes SQL, distributes work between segments, coordinates results returned by segments, and presents final results to the client.

HAWQ Transactions No global transaction management – no updates or deletes. Transactions at the HAWQ master level – single-phase commit.

HAWQ Segments A HAWQ segment within a Segment Host is an HDFS client that runs on a DataNode. Multiple segments per Segment Host. The segment is the basic unit of parallelism – multiple segments work together to form a parallel query processing system. Operations execute in parallel across all segments.

Segments Access Data Stored in HDFS Segments are stateless. Segments communicate with the NameNode to obtain block lists where data is located. Segments access data stored in HDFS.

HAWQ Parser Clients connect via JDBC or SQL. Enforces syntax and semantics. Converts the SQL query into a parse tree data structure describing the details of the query.

Parallel Query Optimizer Cost-based optimization looks for the most efficient plan. Physical plan contains scans, joins, sorts, aggregations, etc. Directly inserts ‘motion’ nodes for inter-segment communication.

Parallel Query Optimizer Continued Inserts motion nodes for efficient non-local join processing (assume table A is distributed across all segments, i.e. each segment k holds a slice A_k) – Broadcast Motion (N:N): every segment sends its A_k to all other segments – Redistribute Motion (N:N): every segment rehashes A_k (by join column) and redistributes each row – Gather Motion (N:1): every segment sends its A_k to a single node (usually the master)

Parallel Query Optimization Example

SELECT c_custkey, c_name,
       sum(l_extendedprice * (1 - l_discount)) as revenue,
       c_acctbal, n_name, c_address, c_phone, c_comment
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate >= date ' '
  and o_orderdate < date ' ' + interval '3 month'
  and l_returnflag = 'R'
  and c_nationkey = n_nationkey
GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
ORDER BY revenue desc

Resulting plan (nodes from the slide’s plan diagram): Gather Motion 4:1 (slice 3), Sort, HashAggregate, HashJoin, Redistribute Motion 4:4 (slice 1), HashJoin, Seq Scan on lineitem, Hash, Seq Scan on orders, Hash, HashJoin, Seq Scan on customer, Hash, Broadcast Motion 4:4 (slice 2), Seq Scan on nation.

HAWQ Query Optimizer (plan tree for the example above): Gather Motion, Sort, HashAggregate, HashJoin, Redistribute Motion, HashJoin, Seq Scan on lineitem, Hash, Seq Scan on orders, Hash, HashJoin, Seq Scan on customer, Hash, Broadcast Motion, Seq Scan on nation.

HAWQ Dispatcher and Query Executor

HAWQ Dynamic Pipelining Parallel data flow using a UDP-based interconnect. No materialization of intermediate results, unlike MapReduce.

User Defined Function Support – C functions – User Defined Operators – PL/pgSQL – PGCrypto – Future: User Defined Types, Nested Functions, Oracle functions, PL/R, PL/Python, User Defined Aggregates
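As a minimal sketch of what a UDF looks like here: HAWQ inherits PostgreSQL's CREATE FUNCTION syntax, so a PL/pgSQL function can be defined and called directly from SQL (the function name and logic below are made up for illustration):

-- Hypothetical PL/pgSQL UDF; standard PostgreSQL-style syntax, not HAWQ-specific
CREATE OR REPLACE FUNCTION discounted_price(price numeric, discount numeric)
RETURNS numeric AS $$
BEGIN
    RETURN price * (1 - discount);
END;
$$ LANGUAGE plpgsql;

SELECT discounted_price(100.00, 0.15);  -- returns 85.00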

Machine Learning Support Open-Source MADlib – Classification – Regression – Clustering – Topic Modeling – Association Rule Mining – Descriptive Statistics – Validation PostGIS
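As a rough sketch of what calling MADlib looks like from SQL (the table and column names are hypothetical; check the MADlib documentation for the exact signature of each function):

-- Hypothetical training table: houses(price, sqft, bedrooms)
SELECT madlib.linregr_train(
    'houses',                     -- source table
    'houses_model',               -- output table for the fitted model
    'price',                      -- dependent variable
    'ARRAY[1, sqft, bedrooms]'    -- independent variables (1 = intercept)
);

SELECT coef, r2 FROM houses_model;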

HAWQ Data Storage and I/O Overview DataNodes are responsible for serving read and write requests from HAWQ segments. Data stored external to HAWQ can be read using Pivotal Xtension Framework (PXF) external tables. Data stored in HAWQ can be written to HDFS for external consumption using PXF Writable HDFS Tables. MapReduce can access data stored in HAWQ using provided Input/Output formats.

HAWQ Storage Formats Append Only – read-optimized – distributions – partitioning. Column Store – compression: quicklz, zlib, RLE – MR Input/Output format. Parquet – open-source format – Snappy, gzip.
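A sketch of how these options appear in the CREATE TABLE WITH clause, following the Greenplum/HAWQ append-only storage options (table and column names are made up; exact option names can vary by version):

-- Append-only, column-oriented table with zlib compression (illustrative)
CREATE TABLE clickstream (
    user_id    bigint,
    url        text,
    click_time timestamp
)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (user_id);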

HAWQ Data Locality Think back... how does a client write blocks?

Accessed through libhdfs3 Refactored libhdfs resulting in libhdfs3 – a C-based library for interacting with HDFS – leverages protocol buffers to achieve greater performance. libhdfs3 is used to access blocks from HAWQ – AKA short-circuit reads. libhdfs3 gives huge performance gains over the JNI-based libhdfs.

Sharded Data and Segment Processors Data is physically sharded in HDFS using a directory structure. Each segment gets its own directory, and blocks are written locally so reads hit local disk. The affinity between HAWQ segments and shards provides significant I/O gains.

Data Distributions Every table has a distribution method. DISTRIBUTED BY (column) – uses a hash distribution. DISTRIBUTED RANDOMLY – uses a random distribution, which is not guaranteed to provide a perfectly even distribution.
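For example, on a hypothetical table (Greenplum/HAWQ-style DDL):

-- Hash-distribute on cust_id so equal keys land on the same segment
CREATE TABLE sales (
    sale_id bigint,
    cust_id bigint,
    amount  numeric(10,2)
) DISTRIBUTED BY (cust_id);

-- Or let HAWQ spread rows without a hash key
CREATE TABLE staging_sales (LIKE sales) DISTRIBUTED RANDOMLY;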

Partitioning Reduces the amount of data to be scanned by reading only the relevant data needed to satisfy a query. Supports range partitioning and list partitioning. There can be a large number of partitions depending on the partition granularity – every partition is a file in HDFS.
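A sketch of list partitioning on a hypothetical table (range partitioning appears with the multi-level example on the next slide):

-- List-partitioned table; each partition becomes its own file(s) in HDFS
CREATE TABLE orders (
    order_id   bigint,
    region     text,
    order_date date
)
DISTRIBUTED BY (order_id)
PARTITION BY LIST (region)
(
    PARTITION americas VALUES ('us', 'ca', 'br'),
    PARTITION emea     VALUES ('uk', 'de', 'fr'),
    DEFAULT PARTITION other_regions
);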

Multi-Level Partitioning Use hash distribution to evenly spread data across all nodes. Use range partitioning within a node to minimize scan work. (Diagram: segments 1A–3D, each holding monthly range partitions for Jan 2007 through Dec 2007.)
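A sketch matching the diagram: hash-distribute on a key, then range-partition by month (table and column names are made up):

-- Hash distribution across segments plus monthly range partitions within each
CREATE TABLE sales_2007 (
    sale_id   bigint,
    cust_id   bigint,
    sale_date date,
    amount    numeric(10,2)
)
DISTRIBUTED BY (cust_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2007-01-01') INCLUSIVE
    END   (date '2008-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);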

HAWQ Parquet Store multiple columns in a single block – reduces the number of files for each table. Added support for HAWQ complex data types – only 4-5 formats supported in Parquet – no UDTs or arrays (for now). Added support for append operations with Parquet. Set at table and partition level – one partition can use Parquet, another AO, etc.
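Illustrative DDL, assuming the orientation=parquet storage option described in the HAWQ documentation (table name and compression choice are examples):

-- Parquet-backed table; other partitions of the same table could use AO instead
CREATE TABLE events_parquet (
    event_id   bigint,
    event_time timestamp,
    payload    text
)
WITH (appendonly=true, orientation=parquet, compresstype=snappy)
DISTRIBUTED BY (event_id);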

HAWQ Fault Tolerance Fault tolerance through HDFS replication. The replication factor is decided when creating a file space in HDFS. When a segment fails, the shard is accessible from another node – via the NameNode, and then the DataNode to which the shard was replicated.

HAWQ Master Standby Master standby on a separate host from the HAWQ Master. Warm standby kept up to date by transactional log replication. Replicated logs are used to reconstruct the state of the HAWQ Master. System catalogs are synchronized.

Pivotal Xtension Framework External table interface inside HAWQ to read data stored in the Hadoop ecosystem. External tables can be used to load data into HAWQ from Hadoop, or to query Hadoop data without materializing it into HAWQ. Enables loading and querying of data stored in HDFS, HBase, and Hive.

PXF Features  Applies data locality optimizations to reduce resources and network traffic Supports filtering through predicate push down in HBase –, =, =, != between a column and a constant – Can AND between these (but not OR) Supports Hive table partitioning Supports ANALYZE for gathering HDFS file statistics and having it available for the query planner at run time Extensible framework via Java to enable custom data sources and formats

PXF Supported Data Formats Text/CSV, SequenceFiles, Avro, JSON, Hive, HBase, Accumulo
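As a sketch of what a PXF external table definition looks like (the host, port, path, and profile name are placeholders; the exact LOCATION syntax depends on your PXF version):

-- Read a CSV file in HDFS through PXF without loading it into HAWQ first
CREATE EXTERNAL TABLE ext_clicks (
    user_id    bigint,
    url        text,
    click_time text
)
LOCATION ('pxf://pxf-host:51200/data/clicks/2015/*.csv?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');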

PXF Architecture Components: Fragmenter, Accessor, Resolver, Analyzer. Often able to leverage custom Hadoop I/O formats.

PXF Loading into HAWQ To load data into HAWQ, use a variation of – INSERT INTO … SELECT * FROM …; Data can be transformed in-flight before loading. Data from PXF can also be joined in-flight with native tables. The number of segments responsible for connecting and concurrently reading data can be tuned.
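A sketch using the hypothetical external table from the PXF example above:

-- Create a native HAWQ table and load/filter the external data in flight
CREATE TABLE clicks (
    user_id    bigint,
    url        text,
    click_time text
) DISTRIBUTED BY (user_id);

INSERT INTO clicks
SELECT user_id, lower(url), click_time
FROM ext_clicks
WHERE url IS NOT NULL;   -- transform and filter while loading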

References Apache Drill, Cloudera Impala, Facebook Presto, Hive Stinger, Pivotal HAWQ, Shark