PolyBase: Query Hadoop with Ease
Sahaj Saini, Program Manager, Microsoft
Agenda: Why PolyBase? What it is and why customers need it; how it works; demo; Q&A.
Why?
All the interest in Big Data: an increased number and variety of data sources generating large quantities of data, the realization that data is "too valuable" to delete, and a dramatic decline in the cost of hardware, especially storage.
The Hadoop Ecosystem
Hadoop Evolution: initially, MapReduce for insights from HDFS-resident data; more recently, SQL-like data warehouse technologies on HDFS, e.g. Hive, Impala, HAWQ, Spark/Shark.
What if you use both an RDBMS and Hadoop?
What is PolyBase?
Big Picture: PolyBase provides a T-SQL language extension for combining data from both worlds.
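To make that concrete, here is a minimal sketch of such a query; the table and column names are hypothetical illustrations, not taken from the slide (the SensorData external table is defined later in the deck).

-- Join a local relational table with an external table whose data lives in HDFS.
SELECT c.CustomerName, s.Speed
FROM dbo.Customer AS c                 -- hypothetical local SQL Server table
JOIN dbo.SensorData AS s               -- hypothetical external table over HDFS files
    ON c.VehicleId = s.VehicleId
WHERE s.Speed > 65;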
The PolyBase journey: 2012-2014, PolyBase in SQL Server PDW V2 (Analytics Platform System); 2015, PolyBase in SQL Server 2016 CTP2 and CTP3, and PolyBase in Azure SQL Data Warehouse; 2016, general availability.
PolyBase in SQL Server 2016
Customer Example: Auto Insurance. Usage-based insurance combines non-relational sensor data from cars (kept in Hadoop) with structured customer data (kept in APS), giving the insurer the ability to adjust policies based on driver behavior: 'pay-as-you-drive' driver discounts and policy adjustments. Status: in production.
PolyBase Use Cases
How does PolyBase work?
Step 1: Set up a Hadoop cluster. Hortonworks or Cloudera distributions; Linux or Windows; on-premises or in Azure. (Diagram: Hadoop cluster with Namenode (HDFS) and file system.)
Or use an Azure Storage account: Azure Storage Blob (ASB) exposes an HDFS layer, and PolyBase reads and writes from ASB using Hadoop APIs, as sketched below.
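A minimal sketch of pointing PolyBase at ASB; the container, account, and credential names are placeholders, and a database master key is assumed to already exist so the credential can be created.

-- Credential holding the storage account access key (placeholder values).
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account access key>';

-- External data source over the blob container, addressed via wasbs://.
CREATE EXTERNAL DATA SOURCE MyAzureStorage WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);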
Step 2: Install SQL Server and select the PolyBase feature, which adds two new services: the PolyBase Engine and the PolyBase Data Movement Service (DMS). Prerequisite: download and install the JRE.
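One way to verify the feature afterwards is the documented server property below; it returns 1 when PolyBase is installed.

-- Returns 1 if the PolyBase feature is present on this instance.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;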
Step 3: Scale out. 1. Install multiple SQL Server instances with PolyBase. 2. Choose one as the head node. 3. Configure the remaining instances as compute nodes: a. run the join stored procedure (see the sketch below), b. shut down the PolyBase Engine, c. restart the PolyBase DMS.
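A sketch of step 3a on a compute node, assuming the documented sp_polybase_join_group procedure; the machine and instance names are placeholders and 16450 is the default DMS control channel port.

-- Join this instance to the PolyBase group anchored at the head node.
EXEC sp_polybase_join_group 'HEADNODE01', 16450, 'MSSQLSERVER';
-- Then stop the PolyBase Engine service and restart the PolyBase DMS service
-- on this node (steps 3b and 3c).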
After Step 3: a PolyBase group for scale-out computation. The head node contains the SQL Server instance to which PolyBase queries are submitted; the compute nodes are used for scale-out query processing on external data.
Step 4: Choose the Hadoop flavor. Latest supported distributions: Cloudera CDH 5.5 on Linux; Hortonworks HDP 2.1, 2.2, 2.3 on Linux; Hortonworks HDP 2.0, 2.2, 2.3 on Windows Server; Azure blob storage (ASB). What happens under the covers? The right client JARs are loaded to connect to Hadoop; different configuration values map to the various Hadoop flavors, for example value 5 for HDP 2.0 on Linux, value 6 for CDH 5.1-5.5 on Linux, and value 7 for HDP 2.1/2.2/2.3 on Linux/Windows or ASB (see the example below).
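A sketch of selecting the flavor, assuming the documented 'hadoop connectivity' configuration option; value 7 matches the slide's example of HDP 2.1/2.2/2.3 on Linux/Windows or ASB.

-- Map this instance to a Hadoop flavor; a restart of SQL Server and the
-- PolyBase services is required for the change to take effect.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;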
After Step 4. (Diagram: the PolyBase group connected to the Hadoop Namenode (HDFS) and file system.)
Demo: PolyBase in SQL Server 2016
PolyBase Design
Under the hood: exploiting the compute resources of Hadoop clusters with push-down computation.
HDFS bridge in DMS: uses Hadoop RecordReaders and RecordWriters to read and write standard HDFS file types.
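Because the bridge also writes, an external table can serve as an export target. A sketch, assuming the documented 'allow polybase export' option and the hypothetical SensorData external table defined later in the deck; LocalSensorReadings is a placeholder local table.

-- Enable writing to external tables, then export rows into HDFS files.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
INSERT INTO dbo.SensorData
SELECT VehicleId, State, Speed
FROM dbo.LocalSensorReadings;   -- hypothetical local table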
Data moves between the SQL Server 2016 cluster and the Hadoop cluster in parallel. (Diagram: SQL16 instances and a Hadoop cluster with Namenode (HDFS) and file system.)
Creating external tables: an external data source is created once per Hadoop cluster, an external file format once per file format, and the external table itself specifies the HDFS file path, as sketched below.
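A minimal sketch of the three DDL statements; the data source, format, and table names, the cluster address, and the SensorData columns are all assumptions for illustration.

-- Once per Hadoop cluster: where the Namenode lives.
CREATE EXTERNAL DATA SOURCE MyHadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.10.10.10:8020'
);

-- Once per file format: how the files are encoded.
CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- The external table itself, with its HDFS file path.
CREATE EXTERNAL TABLE dbo.SensorData (
    VehicleId INT,
    State     VARCHAR(2),
    Speed     INT
)
WITH (
    LOCATION = '/sensordata/',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = TextFileFormat
);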
Creating external tables (secure Hadoop): the same pattern, plus a credential created once per Hadoop user and referenced by the external data source; the file format and HDFS file path are specified as before. A sketch follows below.
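A minimal sketch for a secured cluster, assuming a database scoped credential holds the Hadoop user's credentials; all names and secrets are placeholders.

-- A database master key is required before creating the credential.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- Once per Hadoop user.
CREATE DATABASE SCOPED CREDENTIAL HadoopUserCredential
WITH IDENTITY = 'hadoop_user', SECRET = '<password>';

-- The external data source ties the cluster location to the credential.
CREATE EXTERNAL DATA SOURCE MySecureHadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.10.10.10:8020',
    CREDENTIAL = HadoopUserCredential
);
-- The external file format and external table (with its HDFS file path) are
-- created exactly as in the non-secured example above.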
PolyBase Query Example #1

-- select on external table (data in HDFS)
SELECT * FROM SensorData WHERE Speed > 65;

A possible execution plan: 1. CREATE temp table T, executed on the SQL compute nodes. 2. IMPORT FROM HDFS: the HDFS SensorData file is read into T in parallel. 3. EXECUTE QUERY: SELECT * FROM T WHERE T.Speed > 65.
Push-down Computation (1): leverage the distributed query power of MapReduce. Example: an HDFS file/directory //hdfs/social_media/twitter/Daily.log with columns User, Location, Product, Sentiment, Rtwt, Hour, and Date; PolyBase applies column filtering, row filtering, and dynamic binding on the Hadoop side for a query such as:

SELECT User, Product, Sentiment
FROM Twitter_External_Table
WHERE Hour = Current - 1 AND Date = Today AND Sentiment > 0;
Push-down computation (2): SQL operations on HDFS data are pushed into Hadoop as MapReduce jobs, and a cost-based decision determines how much computation to push. (Diagram: a PolyBase query spanning the database and HDFS blocks, with a Map job running in Hadoop.)
Cost-based Decision (for split-based query execution)
- The major factor in the decision is data volume reduction.
- Hadoop takes 20-30 seconds to spin up a Map job, and the spin-up time varies by distribution and OS, so there is no push-down for scenarios where SQL can execute in under 20-30 seconds without it.
- The cardinality of the predicate matters, so create statistics on the external table; they are not auto-created (see the sketch below).
- Queries can have "pushable" and "non-pushable" expressions and predicates: pushable ones are evaluated on the Hadoop side, non-pushable ones are processed on the SQL side, aggregate functions (sum, count, ...) are partially pushed, and JOINs are never pushed; they always execute on the SQL side.
(Diagram: external table, external data source, external file format; PolyBase engine service, HDFS bridge as part of DMS, job submitter.)
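A sketch of creating statistics on the hypothetical SensorData external table from the earlier example, so the optimizer can estimate predicate cardinality.

-- Statistics on external tables are not created automatically.
CREATE STATISTICS SensorDataSpeedStats ON dbo.SensorData (Speed) WITH FULLSCAN;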
PolyBase Query Example #2

-- select and aggregate on external table (data in HDFS)
SELECT AVG(Speed) FROM SensorData WHERE Speed > 65 GROUP BY State;

Execution plan (first operation): run an MR job on Hadoop that applies the filter and computes the aggregate on SensorData. What happens here? First, the query optimizer compiles the predicate into Java; then the PolyBase Engine submits the Map job to the Hadoop cluster, and the output is left in hdfsTemp (partial results per state, e.g. CA, AK).
PolyBase Query Example #2, continued. The query optimizer made a cost-based decision on which operators to push, and the predicate and aggregate were pushed into the Hadoop cluster as a Map job. The full execution plan:
1. Run the MR job on Hadoop: apply the filter and compute the aggregate on SensorData; the output is left in hdfsTemp.
2. CREATE temp table T on the SQL compute nodes.
3. IMPORT hdfsTemp: read hdfsTemp into T.
4. RETURN OPERATION: read from T and do the final aggregation.
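The cost-based choice can also be overridden per query. A sketch, assuming the documented EXTERNALPUSHDOWN query hints in SQL Server 2016.

-- Force the filter and aggregate to run in Hadoop as a Map job.
SELECT AVG(Speed) FROM SensorData WHERE Speed > 65 GROUP BY State
OPTION (FORCE EXTERNALPUSHDOWN);

-- Never push down; import the HDFS data and compute entirely in SQL Server.
SELECT AVG(Speed) FROM SensorData WHERE Speed > 65 GROUP BY State
OPTION (DISABLE EXTERNALPUSHDOWN);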
Acknowledgments: Dr. David DeWitt, for letting me use his material to explain the PolyBase technology, and our team in the Gray Systems Lab, Madison and Aliso Viejo.
Questions? sahajs@microsoft.com