PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme

Presentation transcript:

PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme Microsoft Jim Gray Systems Lab

A Little Bit of PolyBase History… [Timeline, 2012 to 2016 and beyond.] Past: PolyBase in SQL Server PDW V2. Today: PolyBase is now part of SQL 16 (CTP3). Future: PolyBase for IoT???

Our Plan for Today… PolyBase: what? why? how? what’s next?

What is PolyBase?

Big Picture: RDBMS and Hadoop. PolyBase provides a scalable, T-SQL compatible query processing framework for combining data from both worlds. © 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

More Specifically: Allow SQL 16 customers to execute T-SQL queries against relational data in SQL Server and “semistructured” data in HDFS and/or Azure. [Diagram: a SELECT is submitted to SQL Server 16, which returns results drawn from Windows Azure Blob Storage (WASB) and Hadoop (non-relational data).]

PolyBase for SQL Server 16: What Does It Mean? Full T-SQL interface for HDFS-resident data. Clusters of SQL Server 16 instances can be used to process data residing in HDFS in parallel. Queries can arbitrarily mix relational data in SQL Server with data in HDFS or Azure. All standard BI tools supported.

Why All the Interest in Big Data? Increased number and variety of data sources that generate large quantities of data: sensors (e.g. location, speed, acceleration rates, acoustical, …), Web 2.0 (e.g. Twitter, wikis, …), and web clicks. Realization that data is “too valuable” to delete. Dramatic decline in the cost of hardware, especially storage: if storage were still $100/GB, there would be no big data revolution underway.

Hadoop Community World View. HDFS (Hadoop): a scalable distributed file system; highly fault-tolerant; append only (no updates!); the underpinnings of the entire Hadoop ecosystem. Big Data goes HERE. Insight: initially via MapReduce; recently via SQL-like systems, i.e. parallel SQL systems for data warehousing on HDFS (Hive, Impala, HAWQ, Spark/Shark). [Diagram label: rest of the world.]

PolyBase’s World View. Two universes of data: structured relational data, and semistructured data in HDFS for “big data”. PolyBase goal: provide a scalable, T-SQL compatible query processing framework for combining data from both universes. [Diagram: the relational (structured) and HDFS (semi-structured) universes, joined by PolyBase to produce insight: “Can I join you?” “You sure CAN!”]

PolyBase Use Cases. 1. Hadoop as an ETL tool, e.g., cleansing data before loading it. 2. HDFS as ‘cold’, but query-able, storage. 3. Sentiment analysis, e.g., joining relational tables w/ streams of tweets. 4. Sensor data analysis, e.g., mine sensor data for predictive analytics.

Example #1: Progressive Usage-Based Insurance. Non-relational sensor data: sensor data from cars (kept in Hadoop). Structured customer data: relational data (kept in SQL Server PDW/APS). Result: price policies (based on driver behavior).

Example #2: ShinSeGae (Korea’s Amazon). Non-relational social media data (kept in Hadoop). Structured product data (kept in SQL Server PDW/APS). Result: basket analysis of online shoppers (based on social media behavior). All standard BI tools (e.g. Excel, Tableau, Powerview) just work!

Example #3: Wind Turbine Manufacturer (Turbine Monitoring). Analyzing sensor data from wind turbines deployed worldwide (kept in Hadoop) combined with relational turbine data (kept in SQL Server). Ability to do change detection, proactive maintenance and reporting.

PolyBase for SQL 16 Deployment. Step 1: Set up a Hadoop cluster (if you don’t have one already). Hortonworks or Cloudera distributions. Hadoop 2.0 or above. Linux or Windows. On premises or in Azure.

Or an Azure Storage Account. Azure Storage Blobs (ASB) exposes an HDFS layer. PolyBase reads and writes from ASB using Hadoop RecordReaders/Writers. No compute push-down support for ASB. [Diagram: Azure Storage Volume.]

Step 2: Install One or More SQL Server Instances. Select the PolyBase feature. Specify PolyBase service accounts (a domain account is required for scale-out). PolyBase DLLs (Engine and DMS) are installed and registered as Windows Services. Prerequisite: download and install the JRE.

Step 3: Instantiate a PolyBase Scale-Out Group. 1. One machine is used as a Head Node by starting the Engine & DMS services. 2. Zero or more other machines become Compute Nodes via a stored procedure (shutdown and restart needed). [Diagram: a Head Node running SQL 16 with the Engine Service and DMS, plus Compute Nodes each running SQL 16 with DMS.] [Slide footer: Microsoft Ignite 2015.]

After Step 3: PolyBase Scale-Out Group. The head node is the SQL Server instance to which queries are submitted. Compute nodes are used for scale-out query processing on external data. [Diagram: the computation scale-out group, with a Head Node (SQL 16, Engine Service, DMS) and Compute Nodes.]

In Case You Were Wondering… The Head Node is actually also always a Compute Node. [Diagram: a Head Node (SQL 16, Engine Service, DMS) that doubles as a Compute Node.]

Step 4: Choose Hadoop Flavor. Different numbers map to various Hadoop flavors; for example, value 4 stands for HDP 2.0 on Linux or ASB, value 5 for HDP 2.0 on Windows, and value 6 for CDH 5.1 on Linux. Supported Hadoop distributions in CTP3: Cloudera CDH 5.1-5.5 on Linux; Hortonworks 2.0 - 2.3 on Linux; Hortonworks 2.0, 2.2, and 2.3 on Windows Server. What happens under the covers? The right client jars to connect to Hadoop are loaded.

Step 5: Attach a Hadoop Cluster.
CREATE EXTERNAL DATA SOURCE GSL_HDFS_CLUSTER WITH (TYPE = HADOOP, …)
CREATE EXTERNAL DATA SOURCE AV_DFS_CLUSTER WITH (TYPE = HADOOP, …)
Hadoop clusters must both be either Cloudera or Hortonworks (for now). [Diagram: Hadoop Cluster #2 and an Azure Storage Volume attached to the scale-out group.]

After Setup… T-SQL queries are submitted to the head node. Queries can only refer to tables on the head node and/or external tables in HDFS/Azure. Tables on the 3 compute nodes cannot be referenced by queries submitted to the head node. Compute nodes are used for scale-out query processing on external tables in HDFS/Azure. If failover clustering is enabled, queries can be submitted to any compute node. Hadoop clusters can be shared between different SQL 16 compute groups.

PDW, SQL DW and SQL 16: Key Differences
SQL PDW: scale-out storage: Relational, HDFS, Azure; scale-out QP: Relational, HDFS, Azure; language surface: T-SQL subset; delivery vehicle: Appliance.
SQL DW: scale-out storage: Relational, HDFS, Azure; scale-out QP: Relational, HDFS, Azure; language surface: T-SQL subset; delivery vehicle: Cloud (PaaS).
SQL 16: scale-out storage: HDFS, Azure; scale-out QP: HDFS, Azure; language surface: Full T-SQL (RTM); delivery vehicle: “Box” & Cloud (IaaS).

Key Technical Challenges. Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, …). Parallelizing data transfers between compute nodes and HDFS data nodes. Imposing structure on unstructured data in HDFS, using the external table concept. Exploiting computational resources of Hadoop clusters.

HDFS Bridge in PolyBase. DMS is augmented with an HDFS Bridge, which hides the complexity of HDFS. It uses Hadoop “RecordReaders/Writers”, so standard HDFS file types can be read/written. It is used to transfer data in parallel to & from Hadoop. [Diagram: a Compute Node (SQL Server, DMS with HDFS Bridge) connected to a Hadoop cluster.]

Data moves between clusters in parallel. [Diagram: on the SQL side, a Head Node (Engine Service, SQL Server, DMS, DB) and Compute Nodes; on the Hadoop side, a Hadoop cluster with a Namenode (HDFS) and DataNodes over the file system.]
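
The idea of parallel transfer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how file splits are assigned to compute nodes, so the round-robin policy, the node names, and the `read_split` stand-in are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def assign_splits(splits, compute_nodes):
    """Round-robin assignment of HDFS splits to compute nodes (assumed policy)."""
    plan = {node: [] for node in compute_nodes}
    for i, split in enumerate(splits):
        plan[compute_nodes[i % len(compute_nodes)]].append(split)
    return plan

def read_split(split):
    """Hypothetical stand-in for a DMS reader pulling one block from a DataNode."""
    data_node, block_id = split
    return f"rows of {block_id} from {data_node}"

def load_node(node_splits):
    """Work done by one compute node: read every split assigned to it."""
    return [read_split(s) for s in node_splits]

def parallel_import(plan):
    """Run one reader per compute node concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=len(plan)) as pool:
        futures = {node: pool.submit(load_node, node_splits)
                   for node, node_splits in plan.items()}
        return {node: future.result() for node, future in futures.items()}

# Five blocks spread over three DataNodes, imported by two compute nodes.
splits = [("datanode1", "blk_0"), ("datanode2", "blk_1"), ("datanode1", "blk_2"),
          ("datanode3", "blk_3"), ("datanode2", "blk_4")]
plan = assign_splits(splits, ["compute1", "compute2"])
imported = parallel_import(plan)
```

Each compute node reads only its own subset of blocks, so no single node has to funnel the whole file.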

Creating External Tables
-- Once per Hadoop cluster
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP,
      LOCATION = 'hdfs://10.193.26.177:8020',
      RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050');
-- Once per file format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));
-- LOCATION below is the HDFS file path
CREATE EXTERNAL TABLE [dbo].[Customer] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [Speed] float NOT NULL
)
WITH (LOCATION = '//Sensor_Data//May2014/sensordata.tbl',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = TextFile);
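
What an external table does conceptually (imposing a schema on a delimited file at read time) can be sketched in Python. This is a rough model, not PolyBase’s reader: the column list and the `|` terminator come from the slide, while the function name, the sample data, and the handling of missing values (type default when `use_type_default` is true, an error otherwise) are simplifying assumptions.

```python
import gzip
import io

# Column names and types taken from the slide's CREATE EXTERNAL TABLE statement.
SCHEMA = [("SensorKey", int), ("CustomerKey", int), ("Speed", float)]

def read_external_rows(raw_bytes, schema=SCHEMA, field_terminator="|",
                       use_type_default=True):
    """Decode a gzip-compressed delimited file and yield one typed dict per line."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            fields = line.split(field_terminator)
            if len(fields) != len(schema):
                raise ValueError(f"expected {len(schema)} fields: {line!r}")
            row = {}
            for (name, cast), text in zip(schema, fields):
                if text == "":
                    if not use_type_default:
                        # Assumption: this sketch rejects the row outright.
                        raise ValueError(f"missing value for column {name}")
                    row[name] = cast()  # int() gives 0, float() gives 0.0
                else:
                    row[name] = cast(text)
            yield row

# Hypothetical file contents: the second record has no Speed value.
sample = gzip.compress(b"1|100|62.5\n2|101|\n")
rows = list(read_external_rows(sample))
```

The file itself stays untyped text; the structure lives entirely in the schema applied while reading.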

Creating External Tables (Secure Hadoop)
-- Once per Hadoop user
CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
WITH IDENTITY = 'hadoopUserName', SECRET = 'hadoopPassword';
-- Once per Hadoop user
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP,
      LOCATION = 'hdfs://10.193.26.177:8020',
      RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050',
      CREDENTIAL = HadoopCredential);
-- Once per file format
CREATE EXTERNAL FILE FORMAT TextFile
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));
-- LOCATION below is the HDFS file path
CREATE EXTERNAL TABLE [dbo].[Customer] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [Speed] float NOT NULL
)
WITH (LOCATION = '//Sensor_Data//May2014/sensordata.tbl',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = TextFile);

PolyBase Query Example #1: Selection on external table in HDFS. Query: select * from Customer where c_nationkey = 3 and c_acctbal < 0. A possible execution plan: 1. CREATE temp table T (executed on compute nodes). 2. IMPORT FROM HDFS (HDFS Customer file read into T). 3. EXECUTE QUERY: Select * from T where T.c_nationkey = 3 and T.c_acctbal < 0.
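
The three plan steps can be walked through with Python’s built-in `sqlite3` module standing in for the SQL Server instance on a compute node. This is a sketch under assumptions: the in-memory list plays the role of the HDFS Customer file, and the `c_custkey` column and the sample rows are hypothetical.

```python
import sqlite3

# Stand-in for the Customer file in HDFS: (c_custkey, c_nationkey, c_acctbal).
HDFS_CUSTOMER_FILE = [
    (1, 3, -12.50),
    (2, 3, 40.00),
    (3, 7, -5.00),
    (4, 3, -0.75),
]

def run_plan(hdfs_rows):
    db = sqlite3.connect(":memory:")
    # Step 1: CREATE temp table T.
    db.execute("CREATE TEMP TABLE T "
               "(c_custkey INTEGER, c_nationkey INTEGER, c_acctbal REAL)")
    # Step 2: IMPORT FROM HDFS, i.e. read the whole file into T.
    db.executemany("INSERT INTO T VALUES (?, ?, ?)", hdfs_rows)
    # Step 3: EXECUTE QUERY, applying the predicates locally.
    result = db.execute(
        "SELECT * FROM T WHERE c_nationkey = 3 AND c_acctbal < 0 "
        "ORDER BY c_custkey"
    ).fetchall()
    db.close()
    return result

result = run_plan(HDFS_CUSTOMER_FILE)
```

Note that in this plan every row crosses from Hadoop to the SQL side before any filtering happens, which is what the push-down plans later in the deck try to avoid.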

PolyBase Query Execution. 1. The query is parsed into a logical operator tree; “external tables” stored on HDFS are identified. 2. Parallel QO is performed, producing a physical operator tree; statistics on HDFS tables are used in the standard fashion. 3. The query plan generator walks the optimized query plan, converting subtrees whose inputs are all HDFS files into a sequence of MapReduce jobs. 4. The Engine Service submits the MapReduce jobs (as a JAR file) to the Hadoop cluster, leveraging the computational capabilities of the Hadoop cluster. [Diagram: SQL query, then Parser, Query Optimizer and Query Plan Generator inside the Engine Service, then Hadoop nodes over HDFS.]

Big Picture Takeaway. SQL operations on HDFS data are pushed into Hadoop as MapReduce jobs. Cost-based decision on how much computation to push. [Diagram: a numbered flow, 1 to 7, from the SQL query arriving at PDW or SQL 16, through a Map job run by MapReduce over HDFS blocks in Hadoop, back to the results.]
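
A toy model shows why the push-down decision has to be cost-based. Everything here is an assumption made for illustration: the cost formulas, the parameter names, and the constants are invented and are not PolyBase’s actual cost model. The only idea taken from the slides is that the optimizer compares plans and chooses how much work to push.

```python
def import_only_cost(total_rows, transfer_cost_per_row):
    """Cost of pulling every row out of HDFS and filtering on the SQL side."""
    return total_rows * transfer_cost_per_row

def pushdown_cost(total_rows, selectivity, transfer_cost_per_row,
                  mr_startup_cost, mr_cost_per_row):
    """Cost of filtering with a MapReduce job, then importing the survivors."""
    surviving_rows = total_rows * selectivity
    return (mr_startup_cost
            + total_rows * mr_cost_per_row
            + surviving_rows * transfer_cost_per_row)

def choose_plan(total_rows, selectivity, transfer_cost_per_row=1.0,
                mr_startup_cost=30_000.0, mr_cost_per_row=0.2):
    """Return 'pushdown' only when it is strictly cheaper than importing."""
    pushed = pushdown_cost(total_rows, selectivity, transfer_cost_per_row,
                           mr_startup_cost, mr_cost_per_row)
    imported = import_only_cost(total_rows, transfer_cost_per_row)
    return "pushdown" if pushed < imported else "import"

small_table = choose_plan(total_rows=1_000, selectivity=0.01)
large_selective = choose_plan(total_rows=1_000_000, selectivity=0.01)
large_unselective = choose_plan(total_rows=1_000_000, selectivity=1.0)
```

With these made-up constants the fixed job startup cost dominates for small inputs, and a predicate that keeps every row saves no transfer, so push-down wins only for large, selective scans.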

PolyBase Query Example #2: Selection and aggregate on external table in HDFS. Query: select avg(c_acctbal) from Customer where c_acctbal < 0 group by c_nationkey. Execution plan: 1. Run MR job on Hadoop: apply filter and compute aggregate on Customer; output left in hdfsTemp, e.g. <US, $-975.21>, <FRA, $-119.13>, <UK, $-63.52>. What really happens here? Step 1) QO compiles the predicate into Java and generates a MapReduce (MR) job. Step 2) QE submits the MR job to the Hadoop cluster.

PolyBase Query Example #2: Selection and aggregate on HDFS table. Query: select avg(c_acctbal) from Customer where c_acctbal < 0 group by c_nationkey. Execution plan: 1. Run MR job on Hadoop: apply filter and compute aggregate on Customer; output left in hdfsTemp, e.g. <US, $-975.21>, <FRA, $-119.13>, <UK, $-63.52>. 2. CREATE temp table T (on compute nodes). 3. IMPORT hdfsTemp (read hdfsTemp into T). 4. RETURN OPERATION: Select * from T. The predicate and aggregate are pushed into the Hadoop cluster as a MapReduce job. The query optimizer makes a cost-based decision on what operators to push.
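
Step 1 of this plan, the pushed-down filter and aggregate, follows the usual map, shuffle, reduce shape. The Python sketch below mimics that shape in a single process; it is not the Java job that the slides say PolyBase generates, and the sample records are hypothetical.

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: apply the pushed-down predicate, emit (c_nationkey, c_acctbal)."""
    for nationkey, acctbal in records:
        if acctbal < 0:
            yield nationkey, acctbal

def shuffle(pairs):
    """Group mapper output by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: compute avg(c_acctbal) for each c_nationkey."""
    return {key: sum(values) / len(values) for key, values in groups.items()}

# Hypothetical Customer records: (c_nationkey, c_acctbal).
customer = [("US", -1000.0), ("US", -950.0), ("US", 20.0),
            ("FRA", -100.0), ("UK", 10.0)]

# The reducer output plays the role of hdfsTemp in the plan above.
hdfs_temp = reduce_phase(shuffle(map_phase(customer)))
```

Only one small row per group leaves the Hadoop cluster, which is why steps 2 to 4 of the plan are cheap.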

PolyBase. Simplicity: query data in Hadoop and/or data in SQL Server via standard T-SQL. Highest possible performance: parallelized data transfers between SQL compute nodes and Hadoop clusters; push-down of SQL operations to Hadoop. Open: supports most popular Hadoop distributions for both Linux and Windows. Full integration with Microsoft Office & BI: Excel’s PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services.

Query Capabilities (1): Joining relational and external data. Querying external tables. Joining external with regular SQL tables. Pushing compute for basic expressions and aggregates. [Screenshot: a SELECT … FROM query joining a SQL table with external tables referring to data in two HDP Hadoop clusters.]

Query Capabilities (2): Push-Down Computation. Pushing compute can be controlled either at the data source level or on a per-query basis using query hints.

Query Capabilities (3): Multiple User IDs. Credential support for multiple user IDs associated with an external data source.

Query Capabilities (4): Seamless BI Integration. External tables ‘just’ show up as regular tables in tools such as Excel (e.g. PowerQuery) and Tableau.

Parallel Import/Load Scenario. SELECT INTO: importing data from Hadoop for higher speed access. ‘ETL’ type of processing possible via T-SQL. [Screenshot: a new SQL table is created from an external table referring to data in HDP Hadoop clusters.]

Export Scenario: Data Aging into Hadoop. INSERT INTO: exporting SQL data into Hadoop. ‘ETL’ type of processing possible via T-SQL. [Screenshot: an external table for aging data into Hadoop (as text files).]

Some Questions You Might Have… Will AlwaysOn (Hadron) be supported? Can a compute node be used for other SQL workloads? Can a compute node share a machine with a Hadoop data node? What SQL Server editions will I need? Will the MapR Hadoop distribution be supported? Is there a limit on the number of compute nodes? Why is it not possible to support different Hadoop distros simultaneously?

CTP3/RTM Limitations. 1. Each SQL 16 instance can participate in a single PolyBase cluster (a limitation of current DMS software). 2. No support for varchar(max) or unique column types. 3. No scale-out for joins between tables on the Head Node and external tables on HDFS (“local table scaleout”).

A Very Short Demo. SQL 16 w. PolyBase (4-node SQL scale-out group). 3-node Hortonworks 2.0 Linux cluster. Car telemetry data in Hadoop (CarSensor_Data). Customer data in SQL 16 (Insured_Customers).

Wrapup: PolyBase in SQL Server 16. Simplicity: query data in Hadoop, Azure and SQL Server via standard T-SQL. Highest possible performance: parallelized data transfers between SQL 16 instances and Hadoop data nodes; push-down of SQL operations to Hadoop using MR jobs. Open: supports most popular Hadoop distributions for both Linux and Windows. Full integration with Microsoft Office & BI: Excel’s PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services.

Acknowledgements. The entire PolyBase development team. Sahaj Saini for the demo and a few slides. Artin Avanes for his role as the PolyBase PM since inception.

THANK YOU

CTP3/RTM Query Processing Limitation. Query: Select C.Name from Orders O, Customers C where C.id = O.CustId and O.price > $1000. Probable CTP2 query plan: 1) Use MR job to find Orders > $1000 (result left in tmp). 2) Import result into SQL 16 instance on Head Node. 3) Perform join on Head Node.

A Better Query Plan. Query: Select C.Name from Customers C, Orders O where C.id = O.CustId and O.price > $1000. CTP3 plan likely to be: 1) Use MR job to find Orders > $1000. 2) Import result into SQL 16 instances on Compute Nodes. 3) Redistribute Customers table from Head Node to Compute Nodes. 4) Perform join in parallel on Compute Nodes.
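
The redistribute-then-join idea in steps 2 to 4 can be sketched in Python. The slide does not say how rows are distributed, so this sketch assumes both inputs are hash-partitioned on the join key; the helper names and sample rows are hypothetical, and a list comprehension stands in for the MR job of step 1.

```python
def hash_partition(rows, key_index, n_nodes):
    """Send each row to the node chosen by hashing its join key."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[hash(row[key_index]) % n_nodes].append(row)
    return parts

def local_join(customers, orders):
    """Join on one node: customers are (id, name), orders are (id, cust_id, price)."""
    names = {cust_id: name for cust_id, name in customers}
    return [names[cust_id] for _, cust_id, _ in orders if cust_id in names]

def parallel_join(customers, big_orders, n_nodes):
    cust_parts = hash_partition(customers, 0, n_nodes)     # step 3: redistribute
    order_parts = hash_partition(big_orders, 1, n_nodes)   # step 2: import
    result = []
    for node in range(n_nodes):                            # step 4: join per node
        result.extend(local_join(cust_parts[node], order_parts[node]))
    return result

customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(10, 1, 1500.0), (11, 2, 200.0), (12, 3, 2500.0), (13, 1, 3000.0)]

big_orders = [o for o in orders if o[2] > 1000]            # step 1: the MR filter
names = sorted(parallel_join(customers, big_orders, n_nodes=2))
```

Because matching rows share a join key and therefore a node, each node can join its slice independently and the union equals the single-node result.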

The Hadoop Ecosystem. Management & Monitoring (Ambari). Coordination (ZooKeeper). Workflow & Scheduling (Oozie). Scripting (Pig). Machine Learning (Mahout). Query (Hive). Distributed Processing (MapReduce). Distributed Storage (HDFS). NoSQL Database (HBase). Data Integration (Sqoop/REST/ODBC).

SQL Server PDW with PolyBase. PolyBase = SQL Server PDW V2 querying HDFS/Azure data, in-situ. Standard T-SQL query language; eliminates the need for writing MapReduce jobs. Leverages PDW’s parallel query execution framework. Data moves in parallel directly between Hadoop’s Data Nodes and PDW’s compute nodes. Exploits PDW’s parallel query optimizer to selectively push computations on HDFS data as MapReduce jobs. All of this is now part of SQL Server 16. [Diagram: SQL queries go to SQL Server PDW V2 with PolyBase, which reads HDFS and its own DB and returns results.]

PDW Software. Control Node: runs the PDW Engine Service, DMS and SQL Server. Client connections (JDBC, OLEDB, ODBC, ADO.NET) and user queries arrive here, and the PDW Engine Service will: parse SQL; validate and authorize; optimize and build the query plan; execute the parallel query; return results to the client. Data Movement Service (DMS): a separate process on each node; shuffles intermediate tables among compute nodes during query execution. Compute Nodes: each runs DMS and SQL Server. Notes: SQL Server PDW comes in a number of hardware configurations, with the necessary software pre-installed and ready to use. It has a control node that manages a number of compute nodes (see Figure 1). The control node provides the external interface to the appliance, and query requests flow through it. The control node is responsible for query parsing, creating a distributed execution plan, issuing plan steps to the compute nodes, tracking the execution steps of the plan, and assembling the individual pieces of the final results into the single result set that is returned to the user. Compute nodes provide the data storage and the query processing backbone of the appliance. The control and compute nodes each have a single instance of SQL Server RDBMS running on them. User data is stored in tables that are hash-partitioned or replicated across the SQL Server instances on the compute nodes. To execute a query, the control node transforms the user query into a distributed execution plan (called a DSQL plan) that consists of a sequence of operations (called DSQL operations). At a high level, every DSQL plan is composed of two types of operations: (1) SQL operations, which are SQL statements to be executed against the underlying compute nodes’ DBMS instances, and (2) DMS operations, which are operations to transfer data between DBMS instances on different nodes.

PDW SUMMARY. Scalable SQL Server DW offering. Highly competitive performance and cost. Currently only available in appliance form factor, but soon available as SQL DW Service in Azure. PolyBase for SQL 16 reuses the Engine Service and DMS components. KEY COMPONENTS. Multiple compute nodes: each with a SQL Server instance + DMS; execute SQL queries in parallel. One control node: Engine Service + DMS; compiles and controls execution of SQL queries in parallel.

Step 3: Install one or more SQL Server instances. The SQL installer will allow the DBA to specify the desire to use the PolyBase feature. If selected, the installer will install the PolyBase DLLs (DMS and Engine Service) on each machine and register them as Windows Services. [Diagram labels: Step 2; Step 3: Scale Out; SQL 16 instances with PolyBase DLLs.]

Typical PolyBase Setup. PDW appliance: a Control Node (Engine Service, SQL Server, DMS, DB) and Compute Nodes. Hadoop cluster (Namenode (HDFS), DataNodes, file system): Hortonworks or Cloudera Hadoop distributions; can be either a Windows cluster or a Linux cluster; can be either on premises or in the cloud. Many popular HDFS file formats supported: text format, RCFile, ORCFile, Parquet (future), …

Key Components Reused for SQL Server 16. Engine Service: used to parse, optimize, & orchestrate parallel execution of queries over relational tables and HDFS data. DMS: used to move data between compute nodes and Hadoop data nodes. HDFS Bridge in DMS: used to read/write files/directories in HDFS or Azure. MapReduce job pushdown: used by the Engine Service to push computation to the Hadoop cluster. External table construct: used to “surface” records in external tables in HDFS or Azure files to SQL.