
1 PolyBase in SQL Server 16 David J. DeWitt Rimma V. Nehme
Microsoft Jim Gray Systems Lab

2 A Little Bit of PolyBase History…
Past (2012-2013): PolyBase ships in SQL Server PDW V2. Today (2015): PolyBase is now part of SQL 16 (CTP3). Future: PolyBase for IoT???

3 Our Plan for Today… PolyBase: what? why? how? what's next?

4 WHAT is PolyBase?

5 PolyBase Big Picture
RDBMS + Hadoop: PolyBase provides a scalable, T-SQL compatible query processing framework for combining data from both worlds.

6 More Specifically
Allow SQL 16 customers to execute T-SQL queries against relational data in SQL Server and "semistructured" data in HDFS and/or Azure. A SELECT submitted to SQL Server 16 can return results drawn from Windows Azure Blob Storage (WASB) and Hadoop (non-relational data).

7 PolyBase for SQL Server 16: What Does It Mean?
Full T-SQL interface for HDFS-resident data. Clusters of SQL Server 16 instances can be used to process data residing in HDFS in parallel. Queries can arbitrarily mix relational data in SQL Server with data in HDFS or Azure. All standard BI tools supported.

8 PolyBase: what? why? how? what's next?

9 Why All The Interest in Big Data?
Increased number and variety of data sources that generate large quantities of data: sensors (e.g. location, speed, acceleration rates, acoustical, …), Web 2.0 (e.g. twitter, wikis, …), web clicks. Realization that data is "too valuable" to delete. Dramatic decline in the cost of hardware, especially storage; if storage were still $100/GB there would be no big data revolution underway.

10 Hadoop Community World View
HDFS (Hadoop): a scalable, highly fault-tolerant distributed file system and the underpinnings of the entire Hadoop ecosystem; append only, no updates! "Big Data goes HERE." For insight, the rest of the world initially used MapReduce; recently, SQL-like parallel SQL systems for data warehousing on HDFS (Hive, Impala, HAWQ, Spark/Shark).

11 PolyBase's World View
Two universes of data: structured relational data, and semi-structured data in HDFS for "big data." HDFS: "Can I join you?" Relational: "You sure CAN!" PolyBase goal: provide a scalable, T-SQL compatible query processing framework for combining data from both universes.

12 PolyBase Use Cases
1. Hadoop as an ETL tool, e.g., cleansing data before loading it. 2. HDFS as 'cold', but query-able storage. 3. Sentiment analysis, e.g., joining relational tables w/ streams of tweets. 4. Sensor data analysis, e.g., mine sensor data for predictive analytics.

13 Example #1: Progressive Usage-Based Insurance
Non-relational sensor data from cars (kept in Hadoop) combined with structured customer data (kept in SQL Server PDW/APS) to price policies based on driver behavior.

14 Example #2: ShinSeGae (Korea's Amazon)
Non-relational social media data (kept in Hadoop) combined with structured product data (kept in SQL Server PDW/APS) for basket analysis of online shoppers based on social media behavior. All standard BI tools (e.g. Excel, Tableau, PowerView) just work!

15 Example #3: Wind Turbine Manufacturer
Turbine monitoring: analyzing sensor data from wind turbines deployed worldwide (kept in Hadoop) combined with relational turbine data (kept in SQL Server). Ability to do change detection, proactive maintenance and reporting.

16 PolyBase: what? why? how? what's next?

17 PolyBase for SQL 16 Deployment
Step 1: Set up a Hadoop cluster (if you don't have one already). Hortonworks or Cloudera distributions, Hadoop 2.0 or above, Linux or Windows, on premise or in Azure.

18 Or An Azure Storage Account
Azure Storage Blobs (ASB) exposes an HDFS layer. PolyBase reads and writes from ASB using Hadoop RecordReaders/Writers. No compute push-down support for ASB.

19 Step 2: Install One or More SQL Server Instances
Select PolyBase feature. Specify PolyBase service accounts (requires domain account for scale-out) PolyBase DLLs (Engine and DMS) are installed and registered as Windows Services. Pre-requisite: download and install JRE

20 Step 3: Instantiate a PolyBase Scale-Out Group
1. One machine is designated the Head Node by starting the Engine & DMS services there. 2. Zero or more other machines become Compute Nodes via a stored procedure (shutdown and restart needed).
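The slide does not reproduce the statement; in released SQL Server 2016 builds the stored procedure is sp_polybase_join_group, used roughly as below (a sketch; the host name and instance name are placeholders, and 16450 is the default DMS control channel port):
-- Run on each machine that should become a compute node.
-- HEADNODE01 and MSSQLSERVER are placeholders for your environment.
EXEC sp_polybase_join_group N'HEADNODE01', 16450, N'MSSQLSERVER';
-- Then restart the PolyBase Engine and DMS services
-- (the shutdown and restart the slide mentions).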

21 After Step 3: A PolyBase Computation Scale-Out Group
The head node is the SQL Server instance to which queries are submitted. Compute nodes are used for scale-out query processing on external data.

22 In Case You Were Wondering…
The Head Node is actually also always a Compute Node (it runs SQL 16, the Engine Service, and DMS).

23 Step 4: Choose Hadoop Flavor
Different configuration values map to the various Hadoop flavors; for example, value 4 stands for HDP 2.0 on Linux or ASB, value 5 for HDP 2.0 on Windows, and value 6 for CDH 5.1 on Linux. Supported Hadoop distributions in CTP3: Cloudera CDH on Linux; Hortonworks on Linux; Hortonworks 2.0, 2.2, and 2.3 on Windows Server. What happens under the covers? The right client jars to connect to Hadoop are loaded.
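The statement itself is elided on the slide; the flavor is set with sp_configure 'hadoop connectivity', roughly as follows (value 5, per the mapping above, would select HDP 2.0 on Windows):
-- Choose the Hadoop flavor; pick the value matching your cluster.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 5;
RECONFIGURE;
-- A restart of SQL Server and the PolyBase services is required
-- before the setting takes effect.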

25 Step 5: Attach a Hadoop Cluster
CREATE EXTERNAL DATA SOURCE GSL_HDFS_CLUSTER WITH (TYPE = HADOOP, …);
CREATE EXTERNAL DATA SOURCE AV_DFS_CLUSTER WITH (TYPE = HADOOP, …);
The attached Hadoop clusters must both be either Cloudera or Hortonworks (for now).

25 After Setup…
T-SQL queries are submitted to the head node. Queries can only refer to tables on the head node and/or external tables; tables on the compute nodes cannot be referenced by queries submitted to the head node. Compute nodes are used for scale-out query processing on external tables in HDFS/Azure. If failover clustering is enabled, queries can be submitted to any compute node. Hadoop clusters can be shared between different SQL 16 compute groups.

26 PDW, SQL DW and SQL 16: Key Differences
          Scale-out Storage         Scale-out QP              Language Surface    Delivery Vehicle
SQL PDW   Relational, HDFS, Azure   Relational, HDFS, Azure   T-SQL subset        Appliance
SQL DW    Relational, HDFS, Azure   Relational, HDFS, Azure   T-SQL subset        Cloud (PaaS)
SQL 16    HDFS, Azure               HDFS, Azure               Full T-SQL (RTM)    "Box" & Cloud (IaaS)

27 Key Technical Challenges
Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, …). Parallelizing data transfers between compute nodes and HDFS data nodes. Imposing structure on unstructured data in HDFS, using the external table concept. Exploiting the computational resources of Hadoop clusters.

28 HDFS Bridge in PolyBase
On each compute node, DMS is augmented with an HDFS Bridge. It hides the complexity of HDFS, uses Hadoop "RecordReaders/Writers" so standard HDFS file types can be read and written, and is used to transfer data in parallel to & from the Hadoop cluster.

29 Key Technical Challenges
Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, …). Parallelizing data transfers between compute nodes and HDFS data nodes. Imposing structure on unstructured data in HDFS, using the external table concept. Exploiting the computational resources of Hadoop clusters.

30 Data Moves Between Clusters in Parallel
The DMS instances on the SQL compute nodes transfer data directly to and from the HDFS DataNodes, in parallel; the head node's Engine Service coordinates, and the HDFS Namenode supplies the file-to-block mapping.

31 Key Technical Challenges
Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, …). Parallelizing data transfers between compute nodes and HDFS data nodes. Imposing structure on unstructured data in HDFS, using the external table concept. Exploiting the computational resources of Hadoop clusters.

32 Creating External Tables
Once per Hadoop cluster:
CREATE EXTERNAL DATA SOURCE HadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://<namenode>:8020',              -- IP elided in the original
    RESOURCE_MANAGER_LOCATION = '<resource-manager>:8050');
Once per file format:
CREATE EXTERNAL FILE FORMAT TextFile WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));
Once per HDFS file path:
CREATE EXTERNAL TABLE [dbo].[Customer] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [Speed] float NOT NULL)
WITH (LOCATION = '//Sensor_Data//May2014/sensordata.tbl',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = TextFile);
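Once defined, the external table is queried like any other table; a hypothetical usage sketch against the dbo.Customer definition above (the predicate value is illustrative):
-- No Hadoop-specific syntax needed at query time.
SELECT TOP 10 CustomerKey, Speed
FROM dbo.Customer            -- external table over HDFS
WHERE Speed > 65.0
ORDER BY Speed DESC;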

33 Creating External Tables (Secure Hadoop)
Once per Hadoop user:
CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
    WITH IDENTITY = 'hadoopUserName', SECRET = 'hadoopPassword';
CREATE EXTERNAL DATA SOURCE HadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://<namenode>:8020',              -- IP elided in the original
    RESOURCE_MANAGER_LOCATION = '<resource-manager>:8050',
    CREDENTIAL = HadoopCredential);
Once per file format:
CREATE EXTERNAL FILE FORMAT TextFile WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));
Once per HDFS file path:
CREATE EXTERNAL TABLE [dbo].[Customer] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [Speed] float NOT NULL)
WITH (LOCATION = '//Sensor_Data//May2014/sensordata.tbl',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = TextFile);

34 PolyBase Query Example #1
Selection on an external table in HDFS:
select * from Customer where c_nationkey = 3 and c_acctbal < 0
A possible execution plan: 1) CREATE temp table T on the compute nodes. 2) IMPORT FROM HDFS: the HDFS Customer file is read into T. 3) EXECUTE QUERY: Select * from T where T.c_nationkey = 3 and T.c_acctbal < 0.

35 Key Technical Challenges
Supporting arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, …). Parallelizing data transfers between compute nodes and HDFS data nodes. Imposing structure on unstructured data in HDFS, using the external table concept. Exploiting the computational resources of Hadoop clusters.

36 PolyBase Query Execution
The query is parsed and "external tables" stored on HDFS are identified. Parallel query optimization is performed; statistics on HDFS tables are used in the standard fashion. The query plan generator walks the optimized query plan, converting subtrees whose inputs are all HDFS files into a sequence of MapReduce jobs. The Engine Service submits the MapReduce jobs (as a JAR file) to the Hadoop cluster, leveraging its computational capabilities.
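Because statistics on HDFS tables are used in the standard fashion, it helps to create them explicitly; SQL Server 2016 supports CREATE STATISTICS on external tables. A sketch, reusing a column from the earlier dbo.Customer example (external-table statistics are built by scanning the HDFS data):
-- Statistics inform the cost-based pushdown decision.
CREATE STATISTICS StatsCustomerSpeed
ON dbo.Customer (Speed) WITH FULLSCAN;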

37 Big Picture Takeaway
SQL operations on HDFS data are pushed by PDW or SQL 16 into Hadoop as MapReduce jobs, which read the HDFS blocks; a cost-based decision determines how much computation to push, and the results flow back to the SQL query.

38 PolyBase Query Example #2
Selection and aggregate on an external table in HDFS:
select avg(c_acctbal) from Customer where c_acctbal < 0 group by c_nationkey
What really happens here? Step 1: the query optimizer compiles the predicate into Java and generates a MapReduce (MR) job. Step 2: the query engine submits the MR job to the Hadoop cluster, which applies the filter and computes the aggregate on Customer. The output is left in hdfsTemp, e.g. <US, $…>, <FRA, $…>, <UK, $-63.52>.

39 PolyBase Query Example #2 (continued)
Execution plan: 1) Run the MR job on Hadoop: apply the filter and compute the aggregate on Customer; output left in hdfsTemp. 2) CREATE temp table T on the compute nodes. 3) IMPORT hdfsTemp: read hdfsTemp into T. 4) RETURN OPERATION: Select * from T. The predicate and aggregate are pushed into the Hadoop cluster as a MapReduce job; the query optimizer makes a cost-based decision on which operators to push.

40 PolyBase
Simplicity: query data in Hadoop and/or data in SQL Server via standard T-SQL. Highest possible performance: parallelized data transfers between SQL compute nodes and Hadoop clusters; push-down of SQL operations to Hadoop. Open: supports the most popular Hadoop distributions on both Linux and Windows. Full integration with Microsoft Office & BI: Excel's PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services.

41 Query Capabilities (1): Joining Relational and External Data
Querying external tables (here, external tables referring to data in two HDP Hadoop clusters); joining external tables with regular SQL tables; pushing compute for basic expressions and aggregates.
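A minimal sketch of such a join, assuming the dbo.Customer external table from the earlier slides and a hypothetical regular SQL table dbo.CustomerNames:
-- Join HDFS-resident data with a regular SQL table
-- (dbo.CustomerNames and its columns are hypothetical).
SELECT n.CustomerName, AVG(c.Speed) AS AvgSpeed
FROM dbo.CustomerNames AS n              -- regular SQL table
JOIN dbo.Customer AS c                   -- external table over HDFS
    ON n.CustomerKey = c.CustomerKey
GROUP BY n.CustomerName;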

42 Query Capabilities (2): Push-Down Computation
Compute can be pushed either at the data source level or on a per-query basis using query hints.
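The per-query hints are OPTION (FORCE EXTERNALPUSHDOWN) and OPTION (DISABLE EXTERNALPUSHDOWN); a sketch against the earlier dbo.Customer table:
-- Force the filter/aggregate to run in Hadoop as a MapReduce job
-- (requires RESOURCE_MANAGER_LOCATION on the data source).
SELECT CustomerKey, AVG(Speed) AS AvgSpeed
FROM dbo.Customer
WHERE Speed > 65.0
GROUP BY CustomerKey
OPTION (FORCE EXTERNALPUSHDOWN);
-- Conversely, OPTION (DISABLE EXTERNALPUSHDOWN) keeps all
-- computation in SQL Server.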

43 Query Capabilities (3): Multiple User IDs
Credential support for multiple user IDs associated with an external data source.
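Building on the CREATE DATABASE SCOPED CREDENTIAL syntax shown earlier, distinct user IDs are carried by distinct data sources over the same cluster; a sketch (credential names, identities, and secrets are placeholders):
-- Requires a database master key; identities/secrets are placeholders.
CREATE DATABASE SCOPED CREDENTIAL AnalystCred
    WITH IDENTITY = 'analystUser', SECRET = 'analystPassword';
CREATE DATABASE SCOPED CREDENTIAL EtlCred
    WITH IDENTITY = 'etlUser', SECRET = 'etlPassword';
-- Each external data source is bound to one credential.
CREATE EXTERNAL DATA SOURCE HadoopAsAnalyst
    WITH (TYPE = HADOOP, LOCATION = 'hdfs://<namenode>:8020',
          CREDENTIAL = AnalystCred);
CREATE EXTERNAL DATA SOURCE HadoopAsEtl
    WITH (TYPE = HADOOP, LOCATION = 'hdfs://<namenode>:8020',
          CREDENTIAL = EtlCred);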

44 Query Capabilities (4): Seamless BI Integration
External tables 'just' show up as regular tables in Excel (e.g. PowerQuery) and Tableau.

45 Parallel Import/Load Scenario
SELECT INTO creates a new SQL table from an external table referring to data in HDP Hadoop clusters: importing data from Hadoop for higher-speed access. 'ETL' type of processing is possible via T-SQL, as sketched below.
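A sketch of the pattern, reusing the dbo.Customer external table and a hypothetical target table name:
-- Materialize HDFS data as a regular SQL table for faster access
-- (dbo.Customer_Local is a hypothetical name).
SELECT *
INTO dbo.Customer_Local
FROM dbo.Customer            -- external table over HDFS
WHERE Speed > 65.0;          -- optionally filter/cleanse during the load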

46 Export Scenario: Data Aging into Hadoop
INSERT INTO an external table exports SQL data into Hadoop (as text files) for aging. 'ETL' type of processing is possible via T-SQL, as sketched below.
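A sketch of the aging pattern; in SQL Server 2016 exporting additionally requires the 'allow polybase export' configuration option (the table names below are hypothetical):
-- One-time: enable INSERT into external tables.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
-- Age cold rows out to Hadoop as text files.
INSERT INTO dbo.Orders_Cold          -- external table (text files in HDFS)
SELECT *
FROM dbo.Orders                      -- regular SQL table
WHERE OrderDate < '2014-01-01';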

47 Some Questions You Might Have…
Will AlwaysOn (HADRON) be supported? Can a compute node be used for other SQL workloads? Can a compute node share a machine with a Hadoop data node? What SQL Server editions will I need? Will the MapR Hadoop distribution be supported? Is there a limit on the number of compute nodes? Why is it not possible to support different Hadoop distros simultaneously?

48 PolyBase: what? why? how? what's next?

49 CTP3/RTM Limitations
1. Each SQL 16 instance can participate in a single PolyBase cluster (a limitation of the current DMS software). 2. No support for varchar(max) or unique column types. 3. No scale-out for joins between tables on the head node and external tables on HDFS ("local table scale-out").

50 A Very Short Demo
SQL 16 with PolyBase (4-node SQL scale-out group); 3-node Hortonworks 2.0 Linux cluster; car telemetry data in Hadoop (CarSensor_Data); customer data in SQL 16 (Insured_Customers).
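The demo query itself is not on the slide; a plausible sketch joining the two data sets (the join key, column names, and speed threshold are assumptions):
-- Join car telemetry in Hadoop with customer data in SQL 16.
SELECT c.FirstName, c.LastName, s.Speed
FROM dbo.Insured_Customers AS c      -- SQL 16 table
JOIN dbo.CarSensor_Data AS s         -- external table over HDFS
    ON c.CustomerKey = s.CustomerKey
WHERE s.Speed > 35.0;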

51 Wrapup: PolyBase in SQL Server 16
Simplicity: query data in Hadoop, Azure and SQL Server via standard T-SQL. Highest possible performance: parallelized data transfers between SQL 16 instances and Hadoop data nodes; push-down of SQL operations to Hadoop using MR jobs. Open: supports the most popular Hadoop distributions on both Linux and Windows. Full integration with Microsoft Office & BI: Excel's PowerPivot, PowerView, Tableau, Cognos, SQL Server Reporting & Analysis Services.

52 Acknowledgements
The entire PolyBase development team; Sahaj Saini for the demo and a few slides; Artin Avanes for his role as the PolyBase PM since inception.

53 THANK YOU

54 CTP3/RTM Query Processing Limitation
Select C.Name from Orders O, Customers C where C.id = O.CustId and O.price > $1000
Probable CTP2 query plan: 1) Use an MR job to find Orders > $1000. 2) Import the result into the SQL 16 instance on the Head Node. 3) Perform the join on the Head Node.

55 A Better Query Plan
Select C.Name from Customers C, Orders O where C.id = O.CustId and O.price > $1000
The CTP3 plan is likely to be: 1) Use an MR job to find Orders > $1000. 2) Import the result into the SQL 16 instances on the Compute Nodes. 3) Redistribute the Customers table from the Head Node to the Compute Nodes. 4) Perform the join in parallel on the compute nodes.

56 The Hadoop Ecosystem
Management & monitoring (Ambari); coordination (ZooKeeper); workflow & scheduling (Oozie); scripting (Pig); machine learning (Mahout); query (Hive); NoSQL database (HBase); data integration (Sqoop/REST/ODBC); distributed processing (MapReduce); distributed storage (HDFS).

57 SQL Server PDW with PolyBase
PolyBase = SQL Server PDW V2 querying HDFS/Azure data, in-situ. Standard T-SQL query language; eliminates the need for writing MapReduce jobs. Leverages PDW's parallel query execution framework. Data moves in parallel directly between Hadoop's Data Nodes and PDW's compute nodes. Exploits PDW's parallel query optimizer to selectively push computations on HDFS data as MapReduce jobs. All of this is now part of SQL Server 16.

58 PDW Software
Control Node: runs the PDW Engine Service, which accepts client connections (JDBC, OLEDB, ODBC, ADO.NET) and user queries; it parses the SQL, validates and authorizes it, optimizes and builds the query plan, executes the parallel query, and returns results to the client. Data Movement Service (DMS): a separate process on each node that shuffles intermediate tables among compute nodes during query execution. Each Compute Node runs SQL Server plus DMS.
SQL Server PDW comes in a number of hardware configurations, with the necessary software pre-installed and ready to use. It has a control node that manages a number of compute nodes. The control node provides the external interface to the appliance, and query requests flow through it. The control node is responsible for query parsing, creating a distributed execution plan, issuing plan steps to the compute nodes, tracking the execution steps of the plan, and assembling the individual pieces of the final results into the single result set that is returned to the user. Compute nodes provide the data storage and the query processing backbone of the appliance. The control and compute nodes each have a single instance of the SQL Server RDBMS running on them. User data is stored in tables that are hash-partitioned or replicated across the SQL Server instances on the compute nodes. To execute a query, the control node transforms the user query into a distributed execution plan (called a DSQL plan) that consists of a sequence of operations (called DSQL operations). At a high level, every DSQL plan is composed of two types of operations: (1) SQL operations, which are SQL statements to be executed against the underlying compute nodes' DBMS instances, and (2) DMS operations, which transfer data between DBMS instances on different nodes.

59 PDW Summary
Scalable SQL Server DW offering with highly competitive performance and cost. Currently only available in appliance form factor, but soon available as the SQL DW service in Azure. Key components: multiple compute nodes, each with a SQL Server instance + DMS, execute SQL queries in parallel; one control node (Engine Service + DMS) compiles and controls the execution of SQL queries in parallel. PolyBase for SQL 16 reuses the Engine Service and DMS components.

60 Step 2: Install One or More SQL Server Instances
The SQL installer allows the DBA to specify the desire to use the PolyBase feature. If selected, the installer installs the PolyBase DLLs (DMS and Engine Service) on each machine and registers them as Windows Services.

61 Typical PolyBase Setup
A PDW appliance (control node with Engine Service, compute nodes with SQL Server and DMS) paired with a Hadoop cluster. Hortonworks or Cloudera Hadoop distributions; many popular HDFS file formats supported (text, RC, ORC, and Parquet in the future); the Hadoop cluster can be either on premise or in the cloud, on a Windows or a Linux cluster.

62 Key Components Reused for SQL Server 16
HDFS Bridge in DMS: used to read/write files/directories in HDFS or Azure; DMS moves data between compute nodes and Hadoop data nodes. MapReduce job pushdown: used by the Engine Service to push computation to the Hadoop cluster. Engine Service: parses, optimizes, and orchestrates the parallel execution of queries over relational tables and HDFS data. External table construct: used to "surface" records in HDFS or Azure files to SQL.

