The Model Architecture with SQL and PolyBase

Presentation transcript:

The Model Architecture with SQL and PolyBase
Josh Fennessy, Principal, BlueGranite

[Diagram: the traditional model - Users consuming Reports and Results from Data Marts fed by the Data Warehouse]

[Diagram: the same model with a Data Lake added alongside the Data Warehouse and Data Marts]

[Diagram: the same model with PolyBase linking the Data Lake to the Data Warehouse and Data Marts]

PolyBase
Enables SQL statements to be executed against EXTERNAL data:
- Stored in HDFS in Hadoop
- Stored in Azure Blob Storage or Azure Data Lake Store
Available in SQL Server 2016+, Azure SQL Data Warehouse, and Analytics Platform System (APS)

Requirements
SQL Server (Enterprise Edition):
- Services installed (requires Java): the PolyBase Query Service and the PolyBase Data Movement Service (which can be scaled out!)
- PolyBase enabled on the instance
Azure SQL Data Warehouse:
- You already have it!
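A quick way to check and enable PolyBase on a SQL Server instance (a minimal sketch; the 'hadoop connectivity' value shown is only an illustration, pick the value that matches your Hadoop distribution or Azure storage per the documentation):

SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;   -- 1 = the PolyBase services are installed

EXEC sp_configure 'hadoop connectivity', 7;   -- example value only
RECONFIGURE;
-- SQL Server and the two PolyBase services must be restarted after changing this setting.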

Uses for PolyBase
- External Query
- Data Export
- Data Import
- ETL / Analysis
- Online Archive

PolyBase Builds the Bridge
- Just-in-time data integration across relational and non-relational data
- High-performance parallel architecture; fast, simple data loading
- Best of both worlds: uses computational power at the source for both relational data and non-relational files
- Opportunity for new types of analysis using existing analytical skills
- Familiar SQL semantics and behaviour; query with familiar tools (SSDT, SQLCMD, Power BI, ...)
- PolyBase = run-time integration

Loosely Coupled Architecture: Late Binding Consequences
- Data may change between executions
- Data may change during execution
- Errors are identified at run time
All "by design": this helps PolyBase keep its agnostic architecture.

PolyBase Enhancements
Recursive directory traversal:
- Retrieves the content of a folder and all of its subfolders
- Removes the burden of creating an external table for each subfolder
ORCFile support:
- Enables all PolyBase scenarios to run against the ORC file format

Recursive Folder Traversal
/Sales/
    /201501/
        20150101.txt
        20150102.txt
    /201502/
        _20150201.txt
        20150201.txt
    /.reject/
        .20150101.txt
        20150301.txt
PolyBase reads data recursively by default.
PolyBase ignores objects (and their children) prefixed by _ or . , so _20150201.txt and everything under /.reject/ are skipped.
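As a sketch of how this is used in practice, an external table whose LOCATION points at the /Sales/ folder picks up every readable file in the tree above; the column list and object names here are illustrative, and the data source and file format are created later in the deck:

CREATE EXTERNAL TABLE ext_Sales
(
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/Sales/',          -- a folder, so subfolders are traversed recursively
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFormat
);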

Table Creation for Hadoop: Staying Agnostic
- External Data Sources
- External Tables
- External File Formats

External Tables
- Are metadata that describe the schema of the external data
- Enable access to data held outside the MPP DWH database engine
- Never hold data themselves
- Do not delete the underlying data when dropped
- Behave, in the MPP DWH engine, very much like Hive external tables

External Table Considerations
- Data can be changed or removed at any time on the Hadoop side
- PolyBase does not guarantee any form of concurrency control or isolation level
- The same query may return different results if the data changes on the storage side between two runs
- A query may fail if data is removed or relocated
- The location of the data on the external cluster is validated every time a user selects from the table

External Tables: Catalog Views
- A logical table exists in the shell database on the control node
- sys.external_tables
- sys.tables
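A short sketch of using the catalog view to inspect the external tables that have been created:

SELECT t.name, t.location, t.reject_type, t.reject_value
FROM sys.external_tables AS t;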

External File Format: ORC Files
- The "O" stands for Optimized (Optimized Row Columnar)
- Better compression, better performance
- Features similar to SQL Server's in-memory columnstore: segment elimination, batch mode execution
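A minimal ORC file format definition, assuming Snappy compression is wanted (the format name is illustrative):

CREATE EXTERNAL FILE FORMAT OrcSnappyFormat
WITH (
    FORMAT_TYPE = ORC,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);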

External File Format: Limitations
- The row terminator is fixed as \n
- Encoding: UTF-8 or UTF-16
- Compression choices may be limited by the chosen format

Table-Level Statistics
When an external table is created, table-level statistics are also persisted:
- Row count
- Page count

Achieving Parallel Writes
- Exported files use a unique naming convention: {QueryID}_{YearMonthDay}_{HourMinutesSeconds}_{FileIndex}.txt
- The presence of the QueryID is also good for lineage
- The file index is zero-based, with a 1:1 relationship to distributions
- The file extension depends on the file format chosen (e.g. TXT, RCF)
- If the external table is dropped and the same CETAS is re-executed, the target folder will have doubled its contents!
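A minimal CETAS (CREATE EXTERNAL TABLE AS SELECT) sketch as used in Azure SQL Data Warehouse / APS; table, data source, and format names are illustrative. Each distribution writes its own file using the naming convention above:

CREATE EXTERNAL TABLE ext_FactSales
WITH (
    LOCATION = '/export/factsales/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFormat
)
AS
SELECT *
FROM dbo.FactSales;   -- re-running this after dropping the external table adds a second set of files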

Parallel Writes for a Distributed Table

Parallel Writes with Replicated Tables
- Not attainable in the current version
- Replicated tables are written to a single file
- Only one replica of the table is queried

Column-Level Statistics
- Provide the additional data the optimizer needs
- Crucial for cardinality estimation
- Enabled for external tables
- A manual operation
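Column-level statistics are created by hand; a minimal sketch against a hypothetical external table and column:

CREATE STATISTICS stat_ext_FactSales_OrderDateKey
ON ext_FactSales (OrderDateKey) WITH FULLSCAN;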

PolyBase: Create Credentials (Azure Storage)

CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL [AzureStorageUser]
WITH IDENTITY = 'user', SECRET = '<key>';

- Creating a database master key is necessary to encrypt credentials
- When connecting to Azure storage the IDENTITY value is not used, so any value will work here
- The SECRET is the Azure Storage account key that grants full access to the storage account; protect it as you would any administrator credential

PolyBase: Create Credentials (Azure Data Lake Store)

CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL [ADLUser]
WITH IDENTITY = '<client_id>@<OAuth_2.0_Token_EndPoint>', SECRET = '<key>';

NOTE: ADLS is not yet available in the government cloud.
- The database master key is needed to encrypt credentials and only has to be created once
- With ADLS, the IDENTITY property specifies the service principal that has been granted access to the Data Lake Store
- The SECRET is the key generated for the service principal; protect it as you would any administrator credential

PolyBase: EXTERNAL DATA SOURCE (Azure Storage)

CREATE EXTERNAL DATA SOURCE [AzureStorage]
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasb[s]://container@acctname.blob.core.usgovcloudapi.net',
    CREDENTIAL = AzureStorageUser
);

- TYPE = HADOOP is used for Azure Blob Storage as well as for Hadoop clusters
- LOCATION names the container and storage account; wasbs:// gives an encrypted connection
- CREDENTIAL references the database scoped credential created earlier

PolyBase: EXTERNAL DATA SOURCE (Azure Data Lake)

CREATE EXTERNAL DATA SOURCE [AzureDataLake]
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://acctname.azuredatalakestore.net',
    CREDENTIAL = ADLUser
);

- TYPE = HADOOP is also used for Azure Data Lake Store
- CREDENTIAL references the service principal credential created earlier

PolyBase: EXTERNAL TABLE

CREATE EXTERNAL TABLE [TableName]
(
    …
)
WITH (
    LOCATION = 'folder path',
    DATA_SOURCE = [DataSourceName],
    FILE_FORMAT = [FormatName],
    REJECT_TYPE = …,
    REJECT_VALUE = …
);

The column definition for an external table looks just like the one for a normal table.

PolyBase: EXTERNAL FILE FORMAT

CREATE EXTERNAL FILE FORMAT [FormatName]
WITH (
    FORMAT_TYPE = DELIMITEDTEXT | RCFILE | ORC | PARQUET,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        DATE_FORMAT = 'MM/dd/yyyy',
        STRING_DELIMITER = '"'
    ),
    DATA_COMPRESSION = '<Hadoop codec class name>'   -- e.g. GzipCodec, SnappyCodec, DefaultCodec
);

- FORMAT_OPTIONS applies to delimited text only
- The compression codecs available depend on the file format chosen

Create an ONLINE Archive

Online Archive
By moving data from expensive storage to a cheaper platform, archives can be left online and remain usable when needed.
PolyBase can write to Azure Blob Storage, Azure Data Lake Store, or an on-premises Hadoop cluster.
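A sketch of the export path on SQL Server 2016+: enable PolyBase export, then INSERT the cold rows into an external table that points at the cheaper storage (object names and the date cut-off are illustrative):

EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

INSERT INTO ext_SalesArchive
SELECT *
FROM dbo.Sales
WHERE OrderDate < '20150101';   -- cold rows land in blob storage / ADLS / Hadoop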

ACCESS an ONLINE Archive

Online Archive
Accessing data in an online archive is done by querying an external table.
When a query is executed, data is moved from the archive into tempdb on the SQL Server instance, where the query execution plan runs to return the results.
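Once archived, the data is queried like any other table; a simple illustrative example against the hypothetical archive table above:

SELECT YEAR(OrderDate) AS OrderYear, SUM(SalesAmount) AS TotalSales
FROM ext_SalesArchive
GROUP BY YEAR(OrderDate);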

QUERY DATA IN HADOOP

Query Data in Hadoop
Setting up a connection to data in Hadoop is similar to working with blob storage.
When a query is executed, there are two possible paths:
- If pushdown is enabled, a MapReduce job may be run on the Hadoop cluster to satisfy part of the query
- The Data Movement Service (DMS) moves a portion of the data set, or all of it, into tempdb for further processing
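The pushdown decision can be influenced per query with the documented hints; a sketch against a hypothetical external table over Hadoop:

SELECT CustomerKey, SUM(Quantity) AS TotalQty
FROM ext_HadoopSales
WHERE OrderYear = 2016
GROUP BY CustomerKey
OPTION (FORCE EXTERNALPUSHDOWN);   -- or OPTION (DISABLE EXTERNALPUSHDOWN) to keep all work in SQL Server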

POLYBASE AS ETL TOOL

Guidelines for ETL
When using PolyBase as a data import tool:
- Ensure your data is split into as many files as you have readers available in your PolyBase scale-out group
- Avoid doing transformations during the import process; the best performance comes from importing the data straight into SQL Server
When connected to a Hadoop cluster:
- Experiment with executing jobs with and without pushdown
- Allowing Hadoop to perform filtering steps often results in better performance, since less data is transferred to SQL Server
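A sketch of the "straight into SQL Server" import pattern, with no transformations applied until the data has landed (table names are illustrative):

SELECT *
INTO dbo.StagingSales          -- land the raw data first
FROM ext_SalesFiles;
-- Transform from dbo.StagingSales afterwards, inside SQL Server.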

Recap
- PolyBase is a SQL Server service; Enterprise Edition is required for the head (master) node, Standard Edition is enough for workers
- PolyBase can read from and/or write to Azure Blob Storage, Azure Data Lake Store, or an on-premises Hadoop cluster

Recap
- PolyBase can push MapReduce jobs down to Hadoop; it does not use the Hive metastore
- It can be a great way to let analysts with only SQL experience consume data in Hadoop
- SQL Server security models can be applied to external tables, so access to the Hadoop data can still be controlled
- PolyBase can also use integrated security against Hadoop with Kerberos
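Because external tables live in the database catalog, ordinary SQL permissions apply to them; an illustrative example with hypothetical names:

GRANT SELECT ON OBJECT::dbo.ext_SalesArchive TO DataAnalystRole;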

Recap
- PolyBase is not optimized for interactive query workloads; using it as a distributed data movement tool is a better fit
- For archive data that needs to be queried often, such as during an audit, move the data into SQL Server for the duration of the heavy query usage