1 The Model Architecture with SQL and Polybase
Josh Fennessy, Principal, BlueGranite

2 [Architecture diagram: Users consume Reports and Results from Data Marts fed by a Data Warehouse]

3 [Architecture diagram: the same flow, with a Data Lake added alongside the Data Warehouse and Data Marts]

4 [Architecture diagram: the same flow, with Polybase connecting the Data Lake to the Data Marts]

5 Polybase
Enables SQL statements to be executed against EXTERNAL data:
Stored in HDFS in Hadoop
Stored in Azure Blob Storage or Azure Data Lake Store
Available in SQL Server 2016+, Azure SQL Data Warehouse, and Analytics Platform System (APS)

6 Requirements
SQL Server (Enterprise Edition): the Polybase Query Service and Polybase Data Movement Service must be installed (requires Java); the Data Movement Service can be scaled out! The feature must also be enabled on the instance, as in the sketch below.
Azure SQL Data Warehouse: you already have it!
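A minimal sketch of checking and enabling the feature on a SQL Server instance; the connectivity value 7 is an assumption and depends on the Hadoop or Azure target in use:

SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;  -- 1 = the PolyBase services are installed

EXEC sp_configure 'hadoop connectivity', 7;  -- choose the value that matches your Hadoop/Azure target
RECONFIGURE;
-- Restart SQL Server and the two PolyBase services after changing this setting.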

7 Uses for PolyBase
External Query: ETL / Analysis, Online Archive
Data Export
Data Import

8 PolyBase Builds The Bridge
Just-in-time data integration across relational and non-relational data
High-performance parallel architecture: fast, simple data loading
Best of both worlds: uses computational power at the source for both relational data and non-relational files; opportunity for new types of analysis
Uses existing analytical skills: familiar SQL semantics and behaviour; query with familiar tools (SSDT, SQLCMD, Power BI, …)
PolyBase = run-time integration

9 Loosely Coupled Architecture
Late-binding consequences: data may change between executions; data may change during execution; errors are identified at run time
All "by design": this helps PolyBase keep its agnostic architecture

10 PolyBase – Enhancements
PolyBase: recursive directory traversal – enables users to retrieve the contents of a folder and all subfolders; removes the burden of creating an external table for each subfolder
PolyBase: ORCFile support – enables all PolyBase scenarios to run against the ORCFile format

11 Recursive Folder Traversal
/Sales/
    /201501/    ….txt   ….txt
    /201502/    _….txt   ….txt
    /.reject/   ….txt   ….txt
PolyBase reads data recursively by default
PolyBase ignores objects (and their children) prefixed by _ or .

12 Table creation - Hadoop
Staying agnostic: External Data Sources, External Tables, External File Formats

13 External Tables
Are metadata that describe the schema of the external data
Enable data access outside the MPP DWH database engine
They never hold data
They do not delete data when dropped
The behaviour of an external table in the MPP DWH engine is very similar to Hive external tables

14 External Table Considerations
Data can be changed or removed at any time on the Hadoop side
PolyBase will not guarantee any form of concurrency control or isolation level
The same query may return different results if the data changes on the storage side between two runs
A query may fail if data gets removed or relocated
The location of data residing on an external cluster gets validated every time a user selects from it

15 External Tables – Catalog Views
Logical table in the shell database (control node): sys.external_tables, sys.tables

16 External File Format - ORCFiles
O = Optimised: better compression, better performance
Features similar to SQL Server in-memory: segment elimination, batch mode execution
A sketch of an ORC file format definition follows.
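As a hedged example, an ORC external file format might be declared like this; the format name and compression codec are illustrative choices, not part of the deck:

CREATE EXTERNAL FILE FORMAT [OrcSnappyFormat]
WITH (
    FORMAT_TYPE = ORC,
    -- Snappy trades a little compression ratio for faster reads and writes
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);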

17 External File Format - Limitations
The row terminator is fixed as \n
Encoding: UTF8 or UTF16
The compression choice may be limited by the format

18 Table Level Statistics
When an external table is created, table-level statistics are also persisted: row count and page count

19 Achieving Parallel Writes
Exported files use a unique naming convention: {QueryID}_{YearMonthDay}_{HourMinutesSeconds}_{FileIndex}.txt
Also good for lineage (the QID is present)
The file index is zero-based; the relationship is 1:1 with distributions
The file extension used depends on the file format chosen: TXT or RCF
If the external table is dropped and the same CETAS is re-executed, the target folder will have doubled its contents! (see the sketch below)
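A minimal CETAS sketch (the table, folder, data source, and file format names are hypothetical); each distribution writes its own file using the naming convention above:

CREATE EXTERNAL TABLE [dbo].[SalesArchive_ext]
WITH (
    LOCATION = '/archive/sales/',      -- target folder on the external store
    DATA_SOURCE = [AzureStorage],
    FILE_FORMAT = [TextFileFormat]
)
AS
SELECT *
FROM [dbo].[FactSales]
WHERE [OrderDateKey] < 20150101;       -- archive the older rows

Dropping [dbo].[SalesArchive_ext] removes only the metadata; re-running the same statement against the same LOCATION adds a second set of files next to the first.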

20 Parallel Writes for a Distributed Table

21 Parallel Writes with Replicated Tables
Not attainable in the current version
Replicated tables are written to a single file
Only one replicated table will be queried

22 Column Level Statistics
Provide the additional data we need; crucial for cardinality estimation
Enabled for external tables
A manual operation, as in the sketch below
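Column-level statistics are created with an explicit CREATE STATISTICS statement; the table and column names below are hypothetical:

-- Statistics on an external table are built by importing a copy of the data
CREATE STATISTICS [stat_SalesArchive_CustomerKey]
ON [dbo].[SalesArchive_ext] ([CustomerKey])
WITH FULLSCAN;  -- sampling support varies by version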

23 Polybase: Create Credentials (Azure Storage)
CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL [AzureStorageUser]
WITH IDENTITY = 'user', SECRET = '<key>';
Creating a database master key is necessary to encrypt credentials.
When connecting to Azure storage, the IDENTITY option isn't used; any value will work here.
The SECRET is the Azure Storage key that grants full access to the storage account. Protect it as you would any administrator credentials.

24 Polybase: Create Credentials (Azure Data Lake Store)
CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL [ADLUser]
WITH IDENTITY = '<client_id>@<OAuth_2.0_token_endpoint>', SECRET = '<key>';
NOTE: ADLS is not available in the gov cloud yet.
Creating a database master key is necessary to encrypt credentials; the master key only needs to be created once.
With ADLS, the IDENTITY property is used to specify the service principal that has been granted access to the Data Lake Store.
The SECRET is the secure key that has been generated for the service principal. Protect it as you would any administrator credentials.

25 Polybase: EXTERNAL DATA SOURCE (Azure Storage)
CREATE EXTERNAL DATA SOURCE [AzureStorage]
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
    CREDENTIAL = AzureStorageUser
);
The LOCATION is the wasb[s] URI of the blob container (the value shown is a placeholder); the CREDENTIAL is the database scoped credential created on the previous slide.

26 Polybase: EXTERNAL DATA SOURCE (Azure Data Lake)
CREATE EXTERNAL DATA SOURCE [AzureDataLake]
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl[s]://acctname.azuredatalakestore.net',
    CREDENTIAL = ADLUser
);
The CREDENTIAL references the database scoped credential created for the Data Lake Store service principal.

27 Polybase: EXTERNAL TABLE
CREATE EXTERNAL TABLE [TableName]
(
    /* column definitions */
)
WITH (
    LOCATION = 'folder path',
    DATA_SOURCE = [DataSourceName],
    FILE_FORMAT = [FormatName],
    REJECT_TYPE = ..., REJECT_VALUE = ...
);
The column definition for an external table is similar to creating a normal table; a fuller sketch follows.
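A fuller sketch with hypothetical columns and reject options; the data source and file format names stand in for the objects created on the surrounding slides:

CREATE EXTERNAL TABLE [dbo].[SalesExternal]
(
    [OrderID]     INT,
    [CustomerKey] INT,
    [OrderDate]   DATE,
    [Amount]      DECIMAL(10, 2)
)
WITH (
    LOCATION = '/Sales/',              -- folder; subfolders are read recursively
    DATA_SOURCE = [AzureStorage],
    FILE_FORMAT = [TextFileFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 100                 -- tolerate up to 100 malformed rows
);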

28 Polybase: EXTERNAL FILE FORMAT
CREATE EXTERNAL FILE FORMAT [FormatName]
WITH (
    FORMAT_TYPE = DELIMITEDTEXT | RCFILE | ORC | PARQUET,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        DATE_FORMAT = 'MM/dd/yyyy',
        STRING_DELIMITER = '"'
    ),
    DATA_COMPRESSION = GZIP | SNAPPY | DEFAULT
);
FORMAT_OPTIONS apply only to delimited text; the available DATA_COMPRESSION codecs depend on the format type chosen.

29 Create an ONLINE Archive

30 Online Archive
By moving data from expensive storage to a cheaper platform, archives can be left online and usable when needed.
Polybase can write to Azure Storage Blobs, Azure Data Lake Store, or an on-premises Hadoop cluster.

31 ACCESS an ONLINE Archive

32 Online Archive
Accessing data in an online archive is done by querying an external table.
When a query is executed, data is moved from the archive into tempdb on the SQL Server, where the query execution plan runs to return the results. A sketch of such a query follows.
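For example, archived rows can be joined back to local dimension tables with ordinary T-SQL (all object names here are hypothetical):

SELECT c.[CustomerName],
       SUM(a.[Amount]) AS [TotalSales]
FROM [dbo].[SalesArchive_ext] AS a     -- external (archived) data
JOIN [dbo].[DimCustomer] AS c          -- local relational table
    ON a.[CustomerKey] = c.[CustomerKey]
GROUP BY c.[CustomerName];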

33 QUERY DATA IN HADOOP

34 Online Archive
Setting up a connection to data in Hadoop is similar to working with blob storage.
When a query is executed, there are two paths that can be taken:
If PUSHDOWN is enabled, a MapReduce job may be run on the Hadoop cluster to satisfy part of the query.
The DMS will move a portion of, or the entire, data set to tempdb for further processing.
Pushdown can also be controlled per query, as in the sketch below.
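A sketch of the per-query hint; the table and predicate are illustrative:

SELECT [OrderID], [Amount]
FROM [dbo].[SalesExternal]
WHERE [OrderDate] >= '2017-01-01'
OPTION (FORCE EXTERNALPUSHDOWN);       -- or OPTION (DISABLE EXTERNALPUSHDOWN)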

35 POLYBASE AS ETL TOOL

36 Guidelines for ETL
If using Polybase as a data import tool:
Ensure your data is split into as many files as you have readers available in your Polybase scale-out cluster.
Avoid doing any transformations during the import process; the best performance is found by importing the data straight into SQL (see the sketch below).
When connected to a Hadoop cluster:
Experiment with executing jobs with and without pushdown. Often, allowing Hadoop to perform filtering steps will result in better performance, since less data is transferred to SQL.
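A minimal import sketch following those guidelines: land the raw rows straight into a SQL staging table and transform afterwards (all object names are hypothetical):

-- Parallel readers pull the external files directly into a staging table
SELECT *
INTO [stage].[Sales_Raw]
FROM [dbo].[SalesExternal];

-- Transform inside SQL Server afterwards, e.g.
-- INSERT INTO [dbo].[FactSales] SELECT ... FROM [stage].[Sales_Raw];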

37 Recap
Polybase is a SQL Server service. Enterprise Edition is required for the master (head) node; Standard Edition is sufficient for the workers in a scale-out group.
Polybase can read and/or write to Azure Storage Blobs, Azure Data Lake Store, or an on-premises Hadoop cluster.

38 Recap
Polybase can push down MapReduce jobs to Hadoop. It does not use the Hive metastore.
It can be a great way to allow analysts with only SQL experience to consume data in Hadoop.
SQL security models can be applied to external tables, which means that access to Hadoop data can still be controlled. Polybase can also use integrated security to Hadoop with Kerberos.

39 Recap
Polybase is not optimized for interactive query workloads; using it as a distributed data movement tool is a better choice.
For archive data that needs to be queried often, such as during an audit, moving the data into SQL Server for the duration of the heavy query usage is recommended.

