Microsoft Analytics Platform System
05 – HDI Region Software, Tools & PolyBase
Brian Walker | Microsoft Architect – Data Insights COE
Jesse Fountain | Microsoft WW TSP Lead
September 20, 2018
Agenda
Region Overview
High Availability
Tooling
Data Loading
Hive
PolyBase
Region Overview
Hadoop Region & HDInsight
HDInsight = Microsoft branding of the Hortonworks distribution (HDP)
Hadoop region based on HDP 2.0
Basic authentication
High availability built into all nodes
Supported Projects
Hadoop Core (HDFS & MapReduce)
Oozie
Hive
Templeton
Pig
Sqoop
Hadoop Region Architecture
Nodes
Hadoop region – head node, secure gateway, management node, data node
Dependency nodes – PDW control node, Active Directory, Virtual Machine Manager
HDInsight APIs
WebHDFS – remote HDFS file system management
WebHCat – remote job submission and monitoring
Oozie – remote workflow submission and scheduling
HiveServer2 – ODBC connectivity to Hive
Hive ODBC Connector (Excel)
Setup is the same as for cloud HDInsight
Connect to the secure node cluster IP on port 443
Externally trusted certificate required
High Availability
Hadoop Region HA
Orchestration and passive host failover behave the same as for PDW
Data nodes are different: APS relies on Hadoop data replication for data availability
Disks are not mirrored and data nodes do not fail over
Replication factor is configurable:
# Scale Units | Replication Factor
= 1           | 2
> 1           | 3
HDI Name Node Failure
Head node HHN01 is marked as failed
HSN02 persists on HST04
The cluster fails over to HST04
HST04 is already “warm”, so failover is fast
Data Node Failure
A data node fails; data nodes do not fail over
Hadoop data replication ensures the data is available on other data nodes
The iSCSI VM does not fail over
Replication is relied upon for availability
HDInsight Tooling
Loading Data into HDFS
Loading Data From Flat Files
Developer Dashboard – designed for small files (<20 MB)
WebHDFS – designed for medium files and batch loading via a map job
Hive – designed for data already in HDFS
Loading Data From a Database
PolyBase – designed for hybrid integration
Sqoop – designed for database integration
If your source database is PDW, use PolyBase, not Sqoop
Loading Data with Hive
LOAD DATA – moves the data into the table
LOAD DATA LOCAL – copies the data into the table
INSERT OVERWRITE – overwrites existing data in the table or partition
INSERT INTO – appends to the table
Data must already exist inside HDFS to load data with Hive
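A minimal HiveQL sketch of the four options; the sales table, the staging paths, and the sales_staging table are hypothetical names, not from the deck:

-- Move a file that is already in HDFS into the table (the file is relocated)
LOAD DATA INPATH '/data/staging/sales.txt' INTO TABLE sales;
-- Copy a file from the local file system into the table
LOAD DATA LOCAL INPATH '/tmp/sales.txt' INTO TABLE sales;
-- Overwrite the existing contents of the table (or a partition)
INSERT OVERWRITE TABLE sales SELECT * FROM sales_staging;
-- Append rows to the table
INSERT INTO TABLE sales SELECT * FROM sales_staging;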
Working with Hive
Creating a Database in Hive
CREATE DATABASE [IF NOT EXISTS] db_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
Creating a Database in Hive
Only CREATE DATABASE is required
If you specify a location, you must have created the location first
If you don’t specify a location, Hadoop creates the database in the /hive/ folder
You can create properties, but you cannot remove them
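A hedged example pulling these rules together; the database name, location, and property are illustrative only:

CREATE DATABASE IF NOT EXISTS sales_db
COMMENT 'Sales data landing area'
LOCATION '/data/sales_db'               -- the folder must already exist in HDFS
WITH DBPROPERTIES ('owner' = 'etl');    -- properties can be added, never removed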
Creating a Hive Table
CREATE TABLE – data types, sorting/bucketing, comments (table & column), row format, partitions, storage format, clustering
CREATE EXTERNAL TABLE – specify an alternate location in HDFS; when dropped, only the definition is dropped, not the data
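A HiveQL sketch combining several of these options into one external table; the table name, columns, delimiter, and location are assumptions, not from the deck:

CREATE EXTERNAL TABLE IF NOT EXISTS weblogs
( ip_address STRING COMMENT 'Client IP'
, bytes_sent INT
)
COMMENT 'Raw web logs'
PARTITIONED BY (log_date STRING)
CLUSTERED BY (ip_address) SORTED BY (bytes_sent) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/weblogs';   -- dropping the table leaves this data in place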
Creating Hive Tables
CREATE TABLE AS SELECT – defines and populates the table with a SELECT; cannot be partitioned; cannot be an external table
CREATE TABLE LIKE – copies the table definition without copying the data; can create a table based on a view definition
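Minimal sketches of both forms, assuming a hypothetical customers table:

-- Define and populate in one step (cannot be partitioned or external)
CREATE TABLE big_spenders AS
SELECT * FROM customers WHERE account_balance > 100000;

-- Copy only the definition; no data is copied
CREATE TABLE customers_archive LIKE customers;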
Querying Data
HQL looks and behaves like SQL, but it is not SQL
Hive does not offer the guarantees of an RDBMS
No UPDATE or DELETE; whole partitions can be rewritten instead
Hive queries data inside HDFS only
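Because individual rows cannot be updated or deleted, a whole partition is rewritten instead; a sketch assuming the hypothetical partitioned weblogs table above and a weblogs_staging source:

INSERT OVERWRITE TABLE weblogs PARTITION (log_date = '2014-01-15')
SELECT ip_address, bytes_sent
FROM weblogs_staging
WHERE log_date = '2014-01-15';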
What is Polybase?
PolyBase unites unstructured data, structured data, and business data …for a better-together world of analytics
Agnostic architecture
PolyBase is agnostic = no vendor lock-in
PolyBase supports Hadoop on Linux & Windows
PolyBase integrates with the cloud
PolyBase supports HDInsight in APS & external Hadoop clusters
What’s the sweet spot for PolyBase?
Consumer – medium-to-low data volume; very high degree of structure; good for PolyBase: highly structured data, fast interactive response
Analyst – reasonable data volume; some structure; medium number of users; medium-to-high transformation complexity; great for PolyBase: structured data, iterative query response, hybrid queries across sources
Scientist – high-to-huge data volume; low-to-no structure; low number of users; high transformation complexity; partial fit for PolyBase today: structure possibly absent on the data, but a good option for data delivery & transform
PolyBase builds the bridge
Just-in-time data integration across relational and non-relational data
High-performance parallel architecture; fast, simple data loading
Best of both worlds: uses computational power at the source for both relational data & Hadoop
Opportunity for new types of analysis
Uses existing analytical skills: familiar SQL semantics & behaviour
Query with familiar tools: SSDT, Power BI
PolyBase = run-time integration
So what is PolyBase?
Answer: a component of the PDW region in APS – a highly parallelised, distributed query engine accessing heterogeneous data via SQL
Answer: unique, innovative technology
Answer: seamless integration
Any data in any format
Deployment choices
Hortonworks HDP on Windows (external)
Hortonworks HDP on Linux (external)
Cloudera CDH on Linux (external)
HDInsight on APS (internal)
HDInsight on WASB (external)
External tables
Metadata used to describe external data
Enable data access outside the PDW region
Never hold data
Do not delete data when dropped
Create external table
CREATE EXTERNAL TABLE [dbo].[Sales]
( [ProductKey] int NOT NULL
, [StoreKey] int NOT NULL
, [DateKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [OrderQuantity] int NOT NULL
, [UnitPrice] money NOT NULL
, [SalesAmount] money NOT NULL
)
External tables
WITH ( LOCATION = 'hdfs://filepath_or_directory'
     , DATA_SOURCE = MyDataSourceName
     , FILE_FORMAT = MyFileFormatName
     , REJECT_TYPE = VALUE
     , REJECT_VALUE = 0
     , REJECT_SAMPLE_VALUE = 1000
     );
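The data source and file format referenced above are created separately; a sketch using PolyBase DDL as it appears in later releases (exact options vary by appliance update), with the name node address and the pipe delimiter as placeholder assumptions:

CREATE EXTERNAL DATA SOURCE MyDataSourceName
WITH ( TYPE = HADOOP
     , LOCATION = 'hdfs://10.10.10.10:8020'   -- Hadoop name node, placeholder address
     );

CREATE EXTERNAL FILE FORMAT MyFileFormatName
WITH ( FORMAT_TYPE = DELIMITEDTEXT
     , FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
     );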
Parallel data transfer
Parallel transfer concepts
Maximize throughput: every compute node in PDW sees every data node in Hadoop; direct connections are established between all scale-out nodes of PDW & Hadoop
Balanced execution: ensure all nodes are equally busy when reading and writing data
Maximizing throughput
[Diagram: control node CTL01 and compute nodes CMP01–CMP06 connect directly to head node HHN01 and data nodes HDN001–HDN012]
PolyBase and DMS
Implemented as a DMS extension: a new bridge component has been added to DMS
The bridge supports pluggable interfaces for heterogeneous data access and abstracts the complexity of Hadoop
A Java Native Interface (JNI) layer provides interoperability with the rest of DMS
DMS shrink-wraps the HDFS bridge with new “external” movement types
PolyBase external table
User perspective: external table, external data source, external file format
Systems perspective: PDW engine service, PDW bridge
Table-level statistics
When an external table is created, table-level statistics are also persisted as metadata on the control node: row count and page count
Table statistics values
Row count: 1000 rows (fixed default)
Page count: based on file size as understood by the Hadoop name node, converted to pages; influenced by compression
What are table statistics good for?
File binding – verifies existence of the file/folder, estimates row length & number of rows, sizes the file
Split generation – calculates the number of “splits” to allocate per compute node
Data export & data movement
Exporting data with CETAS
CETAS = CREATE EXTERNAL TABLE AS SELECT
After the export, three statements will be true:
The external table now exists
The data has been exported
Row & page counts have been updated on the external table
CETAS: additional guidance
The integration point is the file system (HDFS or WASB[S]), not Hive or HCatalog
The target is either a folder or a file and does not have to already exist
The external table name must not already exist in the PDW database
Round-tripping is perfectly possible
On failure, PolyBase makes a one-time best effort at clean-up
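A minimal CETAS sketch following that guidance; the dbo.Sales source and the data source/file format names are carried over from earlier slides, while the external table name, target folder, and predicate are assumptions:

CREATE EXTERNAL TABLE [dbo].[Sales_Archive]
WITH ( LOCATION = '/archive/sales/'   -- target folder need not exist yet
     , DATA_SOURCE = MyDataSourceName
     , FILE_FORMAT = MyFileFormatName
     )
AS
SELECT * FROM dbo.Sales WHERE DateKey < 20140101;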
Hybrid queries
What are hybrid queries?
Read data from multiple external data sources: HDFS, PDW, WASB[S]
Hybrid = a multitude of data sources accessed in a single query
External data movement types
Three basic moves, mirroring the internal movement types:
ExternalRoundRobinMove
ExternalShuffleMove
ExternalBroadcastMove
ExternalRoundRobinMove
SELECT * FROM dbo.HDFS_Web_Sales
Also known as the “random hash”: buffers are re-distributed evenly across the compute nodes
ExternalBroadcastMove
Both tables are external to PDW:
SELECT i_item_id
     , s_store_id
FROM dbo.HDFS_Item
CROSS JOIN dbo.HDFS_Store;
An external broadcast move is used because it is cheaper to broadcast immediately than to import the data and then broadcast it
ExternalShuffleMove (hybrid query)
SELECT i_item_id
     , ws_item_sk
     , SUM(ws_net_profit) NetProfitCurrentMonth
FROM dbo.HDFS_web_sales ws
JOIN dbo.date_dim dd ON ws.ws_sold_date_sk = dd.d_date_sk
JOIN dbo.item i ON ws.ws_item_sk = i.i_item_sk
WHERE dd.d_current_month = 'Y'
GROUP BY i_item_id, ws_item_sk
OPTION (LABEL = 'External Shuffle Move');
Data import & data movement
Return of CTAS
Use CTAS to perform a parallel import of data via PolyBase and persist the results in PDW
Movement types are the same as for hybrid queries
Additional steps are included in the MPP plan: check permissions, create extended properties, update table-level statistics
Importing data with CTAS
CREATE TABLE Agg_ProductProfitCurrentMonth
WITH (DISTRIBUTION = HASH(ws_item_sk))
AS
SELECT i_item_id
     , ws_item_sk
     , SUM(ws_net_profit) NetProfitCurrentMonth
FROM dbo.HDFS_web_sales ws
JOIN dbo.date_dim dd ON ws.ws_sold_date_sk = dd.d_date_sk
JOIN dbo.item i ON ws.ws_item_sk = i.i_item_sk
WHERE dd.d_current_month = 'Y'
GROUP BY i_item_id, ws_item_sk
OPTION (LABEL = 'CTAS : External Shuffle Move');
Split query execution
Split query processing
The PDW engine service (powered by PolyBase) handles data import & export and generates MapReduce jobs
The bridge (DMS) handles job submission to Hadoop
A PolyBase query may result in a MapReduce job – transparent & on the fly
Using split query
The map job is designed to minimize movement: push predicates down to the remote data store to reduce the data volume to transfer
Understanding overheads
Table-level stats only give the size of the table; the selectivity of the data also needs to be considered
Map job output must be persisted in Hadoop
We need additional data to decide!
Column-level statistics
Provide the additional data we need; crucial for cardinality estimation
Enabled for external tables
A manual operation: CREATE / DROP only – no update
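A sketch of the manual lifecycle, reusing the external web-sales table from the surrounding examples; the statistics names are illustrative:

-- Create column-level statistics on the join/filter columns
CREATE STATISTICS stat_ws_item_sk ON dbo.HDFS_web_sales (ws_item_sk);
CREATE STATISTICS stat_ws_sold_date_sk ON dbo.HDFS_web_sales (ws_sold_date_sk);

-- There is no update for external table statistics: drop and recreate instead
DROP STATISTICS dbo.HDFS_web_sales.stat_ws_item_sk;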
Understanding costs
Submitting Hadoop jobs is costly: spin-up time is ~20–30 seconds
Consequently, if the PDW engine estimates (based on stats) an execution time of less than 20–30 seconds, there will be no push-down
Pushdown trigger point
Push-down will not be considered for data transfers < 1 GB per distribution; it is faster to simply import the data
New AU2 query hint: OPTION ({FORCE | DISABLE} EXTERNALPUSHDOWN)
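Hedged usage sketches of the hint, again reusing the external web-sales table; the predicate is illustrative:

-- Force a map job regardless of the cost estimate
SELECT ws_item_sk, ws_net_profit
FROM dbo.HDFS_web_sales
WHERE ws_net_profit > 0
OPTION (FORCE EXTERNALPUSHDOWN);

-- Always import the raw data instead of pushing the predicate down
SELECT ws_item_sk, ws_net_profit
FROM dbo.HDFS_web_sales
WHERE ws_net_profit > 0
OPTION (DISABLE EXTERNALPUSHDOWN);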
Selection: filter rows
Push-able:
SELECT * FROM HDFS_Customer c
WHERE c.account_balance < 20000
Not push-able:
SELECT * FROM HDFS_Customer c
WHERE c.JobTitle IN ('Developer', 'Tester')
Possibly push-able:
SELECT * FROM HDFS_Clickstream c
WHERE c.IP_address BETWEEN '127.0.0.1' AND '127.0.0.7'
Projection: filter columns
Simple projection:
SELECT c.account_balance FROM HDFS_Customer c
Pushdown projection:
SELECT c.first_name + ' ' + c.last_name FROM HDFS_Customer c
Not pushed projection:
SELECT c.first_name
     , c.last_name
     , c.first_name + ' ' + c.last_name
FROM HDFS_Customer c
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.