Microsoft Analytics Platform System

05 – HDI Region Software, Tools & PolyBase
Brian Walker | Microsoft Architect – Data Insights COE
Jesse Fountain | Microsoft WW TSP Lead
September 20, 2018

Agenda
  Region Overview
  High Availability
  Tooling
  Data Loading
  Hive
  PolyBase

Region Overview

Hadoop Region & HDInsight
  HDInsight = Microsoft branding of the Hortonworks distribution (HDP)
  Hadoop region based on HDP 2.0
  Basic authentication
  High availability built into all nodes

Supported Projects
  Hadoop Core (HDFS & MapReduce)
  Oozie
  Hive
  Templeton
  Pig
  Sqoop

Hadoop Region Architecture

Nodes
  Hadoop Region
    Head Node
    Secure Gateway
    Management Node
    Data Node
  Dependency Nodes
    PDW Control Node
    Active Directory
    Virtual Machine Manager

HDInsight APIs
  WebHDFS – Remote HDFS file system management
  WebHCat – Remote job submission and monitoring
  Oozie – Remote workflow submission and scheduling
  HiveServer2 – ODBC connectivity to Hive

Hive ODBC Connector (Excel)
  Setup same as cloud HDI
  Secure Node Cluster IP
  Port 443
  Externally trusted certificate required

High Availability

Hadoop Region HA
  Orchestration and passive host failover behave the same as for PDW
  Data nodes are different
    APS relies on Hadoop data replication for data availability
    Disks are not mirrored
    Data nodes do not fail over
    Replication factor is configurable
  #Scale Units Replication Factor Polybase =1 2 3 >1

HDI Name Node Failure
  HHN01 node marked as failed
  Cluster fails over to HST04
  HST04 already "warm" so failover is fast
  HSN02 persists on HST04
  [Diagram: hosts WFOHST02, HST03, HST04; VMs HHN01, HSN01, HSN02, HMN01; scale-unit hosts HSA07 to HSA12 with data nodes HDN001 to HDN012, ISCSI07 to ISCSI12 and DAS01 to DAS03]

Data Node Failure
  Data node fails
  Data node does not fail over
  Hadoop data replication ensures data is available on other data nodes
  ISCSI VM does not fail over
  Replication is relied upon for availability
  [Diagram: hosts WFOHST02, HST03, HST04; VMs HHN01, HSN01, HMN01; scale-unit hosts HSA07 to HSA12 with data nodes HDN001 to HDN012, ISCSI07 to ISCSI12 and DAS01 to DAS03]

HDInsight Tooling

Loading Data into HDFS

Loading Data From Flat Files
  Tool                  Designed for
  Map Job               Batch loading
  Developer Dashboard   Small files (<20MB)
  WebHDFS               Medium files
  Hive                  From HDFS

Loading Data From a Database
  Tool       Designed for
  PolyBase   Hybrid integration
  Sqoop      Database integration
  If your source database is PDW, then use PolyBase, not Sqoop

Loading Data with Hive
  Statement          Effect
  LOAD DATA          Moves the data into the table
  LOAD DATA LOCAL    Copies the data into the table
  INSERT OVERWRITE   Overwrites existing data in table or partition
  INSERT INTO        Appends to table
  Data must already exist inside HDFS to load data with Hive
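
The four statements above can be sketched in HiveQL as follows. This is an illustration only: the table names (web_sales_raw, web_sales_clean) and the file paths are hypothetical and do not come from the deck.

```sql
-- Hypothetical tables and paths, for illustration only.

-- LOAD DATA moves a file that is already in HDFS into the table's directory.
LOAD DATA INPATH '/staging/web_sales/part-0001.txt'
INTO TABLE web_sales_raw;

-- LOAD DATA LOCAL copies a file from the local file system of the Hive client.
LOAD DATA LOCAL INPATH '/tmp/web_sales_extra.txt'
INTO TABLE web_sales_raw;

-- INSERT OVERWRITE replaces whatever the target table currently holds.
INSERT OVERWRITE TABLE web_sales_clean
SELECT * FROM web_sales_raw WHERE ws_item_sk IS NOT NULL;

-- INSERT INTO appends to the existing contents of the target table.
INSERT INTO TABLE web_sales_clean
SELECT * FROM web_sales_raw WHERE ws_item_sk IS NULL;
```

After the first statement the source file no longer sits at its original HDFS path, which is the practical difference between "moves" and "copies".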

Working with Hive

Creating a Database in Hive
  CREATE DATABASE [IF NOT EXISTS] db_name
    [COMMENT database_comment]
    [LOCATION hdfs_path]
    [WITH DBPROPERTIES (property_name=property_value, ...)];

Creating a Database in Hive
  Only CREATE DATABASE is required
  If you specify a location you must have created the location first
  If you don't specify a location Hadoop will create the database in the /hive/ folder
  You can create properties but you cannot remove them
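
A minimal HiveQL sketch of a database created with every optional clause. The database name, HDFS path and property values are assumptions made up for the example.

```sql
-- Hypothetical name, path and properties, for illustration only.
-- Per the slide, the LOCATION folder should already exist in HDFS.
CREATE DATABASE IF NOT EXISTS sales_staging
COMMENT 'Staging area for sales extracts'
LOCATION '/data/sales_staging'
WITH DBPROPERTIES ('owner' = 'bi_team', 'purpose' = 'staging');

-- Shows the comment, location and properties that were stored.
DESCRIBE DATABASE EXTENDED sales_staging;
```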

Creating a Hive Table
  Create Table
    Data Type
    Sorting / Bucketing
    Comment (table & column)
    Row Format
    Partitions
    Storage Format
    Clustering
  Create External Table
    Specify an alternate location in HDFS
    When dropped only the definition is dropped, not the data
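
As a sketch of how several of these options combine, the HiveQL below defines an external table with column and table comments, a partition column, a row format, a storage format and an explicit HDFS location. The table name, columns and path are hypothetical.

```sql
-- Hypothetical table, columns and path, for illustration only.
CREATE EXTERNAL TABLE IF NOT EXISTS web_clicks (
    click_time  STRING  COMMENT 'Event timestamp as text',
    ip_address  STRING,
    url         STRING
)
COMMENT 'Raw clickstream files'
PARTITIONED BY (click_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/data/clickstream/web_clicks';

-- Because the table is external, this removes only the definition;
-- the files under /data/clickstream/web_clicks stay in HDFS.
DROP TABLE web_clicks;
```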

Creating Hive Tables
  Create Table as Select
    Defines and populates table with SELECT
    Cannot be partitioned
    Cannot be an external table
  Create Table Like
    Copies the table definition without copying the data
    Can create a table based on a view definition
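
Both forms in a short HiveQL sketch. The source table web_sales and the new table names are assumptions for the example.

```sql
-- Hypothetical table names, for illustration only.

-- CREATE TABLE AS SELECT: the column list comes from the query
-- and the table is populated in the same statement.
CREATE TABLE item_profit AS
SELECT ws_item_sk, SUM(ws_net_profit) AS net_profit
FROM web_sales
GROUP BY ws_item_sk;

-- CREATE TABLE LIKE: copies the definition only, so the new table starts empty.
CREATE TABLE web_sales_empty LIKE web_sales;
```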

Querying Data
  Looks and behaves like SQL, but is not SQL
  HQL – not SQL
  Hive does not offer the guarantees of an RDBMS
  No updates or deletes
  Hive queries data inside HDFS only
  Whole partitions can be re-written

What is Polybase?

PolyBase unites
  Unstructured data
  Structured data
  Business data
…for a better together world of analytics

Agnostic architecture
  PolyBase is agnostic = no vendor lock-in
  PolyBase supports Hadoop on Linux & Windows
  PolyBase integrates with the cloud
  PolyBase supports HDInsight in APS & external Hadoop clusters

What’s the sweet spot for PolyBase?
                        Consumer        Analyst      Scientist
  Data Volume           Medium to Low   Reasonable   High -> Huge
  Degree of Structure   Very High       Some         Low -> None
  Number of Users: Medium, Low
  Transformation Complexity: Medium to High, High
  Analytics Complexity
  Partial fit for PolyBase today
    Structure possibly absent on data
    Good option for data delivery & transform
  Great for PolyBase
    Structured data
    Iterative query response
    Hybrid queries across sources
  Good for PolyBase
    Highly structured data
    Fast interactive response

PolyBase builds the bridge
  Just-in-time data integration
    Across relational and non-relational data
  High performance parallel architecture
    Fast, simple data loading
  Best of both worlds
    Uses computational power at source for both relational data & Hadoop
  Opportunity for new types of analysis
  Uses existing analytical skills
    Familiar SQL semantics & behaviour
  Query with familiar tools
    SSDT
    Includes Power BI
  PolyBase = run time integration

So what is PolyBase?
  Answer: Component of the PDW Region in APS
    Highly parallelised distributed query engine accessing heterogeneous data via SQL
  Answer: Unique innovative technology
  Answer: Seamless integration

Any data in any format

Deployment choices
  Hortonworks Hadoop on Windows (External)
  Hadoop on Linux (External)
  Cloudera CDH on Linux (External)
  HDInsight on APS (Internal)
  HDInsight on WASB (External)

External tables
  Metadata used to describe external data
  Enable data access outside the PDW region
  Never hold data
  Do not delete data when dropped

Create external table
  CREATE EXTERNAL TABLE [dbo].[Sales]
  ( [ProductKey]    int   NOT NULL
  , [StoreKey]      int   NOT NULL
  , [DateKey]       int   NOT NULL
  , [CustomerKey]   int   NOT NULL
  , [PromotionKey]  int   NOT NULL
  , [OrderQuantity] int   NOT NULL
  , [UnitPrice]     money NOT NULL
  , [SalesAmount]   money NOT NULL
  )
  WITH
  ( LOCATION = 'hdfs://filepath_or_directory'
  , DATA_SOURCE = MyDataSourceName
  , FILE_FORMAT = MyFileFormatName
  , REJECT_TYPE = VALUE
  , REJECT_VALUE = 0
  , REJECT_SAMPLE_VALUE = 1000
  );
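
The external table definition refers to a data source and a file format by name, so both objects have to exist before the table is created. A minimal sketch of what they might look like is below. The name node address and the pipe delimiter are assumptions, and the exact set of supported options depends on the APS update level.

```sql
-- Assumed name node address; replace with the address of the target cluster.
CREATE EXTERNAL DATA SOURCE MyDataSourceName
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.10.10.10:8020'
);

-- Assumed layout: pipe-delimited text files.
CREATE EXTERNAL FILE FORMAT MyFileFormatName
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);
```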

Parallel data transfer

Parallel transfer concepts
  Maximize throughput
    Every compute node in PDW sees every data node in Hadoop
    Ensure direct connections are established between all scale-out nodes of PDW & Hadoop
  Balanced execution
    Ensure all nodes are equally busy when reading and writing data

Maximizing throughput
  [Diagram: PDW nodes CTL01 and CMP01 to CMP06; Hadoop nodes HHN01 and HDN001 to HDN012]

PolyBase and DMS
  Implemented as a DMS extension
  A new bridge component has been added to DMS
  Bridge supports pluggable interfaces for heterogeneous data access
  Bridge abstracts the complexity of Hadoop
  A Java Native Interface (JNI) layer provides interoperability with the rest of DMS
  DMS shrink-wraps the HDFS Bridge with new "external" movement types

PolyBase
  User perspective
    External Table
    External Data Source
    External File Format
  Systems perspective
    PDW Engine Service
    PDW Bridge

Table-level statistics
  When an external table is created, table-level statistics are also persisted as metadata on the control node
    Row count
    Page count

Table statistics values
  Row count
    1000 rows
    Fixed default
  Page count
    Based on file size as understood by the Hadoop name node
    Converted to pages
    Influenced by compression

What are table statistics good for?
  File binding
    Verifies existence of file/folder
    Estimates row length & number of rows
    Sizes the file
  Split generation
    Calculates # of "splits" to allocate per compute node

Data export & data movement

Exporting data with CETAS
  CETAS – CREATE EXTERNAL TABLE AS SELECT
  Post export, three statements will be true
    External table will now exist
    Data will have been exported
    Row & page count updated on external table
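
A hedged sketch of a CETAS export. The external table name, the target folder and the source table dbo.FactSales are hypothetical; the data source and file format names are borrowed from the external table slide and are assumed to exist already.

```sql
-- Hypothetical names and path, for illustration only.
-- Creates the external table definition and writes the query result
-- to the target folder in one statement.
CREATE EXTERNAL TABLE dbo.HDFS_Sales_Export
WITH (
    LOCATION = '/export/sales/',
    DATA_SOURCE = MyDataSourceName,
    FILE_FORMAT = MyFileFormatName
)
AS
SELECT ProductKey
,      SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
GROUP BY ProductKey;
```

Once the statement completes, dbo.HDFS_Sales_Export can be queried like any other external table, which is what makes round-tripping possible.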

CETAS: Additional guidance
  Integration point is the file system
    HDFS or WASB[S]
    Not Hive or HCatalog
  Target is either a folder or a file
  Target does not have to already exist
  External table name must not exist in the PDW database
  Round-tripping is perfectly possible
  PolyBase will make a one-time best effort at clean-up

Hybrid queries

What are hybrid queries?
  Read data from multiple external data sources
    HDFS
    PDW
    WASB[S]
  Hybrid = multitude of data sources accessed in a single query

External data movement types
  Three basic moves mirroring internal movement
    ExternalRoundRobinMove
    ExternalShuffleMove
    ExternalBroadcastMove

ExternalRoundRobinMove
  SELECT * FROM dbo.HDFS_Web_Sales
  Also known as the Random Hash
  Buffers re-distributed evenly across the compute nodes

ExternalBroadcastMove
  Both tables are external to PDW
  SELECT i_item_id
  ,      s_store_id
  FROM dbo.HDFS_Item
  CROSS JOIN dbo.HDFS_Store
  ;
  An external broadcast move is used because it is cheaper to broadcast immediately than to import the data and then broadcast

ExternalShuffleMove
  Hybrid query
  SELECT i_item_id
  ,      ws_item_sk
  ,      SUM(ws_net_profit) NetProfitCurrentMonth
  FROM dbo.HDFS_web_sales ws
  JOIN dbo.date_dim dd ON ws.ws_sold_date_sk = dd.d_date_sk
  JOIN dbo.item i ON ws.ws_item_sk = i.i_item_sk
  WHERE dd.d_current_month = 'Y'
  GROUP BY i_item_id, ws_item_sk
  OPTION (LABEL = 'External Shuffle Move')
  ;

Data import & data movement

Return of CTAS
  Use CTAS to
    Perform a parallel import of data via PolyBase
    Movement types are the same as hybrid
  Additional steps included in the MPP plan
    Persist the results in PDW
    Check permissions
    Create extended properties
    Update table-level statistics

Importing data with CTAS
  CREATE TABLE Agg_ProductProfitCurrentMonth
  WITH (DISTRIBUTION = HASH(ws_item_sk))
  AS
  SELECT i_item_id
  ,      ws_item_sk
  ,      SUM(ws_net_profit) NetProfitCurrentMonth
  FROM dbo.HDFS_web_sales ws
  JOIN dbo.date_dim dd ON ws.ws_sold_date_sk = dd.d_date_sk
  JOIN dbo.item i ON ws.ws_item_sk = i.i_item_sk
  WHERE dd.d_current_month = 'Y'
  GROUP BY i_item_id, ws_item_sk
  OPTION (LABEL = 'CTAS : External Shuffle Move')
  ;

Split query execution

Split query processing
  PDW Engine Service (powered with PolyBase)
    Data import & export
    Generate MapReduce jobs
  Bridge (DMS)
    Job submission (Hadoop)
  Maybe as a result of PolyBase MR!
  Transparent & on the fly

Using split query
  Map job designed to minimize movement
    Push predicates down to remote data store
    Reduce data volume to transfer

Understanding overheads
  Table-level stats only give the size of the table
  Selectivity of data needs to be considered
  Map job output must be persisted in Hadoop
  Need additional data to decide!

Column level statistics
  Provide the additional data we need
  Crucial for cardinality estimation
  Enabled for external tables
  Manual operation
  CREATE / DROP only – not UPDATE
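
Since only CREATE and DROP are available, refreshing column statistics on an external table means dropping and re-creating them. A sketch is below, using the HDFS_Customer table from the pushdown examples later in the deck; the statistics name and the dbo schema are assumptions.

```sql
-- Hypothetical statistics name; dbo schema assumed.
CREATE STATISTICS stat_HDFS_Customer_account_balance
ON dbo.HDFS_Customer (account_balance)
WITH FULLSCAN;

-- There is no UPDATE for external tables, so a refresh is a drop
-- followed by a fresh CREATE STATISTICS.
DROP STATISTICS dbo.HDFS_Customer.stat_HDFS_Customer_account_balance;
```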

Understanding costs
  Submitting Hadoop jobs is costly
    Spin-up time ~20-30 seconds
  Consequently…
    If PDW Engine estimates (based on stats) an execution time of less than seconds there will be no push down

Pushdown trigger point
  Push down will not be considered for:
    Data transfers < 1GB per distribution
    Faster to simply import the data
  New AU2 query hint
    OPTION (FORCE | DISABLE EXTERNALPUSHDOWN)
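
The hint goes at the end of the statement, as in this sketch. The query reuses the HDFS_Customer example from the selection slide; the dbo schema is an assumption.

```sql
-- Ask for pushdown even if the optimizer would not choose it.
SELECT c.*
FROM dbo.HDFS_Customer c
WHERE c.account_balance < 20000
OPTION (FORCE EXTERNALPUSHDOWN);

-- Rule out pushdown so the data is imported and filtered inside PDW.
SELECT c.*
FROM dbo.HDFS_Customer c
WHERE c.account_balance < 20000
OPTION (DISABLE EXTERNALPUSHDOWN);
```

Running the same query with each hint and comparing elapsed times is one way to check whether the optimizer's default choice suits a given data set.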

Selection: filter rows
  Push-able:
    SELECT * FROM HDFS_Customer c WHERE c.account_balance < 20000
  Not push-able:
    SELECT * FROM HDFS_Customer c WHERE c.JobTitle IN ('Developer', 'Tester')
  Possibly push-able:
    SELECT * FROM HDFS_Clickstream c WHERE c.IP_address BETWEEN AND

Projection: filter columns
  Simple projection:
    SELECT c.ac FROM HDFS_Customer c
  Pushdown projection:
    SELECT c.first_name+' '+c.last_name FROM HDFS_Customer c
  Not pushed projection:
    SELECT c.first_name , c.last_name , c.first_name+' '+c.last_name FROM HDFS_Clickstream c

9/20/2018 © 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
