Microsoft Analytics Platform System
05 – HDI Region Software, Tools & PolyBase
Brian Walker | Microsoft Architect – Data Insights COE
Jesse Fountain | Microsoft WW TSP Lead
September 20, 2018
Agenda
Region Overview
High Availability
Tooling
Data Loading
Hive
PolyBase
Region Overview
Hadoop Region & HDInsight
HDInsight = Microsoft branding of the Hortonworks distribution (HDP)
Hadoop region based on HDP 2.0
Basic authentication
High availability built into all nodes
Supported Projects
Hadoop Core (HDFS & MapReduce)
Oozie
Hive
Templeton
Pig
Sqoop
Hadoop Region Architecture
Nodes
Hadoop region – head node, secure gateway, management node, data node
Dependency nodes – PDW control node, Active Directory, Virtual Machine Manager
HDInsight APIs
WebHDFS – remote HDFS file system management
WebHCat – remote job submission and monitoring
Oozie – remote workflow submission and scheduling
HiveServer2 – ODBC connectivity to Hive
Hive ODBC Connector (Excel)
Setup is the same as for cloud HDInsight
Connect to the secure node cluster IP on port 443
Externally trusted certificate required
High Availability
Hadoop Region HA
Orchestration and passive host failover behave the same as for PDW
Data nodes are different: APS relies on Hadoop data replication for data availability
Disks are not mirrored and data nodes do not fail over
Replication factor is configurable:
# Scale Units | Replication Factor
= 1           | 2
> 1           | 3
HDI Name Node Failure
Head node HHN01 is marked as failed
HSN02 persists on HST04
The cluster fails over to HST04
HST04 is already “warm”, so failover is fast
Data Node Failure
A data node fails; data nodes do not fail over
Hadoop data replication ensures the data is available on other data nodes
The iSCSI VM does not fail over
Replication is relied upon for availability
HDInsight Tooling
Loading Data into HDFS
Loading Data From Flat Files
Developer Dashboard – designed for small files (<20 MB)
WebHDFS – designed for medium files and batch loading via a map job
Hive – designed for data already in HDFS
Loading Data From a Database
PolyBase – designed for hybrid integration
Sqoop – designed for database integration
If your source database is PDW, use PolyBase, not Sqoop
Loading Data with Hive
LOAD DATA – moves the data into the table
LOAD DATA LOCAL – copies the data into the table
INSERT OVERWRITE – overwrites existing data in the table or partition
INSERT INTO – appends to the table
Data must already exist inside HDFS to load data with Hive
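A minimal HiveQL sketch of the four options; the sales table, the staging paths, and the sales_staging table are hypothetical names, not from the deck:

-- Move a file that is already in HDFS into the table (the file is relocated)
LOAD DATA INPATH '/data/staging/sales.txt' INTO TABLE sales;
-- Copy a file from the local file system into the table
LOAD DATA LOCAL INPATH '/tmp/sales.txt' INTO TABLE sales;
-- Overwrite the existing contents of the table (or a partition)
INSERT OVERWRITE TABLE sales SELECT * FROM sales_staging;
-- Append rows to the table
INSERT INTO TABLE sales SELECT * FROM sales_staging;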
Working with Hive
Creating a Database in Hive
CREATE DATABASE [IF NOT EXISTS] db_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
Creating a Database in Hive
Only CREATE DATABASE is required
If you specify a location, you must have created the location first
If you don’t specify a location, Hadoop creates the database in the /hive/ folder
You can create properties, but you cannot remove them
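A hedged example pulling these rules together; the database name, location, and property are illustrative only:

CREATE DATABASE IF NOT EXISTS sales_db
COMMENT 'Sales data landing area'
LOCATION '/data/sales_db'               -- the folder must already exist in HDFS
WITH DBPROPERTIES ('owner' = 'etl');    -- properties can be added, never removed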
Creating a Hive Table
CREATE TABLE – data types, sorting/bucketing, comments (table & column), row format, partitions, storage format, clustering
CREATE EXTERNAL TABLE – specify an alternate location in HDFS; when dropped, only the definition is dropped, not the data
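A HiveQL sketch combining several of these options into one external table; the table name, columns, delimiter, and location are assumptions, not from the deck:

CREATE EXTERNAL TABLE IF NOT EXISTS weblogs
( ip_address STRING COMMENT 'Client IP'
, bytes_sent INT
)
COMMENT 'Raw web logs'
PARTITIONED BY (log_date STRING)
CLUSTERED BY (ip_address) SORTED BY (bytes_sent) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/weblogs';   -- dropping the table leaves this data in place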
Creating Hive Tables
CREATE TABLE AS SELECT – defines and populates the table with a SELECT; cannot be partitioned; cannot be an external table
CREATE TABLE LIKE – copies the table definition without copying the data; can create a table based on a view definition
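Minimal sketches of both forms, assuming a hypothetical customers table:

-- Define and populate in one step (cannot be partitioned or external)
CREATE TABLE big_spenders AS
SELECT * FROM customers WHERE account_balance > 100000;

-- Copy only the definition; no data is copied
CREATE TABLE customers_archive LIKE customers;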
Querying Data
HQL looks and behaves like SQL, but it is not SQL
Hive does not offer the guarantees of an RDBMS
No UPDATE or DELETE; whole partitions can be rewritten instead
Hive queries data inside HDFS only
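Because individual rows cannot be updated or deleted, a whole partition is rewritten instead; a sketch assuming the hypothetical partitioned weblogs table above and a weblogs_staging source:

INSERT OVERWRITE TABLE weblogs PARTITION (log_date = '2014-01-15')
SELECT ip_address, bytes_sent
FROM weblogs_staging
WHERE log_date = '2014-01-15';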
What is Polybase?
PolyBase unites unstructured data, structured data, and business data …for a better-together world of analytics
Agnostic architecture
PolyBase is agnostic = no vendor lock-in
PolyBase supports Hadoop on Linux & Windows
PolyBase integrates with the cloud
PolyBase supports HDInsight in APS & external Hadoop clusters
What’s the sweet spot for PolyBase?
Consumer – medium-to-low data volume; very high degree of structure; good for PolyBase: highly structured data, fast interactive response
Analyst – reasonable data volume; some structure; medium number of users; medium-to-high transformation complexity; great for PolyBase: structured data, iterative query response, hybrid queries across sources
Scientist – high-to-huge data volume; low-to-no structure; low number of users; high transformation complexity; partial fit for PolyBase today: structure possibly absent on the data, but a good option for data delivery & transform
PolyBase builds the bridge
Just-in-time data integration across relational and non-relational data
High-performance parallel architecture; fast, simple data loading
Best of both worlds: uses computational power at the source for both relational data & Hadoop
Opportunity for new types of analysis
Uses existing analytical skills: familiar SQL semantics & behaviour
Query with familiar tools: SSDT, Power BI
PolyBase = run-time integration
So what is PolyBase?
Answer: a component of the PDW region in APS – a highly parallelised, distributed query engine accessing heterogeneous data via SQL
Answer: unique, innovative technology
Answer: seamless integration
Any data in any format
Deployment choices
Hortonworks HDP on Windows (external)
Hortonworks HDP on Linux (external)
Cloudera CDH on Linux (external)
HDInsight on APS (internal)
HDInsight on WASB (external)
External tables
Metadata used to describe external data
Enable data access outside the PDW region
Never hold data
Do not delete data when dropped
Create external table
CREATE EXTERNAL TABLE [dbo].[Sales]
( [ProductKey] int NOT NULL
, [StoreKey] int NOT NULL
, [DateKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [OrderQuantity] int NOT NULL
, [UnitPrice] money NOT NULL
, [SalesAmount] money NOT NULL
)
External tables
WITH ( LOCATION = 'hdfs://filepath_or_directory'
     , DATA_SOURCE = MyDataSourceName
     , FILE_FORMAT = MyFileFormatName
     , REJECT_TYPE = VALUE
     , REJECT_VALUE = 0
     , REJECT_SAMPLE_VALUE = 1000
     );
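The data source and file format referenced above are created separately; a sketch using PolyBase DDL as it appears in later releases (exact options vary by appliance update), with the name node address and the pipe delimiter as placeholder assumptions:

CREATE EXTERNAL DATA SOURCE MyDataSourceName
WITH ( TYPE = HADOOP
     , LOCATION = 'hdfs://10.10.10.10:8020'   -- Hadoop name node, placeholder address
     );

CREATE EXTERNAL FILE FORMAT MyFileFormatName
WITH ( FORMAT_TYPE = DELIMITEDTEXT
     , FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
     );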
Parallel data transfer
Parallel transfer concepts
Maximize throughput: every compute node in PDW sees every data node in Hadoop; direct connections are established between all scale-out nodes of PDW & Hadoop
Balanced execution: ensure all nodes are equally busy when reading and writing data
Maximizing throughput
[Diagram: control node CTL01 and compute nodes CMP01–CMP06 connect directly to head node HHN01 and data nodes HDN001–HDN012]
PolyBase and DMS
Implemented as a DMS extension: a new bridge component has been added to DMS
The bridge supports pluggable interfaces for heterogeneous data access and abstracts the complexity of Hadoop
A Java Native Interface (JNI) layer provides interoperability with the rest of DMS
DMS shrink-wraps the HDFS bridge with new “external” movement types
PolyBase external table
User perspective: external table, external data source, external file format
Systems perspective: PDW engine service, PDW bridge
Table-level statistics
When an external table is created, table-level statistics are also persisted as metadata on the control node: row count and page count
Table statistics values
Row count: 1000 rows (fixed default)
Page count: based on file size as understood by the Hadoop name node, converted to pages; influenced by compression
What are table statistics good for?
File binding – verifies existence of the file/folder, estimates row length & number of rows, sizes the file
Split generation – calculates the number of “splits” to allocate per compute node
Data export & data movement
Exporting data with CETAS
CETAS = CREATE EXTERNAL TABLE AS SELECT
After the export, three statements will be true:
The external table now exists
The data has been exported
Row & page counts have been updated on the external table
CETAS: additional guidance
The integration point is the file system (HDFS or WASB[S]), not Hive or HCatalog
The target is either a folder or a file and does not have to already exist
The external table name must not already exist in the PDW database
Round-tripping is perfectly possible
On failure, PolyBase makes a one-time best effort at clean-up
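A minimal CETAS sketch following that guidance; the dbo.Sales source and the data source/file format names are carried over from earlier slides, while the external table name, target folder, and predicate are assumptions:

CREATE EXTERNAL TABLE [dbo].[Sales_Archive]
WITH ( LOCATION = '/archive/sales/'   -- target folder need not exist yet
     , DATA_SOURCE = MyDataSourceName
     , FILE_FORMAT = MyFileFormatName
     )
AS
SELECT * FROM dbo.Sales WHERE DateKey < 20140101;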
Hybrid queries
What are hybrid queries?
Read data from multiple external data sources: HDFS, PDW, WASB[S]
Hybrid = a multitude of data sources accessed in a single query
External data movement types
Three basic moves, mirroring the internal movement types:
ExternalRoundRobinMove
ExternalShuffleMove
ExternalBroadcastMove
ExternalRoundRobinMove
SELECT * FROM dbo.HDFS_Web_Sales
Also known as the “random hash”: buffers are re-distributed evenly across the compute nodes
ExternalBroadcastMove
Both tables are external to PDW:
SELECT i_item_id
     , s_store_id
FROM dbo.HDFS_Item
CROSS JOIN dbo.HDFS_Store;
An external broadcast move is used because it is cheaper to broadcast immediately than to import the data and then broadcast it
ExternalShuffleMove (hybrid query)
SELECT i_item_id
     , ws_item_sk
     , SUM(ws_net_profit) NetProfitCurrentMonth
FROM dbo.HDFS_web_sales ws
JOIN dbo.date_dim dd ON ws.ws_sold_date_sk = dd.d_date_sk
JOIN dbo.item i ON ws.ws_item_sk = i.i_item_sk
WHERE dd.d_current_month = 'Y'
GROUP BY i_item_id, ws_item_sk
OPTION (LABEL = 'External Shuffle Move');
Data import & data movement
Return of CTAS
Use CTAS to perform a parallel import of data via PolyBase and persist the results in PDW
Movement types are the same as for hybrid queries
Additional steps are included in the MPP plan: check permissions, create extended properties, update table-level statistics
Importing data with CTAS
CREATE TABLE Agg_ProductProfitCurrentMonth
WITH (DISTRIBUTION = HASH(ws_item_sk))
AS
SELECT i_item_id
     , ws_item_sk
     , SUM(ws_net_profit) NetProfitCurrentMonth
FROM dbo.HDFS_web_sales ws
JOIN dbo.date_dim dd ON ws.ws_sold_date_sk = dd.d_date_sk
JOIN dbo.item i ON ws.ws_item_sk = i.i_item_sk
WHERE dd.d_current_month = 'Y'
GROUP BY i_item_id, ws_item_sk
OPTION (LABEL = 'CTAS : External Shuffle Move');
Split query execution
Split query processing
The PDW engine service (powered by PolyBase) handles data import & export and generates MapReduce jobs
The bridge (DMS) handles job submission to Hadoop
A PolyBase query may result in a MapReduce job – transparent & on the fly
Using split query
The map job is designed to minimize movement: push predicates down to the remote data store to reduce the data volume to transfer
Understanding overheads
Table-level stats only give the size of the table; the selectivity of the data also needs to be considered
Map job output must be persisted in Hadoop
We need additional data to decide!
Column-level statistics
Provide the additional data we need; crucial for cardinality estimation
Enabled for external tables
A manual operation: CREATE / DROP only – no update
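A sketch of the manual lifecycle, reusing the external web-sales table from the surrounding examples; the statistics names are illustrative:

-- Create column-level statistics on the join/filter columns
CREATE STATISTICS stat_ws_item_sk ON dbo.HDFS_web_sales (ws_item_sk);
CREATE STATISTICS stat_ws_sold_date_sk ON dbo.HDFS_web_sales (ws_sold_date_sk);

-- There is no update for external table statistics: drop and recreate instead
DROP STATISTICS dbo.HDFS_web_sales.stat_ws_item_sk;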
Understanding costs
Submitting Hadoop jobs is costly: spin-up time is ~20–30 seconds
Consequently, if the PDW engine estimates (based on stats) an execution time of less than 20–30 seconds, there will be no push-down
Pushdown trigger point
Push-down will not be considered for data transfers < 1 GB per distribution; it is faster to simply import the data
New AU2 query hint: OPTION ({FORCE | DISABLE} EXTERNALPUSHDOWN)
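Hedged usage sketches of the hint, again reusing the external web-sales table; the predicate is illustrative:

-- Force a map job regardless of the cost estimate
SELECT ws_item_sk, ws_net_profit
FROM dbo.HDFS_web_sales
WHERE ws_net_profit > 0
OPTION (FORCE EXTERNALPUSHDOWN);

-- Always import the raw data instead of pushing the predicate down
SELECT ws_item_sk, ws_net_profit
FROM dbo.HDFS_web_sales
WHERE ws_net_profit > 0
OPTION (DISABLE EXTERNALPUSHDOWN);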
Selection: filter rows
Push-able:
SELECT * FROM HDFS_Customer c
WHERE c.account_balance < 20000
Not push-able:
SELECT * FROM HDFS_Customer c
WHERE c.JobTitle IN ('Developer', 'Tester')
Possibly push-able:
SELECT * FROM HDFS_Clickstream c
WHERE c.IP_address BETWEEN '127.0.0.1' AND '127.0.0.7'
Projection: filter columns
Simple projection:
SELECT c.account_balance FROM HDFS_Customer c
Pushdown projection:
SELECT c.first_name + ' ' + c.last_name FROM HDFS_Customer c
Not pushed projection:
SELECT c.first_name
     , c.last_name
     , c.first_name + ' ' + c.last_name
FROM HDFS_Customer c
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.