Microsoft Analytics Platform System


1 Microsoft Analytics Platform System
05 – HDI Region Software, Tools & PolyBase
Brian Walker | Microsoft Architect – Data Insights COE
Jesse Fountain | Microsoft WW TSP Lead
September 20, 2018

2 Agenda
Region Overview
High Availability
Tooling
Data Loading
Hive
PolyBase

3 Region Overview

4 Hadoop Region & HDInsight
HDInsight = Microsoft branding
Hortonworks distribution (HDP)
Hadoop region based on HDP 2.0
Basic authentication
High Availability built into all nodes

5 Supported Projects
Hadoop Core (HDFS & MapReduce)
Oozie
Hive
Templeton
Pig
Sqoop

6 Hadoop Region Architecture

7 Nodes
Hadoop Region: Head Node, Secure Gateway, Management Node, Data Node
Dependency Nodes: PDW Control Node, Active Directory, Virtual Machine Manager

8 HDInsight APIs
WebHDFS – Remote HDFS file system management
WebHCat – Remote job submission and monitoring
Oozie – Remote workflow submission and scheduling
HiveServer2 – ODBC connectivity to Hive

9 Hive ODBC Connector (Excel)
Setup same as cloud HDI
Secure Node Cluster IP
Port 443
Externally trusted certificate required

10 High Availability

11 Hadoop Region HA
Orchestration and passive host failover behave the same as for PDW
Data nodes are different
APS relies on Hadoop data replication for data availability
Disks are not mirrored
Data nodes do not fail over
Replication factor is configurable
# Scale Units | Replication Factor
= 1 | 2
> 1 | 3
[Slide label: PolyBase]

12 HDI Name Node Failure
HHN01 node marked as failed
HSN02 persists on HST04
Cluster fails over to HST04
HST04 already “warm” so failover is fast
[Diagram: hosts WFOHST02, HST03, HST04; VMs HHN01, HSN01, HMN01; data hosts HSA07 to HSA12 with data nodes HDN001 to HDN012, ISCSI07 to ISCSI12, and storage DAS01 to DAS03]

13 Data Node Failure
Data node fails
Data node does not fail over
Hadoop data replication ensures data is available on other data nodes
ISCSI VM does not fail over
Replication is relied upon for availability
[Diagram: hosts WFOHST02, HST03, HST04; VMs HHN01, HSN01, HMN01; data hosts HSA07 to HSA12 with data nodes HDN001 to HDN012, ISCSI07 to ISCSI12, and storage DAS01 to DAS03]

14 HDInsight Tooling

15

16

17 Loading Data into HDFS

18 Loading Data From Flat Files
Tool | Designed for
Map Job | Batch loading
Developer Dashboard | Small files (<20MB)
WebHDFS | Medium files
Hive | From HDFS

19 Loading Data From a Database
Tool | Designed for
PolyBase | Hybrid integration
Sqoop | Database integration
If your source database is PDW, then use PolyBase, not Sqoop

20 Loading Data with Hive
LOAD DATA | Moves the data into the table
LOAD DATA LOCAL | Copies the data into the table
INSERT OVERWRITE | Overwrites existing data in table or partition
INSERT INTO | Appends to table
Data must already exist inside HDFS to load data with Hive
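
The four statements above could look roughly like this in HiveQL. This is a minimal sketch: the tables `web_sales` and `web_sales_staging`, their columns, the partition column `sold_month`, and all paths are hypothetical and do not come from the deck.

```sql
-- Hypothetical tables, columns, and paths, for illustration only.

-- LOAD DATA: moves a file that is already in HDFS into the table's directory.
LOAD DATA INPATH '/staging/web_sales/2014-01.txt'
INTO TABLE web_sales_staging;

-- LOAD DATA LOCAL: copies a file from the Hive client's local file system into the table.
LOAD DATA LOCAL INPATH '/tmp/web_sales/2014-02.txt'
INTO TABLE web_sales_staging;

-- INSERT OVERWRITE: replaces whatever the target partition currently holds.
INSERT OVERWRITE TABLE web_sales PARTITION (sold_month = '2014-01')
SELECT item_sk, net_profit
FROM web_sales_staging
WHERE sold_date LIKE '2014-01%';

-- INSERT INTO: appends rows and leaves existing data in place.
INSERT INTO TABLE web_sales PARTITION (sold_month = '2014-02')
SELECT item_sk, net_profit
FROM web_sales_staging
WHERE sold_date LIKE '2014-02%';
```

Because LOAD DATA without LOCAL moves the source file rather than copying it, the file no longer exists at its original HDFS path afterwards.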

21 Working with Hive

22 Creating a Database in Hive
CREATE DATABASE [IF NOT EXISTS] db_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];

23 Creating a Database in Hive
Only CREATE DATABASE is required
If you specify a location, you must have created the location first
If you don’t specify a location, Hadoop will create the database in the /hive/ folder
You can create properties but you cannot remove them
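
A filled-in version of the syntax might look like the sketch below. The database name, comment, HDFS path, and property values are all assumptions made for illustration; per the slide, the location is expected to exist before the statement runs.

```sql
-- Hypothetical database name, path, and properties.
CREATE DATABASE IF NOT EXISTS sales_dw
COMMENT 'Sales data landed from flat files'
LOCATION '/data/sales_dw'
WITH DBPROPERTIES ('owner' = 'data-insights', 'created' = '2014-06-01');

-- Show the comment, location, and properties stored for the database.
DESCRIBE DATABASE EXTENDED sales_dw;
```

Only the first line is mandatory; the COMMENT, LOCATION, and WITH DBPROPERTIES clauses can each be left out.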

24 Creating a Hive Table
Create Table: Data Type, Sorting / Bucketing, Comment (table & column), Row Format, Partitions, Storage Format, Clustering
Create External Table: Specify an alternate location in HDFS; when dropped, only the definition is dropped, not the data
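
As a rough illustration of these options, the sketch below defines one managed and one external table. The table names, columns, bucket count, delimiter, and HDFS path are hypothetical.

```sql
-- Managed table: Hive owns the files, so DROP TABLE removes the data too.
CREATE TABLE web_clicks (
    ip_address STRING COMMENT 'Client IP address',
    url        STRING COMMENT 'Requested URL',
    click_time TIMESTAMP
)
COMMENT 'Click stream managed by Hive'
PARTITIONED BY (click_date STRING)
CLUSTERED BY (ip_address) SORTED BY (click_time) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- External table: points at an existing HDFS folder (hypothetical path).
-- DROP TABLE removes only the definition; the files stay in place.
CREATE EXTERNAL TABLE web_clicks_raw (
    ip_address STRING,
    url        STRING,
    click_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/web_clicks';
```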

25 Creating Hive Tables
Create Table as Select: Defines and populates table with SELECT; cannot be partitioned; cannot be an external table
Create Table Like: Copies the table definition without copying the data; can create a table based on a view definition
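
The two forms could be written as follows. This is a sketch only; `web_clicks`, `top_urls`, and `web_clicks_empty` are hypothetical names.

```sql
-- CREATE TABLE AS SELECT: defines the table from the query and populates it.
CREATE TABLE top_urls AS
SELECT url, COUNT(*) AS clicks
FROM web_clicks
GROUP BY url;

-- CREATE TABLE LIKE: copies the definition only; the new table starts empty.
CREATE TABLE web_clicks_empty LIKE web_clicks;
```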

26 Querying Data
Looks and behaves like SQL, but not SQL
HQL – not SQL
Hive does not offer guarantees of RDBMS
No updates or deletes
Hive queries data inside HDFS only
Whole partitions can be re-written

27 What is Polybase?

28 PolyBase unites UNSTRUCTURED DATA STRUCTURED DATA BUSINESS DATA
…for a better together world of analytics

29 Agnostic architecture
PolyBase is agnostic = No vendor lock-in
PolyBase supports Hadoop on Linux & Windows
PolyBase integrates with the cloud
PolyBase supports HDInsight in APS & external Hadoop clusters

30 What’s the sweet spot for PolyBase?
Consumer | Analyst | Scientist
Data Volume: Medium to Low | Reasonable | High -> Huge
Degree of Structure: Very High | Some | Low -> None
Number of Users: Medium | Low
Transformation Complexity: Medium to High | High
Analytics Complexity
Partial fit for PolyBase today: Structure possibly absent on data; good option for data delivery & transform
Great for PolyBase: Structured data; iterative query response; hybrid queries across sources
Good for PolyBase: High structured data; fast interactive response

31 PolyBase builds the bridge
Just-in-Time data integration
Across relational and non-relational data
High performance parallel architecture
Fast, simple data loading
Best of both worlds
Uses computational power at source for both relational data & Hadoop
Opportunity for new types of analysis
Uses existing analytical skills
Familiar SQL semantics & behaviour
Query with familiar tools
SSDT
PolyBase = run time integration
Includes Power BI

32 So what is PolyBase?
Answer: Component of the PDW Region in APS
Highly parallelised distributed query engine accessing heterogeneous data via SQL
Answer: Unique Innovative Technology
Answer: Seamless Integration

33 Any data in any format

34 Deployment choices
Hortonworks Hadoop On Windows (External)
Hadoop On Linux (External)
Cloudera CDH On Linux (External)
HDInsight On APS (Internal)
HDInsight On WASB (External)

35 External tables
Metadata used to describe external data
Enables data access outside the PDW region
Never hold data
Do not delete data when dropped

36 Create external table
CREATE EXTERNAL TABLE [dbo].[Sales]
([ProductKey] int NOT NULL
,[StoreKey] int NOT NULL
,[DateKey] int NOT NULL
,[CustomerKey] int NOT NULL
,[PromotionKey] int NOT NULL
,[OrderQuantity] int NOT NULL
,[UnitPrice] money NOT NULL
,[SalesAmount] money NOT NULL
)

37 External tables
WITH (LOCATION='hdfs://filepath_or_directory'
,DATA_SOURCE = MyDataSourceName
,FILE_FORMAT = MyFileFormatName
,REJECT_TYPE = VALUE
,REJECT_VALUE = 0
,REJECT_SAMPLE_VALUE = 1000
);
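
The WITH clause refers to a data source and a file format that must already be defined. A minimal sketch of those two objects follows; the name node address, port, and field terminator are assumptions, and the options available may differ between APS updates.

```sql
-- Hypothetical name node address and port; substitute the cluster's own URI.
CREATE EXTERNAL DATA SOURCE MyDataSourceName
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.0.0.10:8020'
);

-- Pipe-delimited text files; the terminator is an assumption about the data.
CREATE EXTERNAL FILE FORMAT MyFileFormatName
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        USE_TYPE_DEFAULT = TRUE
    )
);
```

Keeping the connection details and the file layout in separate named objects lets several external tables share them.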

38 Parallel data transfer

39 Parallel transfer concepts
Maximize Throughput
Every compute node in PDW sees every data node in Hadoop
Ensure direct connections are established between all scale-out nodes of PDW & Hadoop
Balanced Execution
Ensure all nodes are equally busy when reading and writing data

40 Maximizing throughput
[Diagram: PDW nodes CTL01 and CMP01 to CMP06; Hadoop nodes HHN01 and HDN001 to HDN012]

41 PolyBase and DMS
Implemented as a DMS extension
A new bridge component has been added to DMS
Bridge supports pluggable interfaces for heterogeneous data access
Bridge abstracts the complexity of Hadoop
A Java Native Interface (JNI) layer provides interoperability with the rest of DMS
DMS shrink wraps HDFS Bridge with new “external” movement types

42 PolyBase
User Perspective: External Table, External Data Source, External File Format
Systems Perspective: PDW Engine Service, PDW Bridge

43 Table-level statistics
When an external table is created, table-level statistics are also persisted as metadata on the control node
Row count
Page count

44 Table statistics values
Row count: 1000 rows; fixed default
Page count: Based on file size as understood by Hadoop name node; converted to pages; influenced by compression

45 What are table statistics good for?
File Binding: Verifies existence of file/folder; estimates row length & number of rows; sizes the file
Split Generation: Calculates # of “splits” to allocate per compute node

46 Data export & data movement

47 Exporting data with CETAS
CETAS – CREATE EXTERNAL TABLE AS SELECT
Post export, three statements will be true:
External table will now exist
Data will have been exported
Row & page count updated on external table

48 CETAS: Additional guidance
Integration point is the file system
HDFS or WASB[s]
Not Hive or HCatalog
Target is either a folder or a file
Target does not have to already exist
External table name must not exist in PDW DB
Round-tripping is perfectly possible
PolyBase will make a one-time best effort at clean-up
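
A CETAS export might be written as in the sketch below. The external table name, target folder, and source table `dbo.web_sales` are hypothetical; `MyDataSourceName` and `MyFileFormatName` are the placeholder names used on the external table slide.

```sql
-- Hypothetical export: aggregate an internal PDW table and write the result to HDFS.
-- The external table name must not already exist in the PDW database.
CREATE EXTERNAL TABLE dbo.HDFS_ProductProfit
WITH (
    LOCATION = '/export/product_profit/',
    DATA_SOURCE = MyDataSourceName,
    FILE_FORMAT = MyFileFormatName
)
AS
SELECT ws_item_sk
     , SUM(ws_net_profit) AS NetProfit
FROM dbo.web_sales
GROUP BY ws_item_sk;
```

Once the statement completes, the same external table can be queried to read the exported files back, which is the round-trip the guidance refers to.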

49 Hybrid queries

50 What are hybrid queries?
Read data from multiple external data sources
HDFS
PDW
WASB[S]
Hybrid = Multitude of data sources accessed in a single query

51 External data movement types
Three basic moves mirroring internal movement
ExternalRoundRobinMove
ExternalShuffleMove
ExternalBroadcastMove

52 ExternalRoundRobinMove
SELECT * FROM dbo.HDFS_Web_Sales
Also known as the Random Hash
Buffers re-distributed evenly across the compute nodes

53 ExternalBroadcastMove
Both tables are external to PDW
SELECT i_item_id
, s_store_id
FROM dbo.HDFS_Item
CROSS JOIN dbo.HDFS_Store ;
An external broadcast move is used as it is cheaper to broadcast immediately than it is to import the data and then broadcast

54 ExternalShuffleMove
Hybrid Query
SELECT i_item_id
, ws_item_sk
, SUM(ws_net_profit) NetProfitCurrentMonth
FROM dbo.HDFS_web_sales ws
JOIN dbo.date_dim dd ON ws.ws_sold_date_sk = dd.d_date_sk
JOIN dbo.item i ON ws.ws_item_sk = i.i_item_sk
WHERE dd.d_current_month = 'Y'
GROUP BY i_item_id, ws_item_sk
OPTION (LABEL = 'External Shuffle Move') ;

55 Data import & data movement

56 Return of CTAS
Use CTAS to
Perform a parallel import of data via PolyBase
Movement types are the same as hybrid
Additional steps included in the MPP plan
Persist the results in PDW
Check permissions
Create extended properties
Update table-level statistics

57 Importing data with CTAS
CREATE TABLE Agg_ProductProfitCurrentMonth
WITH (DISTRIBUTION = HASH(ws_item_sk))
AS
SELECT i_item_id
, ws_item_sk
, SUM(ws_net_profit) NetProfitCurrentMonth
FROM dbo.HDFS_web_sales ws
JOIN dbo.date_dim dd ON ws.ws_sold_date_sk = dd.d_date_sk
JOIN dbo.item i ON ws.ws_item_sk = i.i_item_sk
WHERE dd.d_current_month = 'Y'
GROUP BY i_item_id, ws_item_sk
OPTION (LABEL = 'CTAS : External Shuffle Move') ;

58 Split query execution

59 Split query processing
PDW Engine Service (Powered with PolyBase)
Data Import & Export
Generate MapReduce Jobs
Bridge (DMS)
Job Submission (Hadoop)
Maybe as a result of PolyBase MR!
Transparent & on the fly

60 Using split query
Map Job designed to minimize movement
Push predicates down to remote data store
Reduce data volume to transfer

61 Understanding overheads
Table-level stats only give size of table
Selectivity of data needs to be considered
Map job output must be persisted in Hadoop
Need additional data to decide!

62 Column level statistics
Provides the additional data we need
Crucial for cardinality estimation
Enabled for External Tables
Manual operation
CREATE / DROP only – not UPDATE
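
Since only CREATE and DROP are available, refreshing column statistics on an external table means dropping and re-creating them. The sketch below assumes an external table `dbo.HDFS_Customer` with an `account_balance` column; the statistics name and the `dbo` schema are hypothetical.

```sql
-- Create column-level statistics on an external table (hypothetical statistics name).
CREATE STATISTICS stat_customer_balance
ON dbo.HDFS_Customer (account_balance);

-- There is no UPDATE for external tables: drop and re-create after the data changes.
DROP STATISTICS dbo.HDFS_Customer.stat_customer_balance;

CREATE STATISTICS stat_customer_balance
ON dbo.HDFS_Customer (account_balance);
```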

63 Understanding costs
Submitting Hadoop jobs is costly
Spin-up time ~20-30 seconds
Consequently…
If PDW Engine estimates (based on stats) an execution time of less than seconds there will be no push down

64 Pushdown trigger point
Push down will not be considered for:
Data transfers < 1GB per distribution
Faster to simply import the data
New AU2 query hint
OPTION (FORCE | DISABLE EXTERNALPUSHDOWN)
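
The hint is appended to an individual query. In the sketch below, the table `HDFS_Customer` and the column `account_balance` are taken from the selection example later in the deck; whether push down actually helps depends on the data volume and the Hadoop job start-up cost.

```sql
-- Ask the engine to push the filter down to Hadoop even if its cost
-- estimate would otherwise favour importing the data.
SELECT c.*
FROM HDFS_Customer c
WHERE c.account_balance < 20000
OPTION (FORCE EXTERNALPUSHDOWN);

-- Keep all processing in PDW: import the rows and filter them there.
SELECT c.*
FROM HDFS_Customer c
WHERE c.account_balance < 20000
OPTION (DISABLE EXTERNALPUSHDOWN);
```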

65 Selection: filter rows
SELECT * FROM HDFS_Customer c WHERE c.account_balance < 20000
Push-able
SELECT * FROM HDFS_Customer c WHERE c.JobTitle IN ('Developer', 'Tester')
Not Push-able
SELECT * FROM HDFS_Clickstream c WHERE c.IP_address BETWEEN AND
Possibly Push-able

66 Projection: filter columns
SELECT c.ac FROM HDFS_Customer c
Simple Projection
SELECT c.first_name+' '+c.last_name FROM HDFS_Customer c
Pushdown Projection
SELECT c.first_name , c.last_name , c.first_name+' '+c.last_name FROM HDFS_Clickstream c
Not Pushed Projection

67 Microsoft Analytics Platform System
9/20/2018 © 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

