Published by Randolf Cole. Modified over 9 years ago.
1
Danny Tambs Solution Architect
5
VOLUME (Size) VARIETY (Structure) VELOCITY (Speed)
9
Distributed Storage (HDFS)
Distributed Processing (Map Reduce)
World’s Data (Azure Data Marketplace)
Windows Azure Storage
12
Diagram labels: CONTROL RACK, DATA RACK, Control Node, Mgmt. Node, LZ, Backup Node, Infiniband & Ethernet, Fiber Channel, RACK 1, Infiniband & Ethernet

Per rack details:
128 cores on 8 compute nodes
2 TB of RAM on compute
Up to 168 TB of temp DB
Up to 1 PB of user data

Per rack details:
160 cores on 10 compute nodes
1.28 TB of RAM on compute
Up to 30 TB of temp DB
Up to 150 TB of user data
13
Diagram labels: Host 0, Host 1 (Redundant), Host 2, Host 3, JBOD, IB & Ethernet, Direct attached SAS

Built in block fashion (scale units) to support easy scale-out, from a ¼ rack system to 7 racks.
One standard node type, with 256 GB RAM per node.
Uses SQL Server 2012 underneath.
Infiniband connectivity between nodes.
Moving from SAN to JBODs: a significant reduction in costs. Windows Server 2012 technologies are leveraged to achieve the same level of reliability and robustness.

Scale unit concept:
Base scale unit: minimum populated rack with networking
Capacity scale unit: adding 2 or 3 compute nodes and related storage
Spare scale unit
14
DELL
Configuration      Base  Compute  Spare  Total  Raw disk: 1TB  Raw disk: 3TB  Capacity
Quarter-rack          1        3      1      5          22.65          67.95    340 TB
2 thirds              1        6      1      8          45.3          135.9     680 TB
Full rack             1        9      1     11          67.95         203.85   1019 TB
One and third         2       12      2     15          90.6          271.8    1359 TB
One and 2 thirds      2       15      2     18         113.25         339.75   1699 TB
2 racks               2       18      2     21         135.9          407.7    2039 TB
2 and a third         3       21      3     25         158.55         475.65   2378 TB
2 and 2 thirds        3       24      3     28         181.2          543.6    2718 TB
Three racks           3       27      3     31         203.85         611.55   3058 TB
Four racks            4       36      4     41         271.8          815.4    4077 TB
Five racks            5       45      5     51         339.75        1019.25   5096 TB
Six racks             6       54      6     61         407.7         1223.1    6116 TB

HP
Configuration      Base  Compute  Spare  Total  Raw disk: 1TB  Raw disk: 3TB  Capacity
Quarter-rack          1        2      1      4          15.1           45.3     227 TB
Half                  1        4      1      6          30.2           90.6     453 TB
Three-quarters        1        6      1      8          45.3          135.9     680 TB
Full rack             1        8      1     10          60.4          181.2     906 TB
One-&-quarter         2       10      2     13          75.5          226.5    1133 TB
One-&-half            2       12      2     15          90.6          271.8    1359 TB
Two racks             2       16      2     19         120.8          362.4    1812 TB
Two and a half        3       20      3     24         151            453      2265 TB
Three racks           3       24      3     28         181.2          543.6    2718 TB
Four racks            4       32      4     37         241.6          724.8    3624 TB
Five racks            5       40      5     46         302            906      4530 TB
Six racks             6       48      6     55         362.4         1087.2    5436 TB
Seven racks           7       56      7     64         422.8         1268.4    6342 TB

2 to 56 nodes
Up to 6 PB user data
2 (HP) or 3 (DELL) node increments for small topologies
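The sizing figures above follow a regular pattern, which the short Python sketch below reproduces. The constants are inferred from the numbers on the slide, not taken from any documented formula: 7.55 TB of raw disk per compute node with 1 TB disks, three times that with 3 TB disks, and a capacity figure that matches roughly five times the 3 TB raw value (which may reflect an assumed compression ratio).

```python
# Constants inferred from the slide's numbers; they are assumptions, not a
# documented sizing formula.
RAW_TB_PER_COMPUTE_NODE = 7.55   # e.g. 22.65 TB / 3 compute nodes (1 TB disks)
CAPACITY_FACTOR = 5              # capacity column is about 5 x "Raw disk: 3TB"


def raw_tb(compute_nodes, disk_tb=1):
    """Raw disk in TB for a given compute-node count and disk size (1 or 3 TB)."""
    return compute_nodes * RAW_TB_PER_COMPUTE_NODE * disk_tb


def capacity_tb(compute_nodes):
    """Approximate capacity in TB, derived from the 3 TB raw-disk figure."""
    return raw_tb(compute_nodes, disk_tb=3) * CAPACITY_FACTOR


if __name__ == "__main__":
    for label, nodes in [("DELL quarter-rack", 3),
                         ("HP full rack", 8),
                         ("HP seven racks", 56)]:
        print(f"{label}: raw 1TB={raw_tb(nodes):.2f}, "
              f"raw 3TB={raw_tb(nodes, 3):.2f}, capacity~{capacity_tb(nodes):.0f} TB")
```

The computed values can be compared row by row against the table; they agree to within rounding for the rows checked here.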
15
Diagram labels: Control Node [Shell DB]; Compute Node 1, Compute Node 2, … Compute Node n; DSQL plan; ‘Optimizable query’; Plan Injection; DMS op (e.g. SELECT)
16
Diagram: star schema on an SMP system and its table distribution definition in PDW (Compute Nodes - VHDX)

Date Dim: Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
Store Dim: Store Dim ID, Store Name, Store Mgr, Store Size
Item Dim: Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
Promo Dim: Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
Sales Fact: Date Dim ID, Store Dim ID, Prod Dim ID, Mktg Camp Id, Qty Sold, Dollars Sold

On the compute nodes, the dimension tables (DD, SD, ID, PD) appear on every node, alongside one slice of the Sales Fact per node (SF 1 to SF 5).
17
CREATE TABLE myTable (column Defs) WITH ( DISTRIBUTION = HASH (id));

8 Tables per Node
PDW Node 1: Create Table _a, Create Table _b, … Create Table _h
PDW Node 2: Create Table _a, Create Table _b, … Create Table _h
PDW Node …
PDW Node 8: Create Table _a, Create Table _b, … Create Table _h
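To make the effect of DISTRIBUTION = HASH (id) concrete, here is a minimal Python sketch that maps each id value to one of 8 tables (_a to _h) on one of the nodes. The hash function (MD5) and the bucket-to-node mapping are illustrative assumptions; the slide does not say how PDW computes or assigns hash values.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS_PER_NODE = 8      # "8 Tables per Node" on the slide
SUFFIXES = "abcdefgh"           # table suffixes _a .. _h


def place_row(key, node_count):
    """Return (node number, table suffix) for a distribution-key value.

    MD5 is used only as a stable stand-in hash; PDW's real function is
    not described in the slides.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % (node_count * DISTRIBUTIONS_PER_NODE)
    node, slot = divmod(bucket, DISTRIBUTIONS_PER_NODE)
    return node + 1, "_" + SUFFIXES[slot]


def spread(keys, node_count):
    """Count how many rows land in each (node, table) pair."""
    return Counter(place_row(k, node_count) for k in keys)


if __name__ == "__main__":
    counts = spread(range(100_000), node_count=8)
    print("distributions used:", len(counts))
    print("rows per distribution: min", min(counts.values()),
          "max", max(counts.values()))
```

Equal key values always land in the same distribution, which is what lets joins and aggregations on the distribution column run locally on each node.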
19
Diagram labels:
Non-relational data (Hadoop): Social Apps, Sensor & RFID, Mobile Apps, Web Apps
Relational data (PDW V2): Traditional schema-based DW applications
Enhanced PDW query engine, External Table
Regular T-SQL, Results
Data Scientists, BI Users, DB Admins
20
Parallel Data Transfers between PDW Appliance and Hadoop cluster
21
CREATE EXTERNAL TABLE table_name ({ } [,...n ])
{WITH (LOCATION = ' ', [FORMAT_OPTIONS = ( )])} [;]

1. Indicates ‘External’ Table
2. Required location of Hadoop cluster and file
3. Optional Format Options associated with data import from HDFS
22
:: =
[FIELD_TERMINATOR = 'Value']
[,STRING_DELIMITER = 'Value']
[,DATE_FORMAT = 'Value']
[,REJECT_TYPE = 'Value']
[,REJECT_VALUE = 'Value']
[,REJECT_SAMPLE_VALUE = 'Value']
[,USE_TYPE_DEFAULT = 'Value']

FIELD_TERMINATOR: indicates the column delimiter
STRING_DELIMITER: specifies the delimiter for string data type fields
DATE_FORMAT: specifies a particular date format
REJECT_TYPE: specifies the type of rejection, either value or percentage
REJECT_VALUE: specifies the value or threshold for rejected rows
REJECT_SAMPLE_VALUE: specifies the sample set, for reject type percentage
USE_TYPE_DEFAULT: specifies how missing entries in text files are treated
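The interplay of these options can be sketched as a small Python loader for delimited text. This is an illustration only: the parameter names mirror the format options, but the exact rejection semantics (in particular, checking the percentage once per REJECT_SAMPLE_VALUE attempted rows) are an assumption, and load_rows and TooManyRejects are hypothetical names.

```python
from datetime import datetime


class TooManyRejects(Exception):
    """Raised when rejected rows exceed the configured threshold."""


def load_rows(lines, field_terminator="|", string_delimiter=None,
              date_format="%Y-%m-%d", reject_type="value",
              reject_value=0, reject_sample_value=100):
    """Parse (url, event_date, user_IP) rows; returns (accepted, rejected_count)."""
    accepted, rejected, attempted = [], 0, 0
    for line in lines:
        attempted += 1
        fields = line.rstrip("\n").split(field_terminator)
        if string_delimiter:
            fields = [f.strip(string_delimiter) for f in fields]
        try:
            url, raw_date, user_ip = fields          # wrong column count -> ValueError
            event_date = datetime.strptime(raw_date, date_format).date()
            accepted.append((url, event_date, user_ip))
        except ValueError:
            rejected += 1
        # REJECT_TYPE = value: fail once the absolute count is exceeded.
        if reject_type == "value" and rejected > reject_value:
            raise TooManyRejects(f"{rejected} rejected rows")
        # REJECT_TYPE = percentage: assumed to be checked every sample interval.
        if (reject_type == "percentage"
                and attempted % reject_sample_value == 0
                and 100.0 * rejected / attempted > reject_value):
            raise TooManyRejects(f"{rejected} of {attempted} rows rejected")
    return accepted, rejected
```

For example, load_rows(lines, reject_value=1) tolerates a single malformed row, while the default reject_value=0 aborts on the first one.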
23
Concept of External Tables
24
Direct and parallelized HDFS access
Enhancing PDW’s Data Movement Service (DMS) to allow direct communication between HDFS data nodes and PDW compute nodes

Diagram labels:
Non-relational data (HDFS data nodes): Social Apps, Sensor & RFID, Mobile Apps, Web Apps
Relational data (PDW V2): Traditional schema-based DW applications
Enhanced PDW query engine, External Table, HDFS bridge
Regular T-SQL, Results
25
CREATE EXTERNAL TABLE Statement
Tabular view on hdfs://../employee.tbl
HDFS bridge process: part of the DMS process
File binding & split generation
Hadoop Name Node maintains metadata (file location, file size …)
Parsing of format options
Parser process: part of the ‘regular’ T-SQL parsing process
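The slide does not describe how splits are generated, so the following Python sketch only illustrates the general idea: a file of known size (the kind of metadata the Name Node holds) is cut into byte ranges, and the ranges are bound to readers. The 64 MB split size and the round-robin binding are assumptions, and both function names are hypothetical.

```python
ASSUMED_SPLIT_SIZE = 64 * 1024 * 1024   # 64 MB, an assumed split size


def generate_splits(file_size, split_size=ASSUMED_SPLIT_SIZE):
    """Cut the byte range [0, file_size) into contiguous (offset, length) splits."""
    return [(offset, min(split_size, file_size - offset))
            for offset in range(0, file_size, split_size)]


def bind_splits(splits, reader_count):
    """Assign splits to readers round-robin; returns one list of splits per reader."""
    assignment = [[] for _ in range(reader_count)]
    for index, split in enumerate(splits):
        assignment[index % reader_count].append(split)
    return assignment


if __name__ == "__main__":
    splits = generate_splits(file_size=200 * 1024 * 1024)
    for reader, work in enumerate(bind_splits(splits, reader_count=2), start=1):
        print(f"reader {reader}: {work}")
```

Each reader can then fetch its byte ranges independently, which is what makes the later parallel reads possible.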
26
Diagram labels: Control Node [Shell DB]; Compute Node 1, Compute Node 2, … Compute Node n; Hadoop Name Node; External Table Shell Object
CREATE EXTERNAL TABLE: SHELL-only plan
No actual physical tables on compute nodes
27
Querying non-relational data in HDFS via T-SQL
28
I. Query data in HDFS and display results in table form (via external tables)
II. Join data from HDFS with relational PDW data

Running example: creating external table ‘ClickStream’
CREATE EXTERNAL TABLE ClickStream(url varchar(50), event_date date, user_IP varchar(50))
WITH (LOCATION = 'hdfs://MyHadoop:8020/tpch1GB/employee.tbl', FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));
Text file in HDFS with | as field delimiter

Query Examples
1. Filter query against data in HDFS
SELECT top 10 (url) FROM ClickStream where user_IP = '192.168.0.1'
2. Join data from various files in HDFS (Url_Descr is a second text file)
SELECT url.description FROM ClickStream cs, Url_Descr url WHERE cs.url = url.name and cs.url = 'www.cars.com';
3. Join data from HDFS with data in PDW (User is a distributed PDW table)
SELECT user_name FROM ClickStream cs, User u WHERE cs.user_IP = u.user_IP and cs.url = 'www.microsoft.com';
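For readers without a PDW appliance, the logic of the filter query and the HDFS-to-PDW join can be tried locally. The Python sketch below loads a few made-up '|'-delimited click-stream lines into an in-memory SQLite database and runs analogous queries. SQLite stands in for PDW here, the sample rows and user names are invented, and the table is named Users rather than User purely for the demo.

```python
import sqlite3

# Invented sample data in the same '|'-delimited layout as the running example.
CLICKSTREAM_TEXT = """www.cars.com|2013-09-03|192.168.0.1
www.microsoft.com|2013-09-03|192.168.0.2
www.microsoft.com|2013-09-04|192.168.0.1
"""


def build_demo_db():
    """Create an in-memory database with ClickStream and Users tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ClickStream (url TEXT, event_date TEXT, user_IP TEXT)")
    db.execute("CREATE TABLE Users (user_name TEXT, user_IP TEXT)")
    rows = [line.split("|") for line in CLICKSTREAM_TEXT.splitlines()]
    db.executemany("INSERT INTO ClickStream VALUES (?, ?, ?)", rows)
    db.executemany("INSERT INTO Users VALUES (?, ?)",
                   [("alice", "192.168.0.1"), ("bob", "192.168.0.2")])
    return db


def urls_for_ip(db, ip):
    """Analogue of query 1: filter click-stream rows by user IP."""
    cursor = db.execute(
        "SELECT url FROM ClickStream WHERE user_IP = ? LIMIT 10", (ip,))
    return sorted(row[0] for row in cursor)


def users_who_visited(db, url):
    """Analogue of query 3: join click-stream rows with a relational user table."""
    cursor = db.execute(
        "SELECT u.user_name FROM ClickStream cs "
        "JOIN Users u ON cs.user_IP = u.user_IP WHERE cs.url = ?", (url,))
    return sorted(row[0] for row in cursor)


if __name__ == "__main__":
    demo = build_demo_db()
    print(urls_for_ip(demo, "192.168.0.1"))
    print(users_who_visited(demo, "www.microsoft.com"))
```

In PDW the ClickStream side would stay in HDFS and be read through the external table at query time; the local copy here exists only to make the query shapes runnable.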
29
Diagram labels:
Enhanced PDW query engine: SELECT, Results, External Table
HDFS bridge: DMS Reader 1 … DMS Reader N
Non-relational data (HDFS data nodes): Social Apps, Sensor & RFID, Mobile Apps, Web Apps
Relational data (PDW V2): Traditional schema-based DW applications
Parallel HDFS Reads, Parallel Importing
30
SELECT FROM EXTERNAL TABLE: DSQL plan with external DMS move
Diagram labels: External Table Shell Object; Control Node [Shell DB]; Compute Node 1 … Compute Node n; Hadoop Data Node 1 … Hadoop Data Node n; Plan Injection; HDFS Readers
31
Parallel Import of HDFS data & Export into HDFS
32
Example
CREATE TABLE ClickStream_PDW WITH (DISTRIBUTION = HASH(url)) AS SELECT url, event_date, user_IP FROM ClickStream
Retrieval of data in HDFS ‘on-the-fly’

Diagram labels:
Enhanced PDW query engine: CTAS, Results, External Table
HDFS bridge: DMS Reader 1 … DMS Reader N
Non-relational data (HDFS data nodes): Social Apps, Sensor & RFID, Mobile Apps, Web Apps
Relational data (PDW V2): Traditional schema-based DW applications
Parallel HDFS Reads, Parallel Importing
33
Example
CREATE EXTERNAL TABLE ClickStream WITH (LOCATION = 'hdfs://MyHadoop:5000/users/outputDir', FORMAT_OPTIONS (FIELD_TERMINATOR = '|')) AS SELECT url, event_date, user_IP FROM ClickStream_PDW
Retrieval of PDW data

Diagram labels:
Enhanced PDW query engine: CETAS, Results, External Table
HDFS bridge: HDFS Writer 1 … HDFS Writer N
Non-relational data (HDFS data nodes): Social Apps, Sensor & RFID, Mobile Apps, Web Apps
Relational data (PDW V2): Traditional schema-based DW applications
Parallel HDFS Writes, Parallel Exporting
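The parallel export pattern can be illustrated with a small Python sketch in which each writer serializes its own slice of rows into a separate file under one output directory. A local directory stands in for the HDFS output directory, threads stand in for the HDFS writers, and the part-file naming is a hypothetical choice; none of this reflects PDW's actual file layout.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_part(output_dir, writer_id, rows, field_terminator="|"):
    """Write one writer's rows as delimited text; returns the file path."""
    path = os.path.join(output_dir, f"part-{writer_id:05d}.txt")   # hypothetical naming
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(field_terminator.join(str(value) for value in row) + "\n")
    return path


def export_parallel(output_dir, rows_per_node, field_terminator="|"):
    """Run one writer per node slice concurrently; returns paths in writer order."""
    with ThreadPoolExecutor(max_workers=max(1, len(rows_per_node))) as pool:
        futures = [pool.submit(write_part, output_dir, writer_id, rows, field_terminator)
                   for writer_id, rows in enumerate(rows_per_node, start=1)]
        return [future.result() for future in futures]


if __name__ == "__main__":
    slices = [[("www.cars.com", "2013-09-03", "192.168.0.1")],
              [("www.microsoft.com", "2013-09-04", "192.168.0.2")]]
    with tempfile.TemporaryDirectory() as out_dir:
        for part in export_parallel(out_dir, slices):
            print(os.path.basename(part))
```

Because every writer owns its own file, no coordination is needed between writers beyond agreeing on the output directory.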
34
Example
CREATE EXTERNAL TABLE ClickStream WITH (LOCATION = 'hdfs://MyHadoop:5000/users/outputDir', FORMAT_OPTIONS (FIELD_TERMINATOR = '|')) AS SELECT url, event_date, user_IP FROM ClickStream_PDW

1. PDW table (can be either distributed or replicated)
2. Output directory in HDFS
35
Example
CREATE EXTERNAL TABLE ClickStream_UserAnalytics WITH (LOCATION = 'hdfs://MyHadoop:5000/users/outputDir', FORMAT_OPTIONS (FIELD_TERMINATOR = '|')) AS SELECT user_name, user_location, event_date, user_IP FROM ClickStream c, User_PDW u where c.user_id = u.user_ID

1. External table referring to data in HDFS
2. PDW data, joined with the incoming data from HDFS
3. New external table created with results of the join
37
Configuration & Prerequisites for enabling Polybase
39
http://msdn.microsoft.com/en-au/
http://www.microsoftvirtualacademy.com/
http://channel9.msdn.com/Events/TechEd/Australia/2013
http://technet.microsoft.com/en-au/