Relational and Non-Relational Data Living in Peace and Harmony

Presentation transcript:

Relational and Non-Relational Data Living in Peace and Harmony: Polybase in SQL Server PDW 2012

Please silence cell phones

Agenda
Motivation – Why Polybase at all?
Concept of External Tables
Querying non-relational data in HDFS
Parallel data import from HDFS & data export into HDFS
Prerequisites & configuration settings
Summary

Motivation – PDW & Hadoop Integration

SQL Server PDW Appliance
A shared-nothing, parallel DBMS: a scalable, standards-based, pre-packaged appliance solution.

Query Processing in SQL PDW (in a nutshell)
User data resides on the compute nodes (distributed or replicated); the control node holds only metadata in a shell database.
The SQL Server instance on the control node is leveraged as a query-processing aid: the 'optimizable query' is compiled against the shell DB via plan injection, and the resulting DSQL plan is distributed to the compute nodes.
A DSQL plan may include a DMS plan for moving data (e.g. for join-incompatible queries).
[Diagram: plan injection on the control node (shell DB); DSQL plan with DMS operations (e.g. SELECT) fanned out to compute nodes 1..n.]
Note that the search space is the set of all execution plans for an input query; the serial plan produced by the SQL Server instance on the control node is not sufficient/optimal on its own, because it has no knowledge of how the data is distributed – for example, the join order may differ when the data sets are co-located.

New World of Big Data
New, emerging applications – social apps, sensors & RFID, mobile apps, web apps – are generating massive amounts of non-relational data alongside the relational data of traditional schema-based DW applications.
This poses new challenges: advanced data analysis techniques are required to integrate relational with non-relational data. How do we overcome the 'impedance mismatch' between the RDBMS and Hadoop?

Project Polybase
Background: a close collaboration between Microsoft's Jim Gray Systems Lab, led by database pioneer David DeWitt, and the PDW engineering group.
High-level goals for V2:
Seamless querying of non-relational data in Hadoop via regular T-SQL
Enhancing the PDW query engine to process data coming from Hadoop
Parallelized data import from Hadoop & data export into Hadoop
Support for various Hadoop distributions – HDP 1.x on Windows Server, Hortonworks' HDP 1.x on Linux, and Cloudera's CDH 4.0

Concept of External Tables

Polybase – Enhancing the PDW query engine
Data scientists, BI users, and DB admins issue regular T-SQL against PDW V2 and get results back.
[Diagram: the enhanced PDW query engine spans relational data (traditional schema-based DW applications) and non-relational data in Hadoop (social apps, sensor & RFID, mobile apps, web apps); external tables are the bridge between the two.]

External Tables
An internal representation of data residing in Hadoop/HDFS; only delimited text files are supported.
High-level permissions are required for creating external tables: ADMINISTER BULK OPERATIONS & ALTER SCHEMA.
External tables differ from 'regular' SQL tables (e.g. no DML support).
New T-SQL syntax:
CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ])
{WITH (LOCATION = '<URI>', [FORMAT_OPTIONS = (<VALUES>)])}
[;]
1. EXTERNAL indicates an external table; 2. LOCATION is the required location of the Hadoop cluster and file; 3. FORMAT_OPTIONS are the optional format options associated with data import from HDFS.

Format Options
<Format Options> ::=
[,FIELD_TERMINATOR = 'Value']
[,STRING_DELIMITER = 'Value']
[,DATE_FORMAT = 'Value']
[,REJECT_TYPE = 'Value']
[,REJECT_VALUE = 'Value']
[,REJECT_SAMPLE_VALUE = 'Value']
[,USE_TYPE_DEFAULT = 'Value']
FIELD_TERMINATOR indicates the column delimiter.
STRING_DELIMITER specifies the delimiter for string data type fields.
DATE_FORMAT specifies a particular date format.
REJECT_TYPE specifies the type of rejection, either value or percentage.
REJECT_VALUE specifies the value/threshold for rejected rows.
REJECT_SAMPLE_VALUE specifies the sample set – for reject type percentage.
USE_TYPE_DEFAULT specifies how missing entries in text files are treated.
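For illustration only, a hedged sketch of an external table definition that combines several of these options, following the grammar above (the Hadoop host, path, column list, and thresholds are hypothetical, and the exact literal forms may vary by appliance version):

CREATE EXTERNAL TABLE WebLog (event_date date, url varchar(100), user_IP varchar(50))
WITH (LOCATION = 'hdfs://MyHadoop:5000/logs/weblog.tbl',
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|',
                      DATE_FORMAT = 'yyyy-MM-dd',
                      REJECT_TYPE = 'VALUE',
                      REJECT_VALUE = '100'));
-- rows that cannot be parsed are rejected; once more than 100 rows are rejected, the query fails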

HDFS Bridge
Direct and parallelized HDFS access: PDW's Data Movement Service (DMS) is enhanced to allow direct communication between HDFS data nodes and PDW compute nodes.
[Diagram: regular T-SQL against external tables in the enhanced PDW query engine; the HDFS bridge in PDW V2 connects the compute nodes directly to the HDFS data nodes that hold the non-relational data (social apps, sensor & RFID, mobile apps, web apps), alongside the relational data of traditional schema-based DW applications.]

Underneath External Tables – the HDFS bridge
At 'design time' (CREATE EXTERNAL TABLE), the statement and its format options are parsed as part of the regular T-SQL parsing process.
Statistics are generated by estimation: row length and number of rows are estimated (file binding), and the number of blocks needed per compute node is calculated (split generation).
File binding and split generation are handled by the HDFS bridge process, which runs as part of the DMS process and consults the Hadoop Name Node, where the metadata (file location, file size, ...) is maintained.
The result is a tabular view on a file such as hdfs://../employee.tbl.
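As an illustrative back-of-the-envelope example of split generation (the numbers are assumptions, not from the slides): a 1 GB file stored in 128 MB HDFS blocks consists of 8 blocks, so on an appliance with 4 compute nodes the bridge would assign roughly 2 splits per node, letting all nodes read from HDFS in parallel.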

Summary – External Tables in the PDW Query Lifecycle
CREATE EXTERNAL TABLE is a shell-only execution: the plan is SHELL-only and no physical tables are created on the compute nodes.
The control node obtains an external table shell object – a shell table like any other, carrying the statistics information and the format options.
[Diagram: CREATE EXTERNAL TABLE runs against the control node (shell DB), which consults the Hadoop Name Node; compute nodes 1..n receive no physical tables.]

Querying non-relational data in HDFS via T-SQL

Querying non-relational data via T-SQL
Query data in HDFS and display the results in table form (via external tables), and join data from HDFS with relational PDW data.
Running example – creating the external table ClickStream over a text file in HDFS that uses | as the field delimiter:
CREATE EXTERNAL TABLE ClickStream (url varchar(50), event_date date, user_IP varchar(50))
WITH (LOCATION = 'hdfs://MyHadoop:5000/tpch1GB/employee.tbl', FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));
Query examples:
1. Filter query against data in HDFS:
SELECT top 10 (url) FROM ClickStream WHERE user_IP = '192.168.0.1'
2. Join data from various files in HDFS (Url_Descr is a second text file, exposed as a second external table):
SELECT url.description FROM ClickStream cs, Url_Descr url WHERE cs.url = url.name AND cs.url = 'www.cars.com';
3. Join data from HDFS with data in PDW (User is a distributed PDW table):
SELECT user_name FROM ClickStream cs, User u WHERE cs.user_IP = u.user_IP AND cs.url = 'www.microsoft.com';
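One more hedged sketch, not from the original slides: the same external table can feed aggregations over joined PDW data. The User table and its columns are carried over from example 3 (bracketed here because USER is a reserved word in T-SQL):

-- click count per user, joining HDFS click data with the PDW User table
SELECT u.user_name, COUNT(*) AS clicks
FROM ClickStream cs
JOIN [User] u ON cs.user_IP = u.user_IP
GROUP BY u.user_name;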

Querying non-relational data – the HDFS bridge
At query time, the data is imported (moved) 'on the fly' via parallel HDFS readers.
The schema is validated against the stored external table shell objects.
The data 'lands' in temporary tables (Q-tables) for processing and is removed after the results are returned to the client.
[Diagram: a SELECT against an external table triggers parallel HDFS reads – DMS readers 1..N pull data from the HDFS data nodes through the HDFS bridge into the enhanced PDW query engine, where it is combined with relational data.]

Summary – Querying External Tables
A SELECT from an external table is injected as a plan on the control node (shell DB) against the external table shell object; the resulting DSQL plan contains an external DMS move, with HDFS readers on compute nodes 1..n pulling data in parallel from Hadoop data nodes 1..n.

Parallel Import of HDFS data & Export into HDFS

CTAS – Parallel data import from HDFS into PDW V2
Fully parallelized via CREATE TABLE AS SELECT (CTAS), with an external table as the source and a PDW table (either distributed or replicated) as the destination; the data in HDFS is retrieved 'on the fly'.
Example:
CREATE TABLE ClickStream_PDW
WITH (DISTRIBUTION = HASH(url))
AS SELECT url, event_date, user_IP FROM ClickStream
[Diagram: CTAS triggers parallel HDFS reads – DMS readers 1..N import the non-relational data from the HDFS data nodes into PDW V2.]
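A hedged variant not shown on the slide: the destination can also be a replicated table, which may suit small reference data sets (the replicated-table form WITH (DISTRIBUTION = REPLICATE) is assumed here; table and column names follow the ClickStream example above):

CREATE TABLE ClickStream_Ref_PDW
WITH (DISTRIBUTION = REPLICATE)
AS SELECT url, event_date, user_IP FROM ClickStream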

CETAS – Parallel data export from PDW into HDFS
Fully parallelized via CREATE EXTERNAL TABLE AS SELECT (CETAS), with an external table as the destination and a PDW table as the source; the PDW data is retrieved and written out to HDFS in parallel.
Example:
CREATE EXTERNAL TABLE ClickStream
WITH (LOCATION = 'hdfs://MyHadoop:5000/users/outputDir', FORMAT_OPTIONS (FIELD_TERMINATOR = '|'))
AS SELECT url, event_date, user_IP FROM ClickStream_PDW
[Diagram: CETAS triggers parallel HDFS writes – HDFS writers 1..N export the relational PDW data through the HDFS bridge to the HDFS data nodes.]

Functional Behavior – Export (CETAS)
For exporting relational PDW data into HDFS; the source PDW table can be either distributed or replicated, and the output folder/directory in HDFS may or may not already exist.
A fast-fail mechanism checks permissions up front by creating an empty file.
On failure, files within the directory are cleaned up, i.e. any files created in HDFS during the CETAS ('one-time best effort').
Created files follow a unique naming convention: {QueryID}_{YearMonthDay}_{HourMinutesSeconds}_{FileIndex}.txt
Example:
CREATE EXTERNAL TABLE ClickStream
WITH (LOCATION = 'hdfs://MyHadoop:5000/users/outputDir', FORMAT_OPTIONS (FIELD_TERMINATOR = '|'))
AS SELECT url, event_date, user_IP FROM ClickStream_PDW
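To illustrate that naming convention with hypothetical values (not from the slides): a CETAS executed as query QID1234 on April 12, 2013 at 15:30:42 would write files such as QID1234_20130412_153042_0.txt and QID1234_20130412_153042_1.txt into the output directory, one file index per parallel writer.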

Round-Tripping via CETAS
The export functionality can be leveraged to round-trip data coming from Hadoop: 1. data from HDFS is imported in parallel via an external table referring to data in HDFS, 2. the incoming HDFS data is joined with data in PDW, and 3. a new external table with the results of the join is exported back into Hadoop/HDFS in parallel.
Example:
CREATE EXTERNAL TABLE ClickStream_UserAnalytics
WITH (LOCATION = 'hdfs://MyHadoop:5000/users/outputDir', FORMAT_OPTIONS (FIELD_TERMINATOR = '|'))
AS SELECT user_name, user_location, event_date, user_IP
FROM ClickStream c, User_PDW u WHERE c.user_id = u.user_ID

Configuration & Prerequisites for enabling Polybase

Enabling Polybase functionality
1. Prerequisite – Java Runtime Environment: download and install Oracle's JRE 1.6.x (the latest update version is strongly recommended). A new setup action/installation routine installs the JRE [setup.exe /action=InstallJre].
2. Enable Polybase via sp_configure & RECONFIGURE. The new attribute/parameter 'hadoop connectivity' takes four configuration values {0; 1; 2; 3}:
exec sp_configure 'hadoop connectivity', 1 -> connectivity to HDP 1.1 on Windows Server
exec sp_configure 'hadoop connectivity', 2 -> connectivity to HDP 1.1 on Linux
exec sp_configure 'hadoop connectivity', 3 -> connectivity to CDH 4.0 on Linux
exec sp_configure 'hadoop connectivity', 0 -> disables Polybase (default)
3. Execute RECONFIGURE and restart the engine service, aligning with SQL Server SMP behavior for persisting system-wide configuration changes.
Why a separate enabling step? Setup/installation is deliberately decoupled from enabling the actual functionality; for legal reasons the appliance cannot be handed to the customer fully prepped (that would violate the appliance model). Underneath, the three supported Hadoop configurations/distributions map to the different jar files/Hadoop clients needed to connect to Hadoop (Hadoop 1.0 vs. Hadoop 2.0); a homogeneous environment is required, since there are breaking changes between versions and the HDFS bridge is not transparent across them.
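Putting these steps together, a hedged sketch of the full enable sequence (the option name and values follow the slide; exact casing and the service-restart procedure may differ on a given appliance build):

-- prerequisite: install the JRE via setup.exe /action=InstallJre
exec sp_configure 'hadoop connectivity', 1;  -- HDP 1.1 on Windows Server
RECONFIGURE;
-- finally, restart the PDW engine service so the setting takes effect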

Summary

Polybase features in SQL Server PDW 2012
1. Introducing the concept of external tables and full SQL query access to data in HDFS
2. Introducing the HDFS bridge for direct & fully parallelized access to data in HDFS
3. Joining PDW data 'on the fly' with data from HDFS
4. Basic/minimal statistics support for data coming from HDFS
5. Parallel import of data from HDFS into PDW tables for persistent storage (CTAS)
6. Parallel export of PDW data into HDFS, including 'round-tripping' of data (CETAS)
7. Support for various Hadoop distributions

Related PASS Sessions & References
Polybase – SQL Server website: http://www.microsoft.com/en-us/sqlserver/solutions-technologies/data-warehousing/polybase.aspx
PDW Architecture Gets Real: Customer Implementations [SA-300-M] – Friday, April 12, 10am-11am. Speakers: Murshed Zaman and Brian Walker @ Sheraton 3
Online Advertising: Hybrid Approach to Large-Scale Data Analysis [DAV-303-M] – Friday, April 12, 2:45pm-3:45pm. Speakers: Dmitri Tchikatilov, Anna Skobodzinski, Trevor Attridge, Christian Bonilla @ Sheraton 3

Win a Microsoft Surface Pro! Complete an online SESSION EVALUATION to be entered into the draw. The draw closes April 12, 11:59pm CT. Winners will be announced on the PASS BA Conference website and on Twitter. Go to passbaconference.com/evals or follow the QR code link displayed on session signage throughout the conference venue. Your feedback is important and valuable; all feedback will be used to improve and select sessions for future events.

Thank you! [Diamond and Platinum sponsor logos]