1 PolyBase: Didn't That Go Out in the '70s? Stan Geiger

2 Where in the world are we?
[Diagram: data sources (OLTP, ERP, CRM, LOB) feed ETL into the data warehouse, which feeds BI and analytics: dashboards, reporting.]
"... data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing." - Gartner, "The State of Data Warehousing in 2012"

3 The Data Warehouse of the Future?
Diverse big data; workload-centric approach
Data stored on multiple platforms
Physically distributed data warehouse: data warehouse appliances, columnar RDBMSs, NoSQL databases, MapReduce tools, and HDFS
Big Data is "all data." Multiple big data structures create multiple platform options for storing data, so the data warehouse becomes spread across multiple platforms. This is because no one platform runs query and analysis workloads efficiently across all data. Data is loaded onto a platform based on storage, processing, and budget requirements. While a multi-platform approach adds complexity to the data warehouse environment, BI/DW professionals have always managed complex technology stacks successfully, and end users love the high performance and solid information outcomes they get from workload-tuned platforms.

4 PolyBase
PolyBase is currently used by: SQL Server 2016, Azure SQL Data Warehouse, and the Analytics Platform System (APS).
It uses T-SQL statements to access data stored in HDFS or Azure Blob Storage. PolyBase was initially available in PDW (Parallel Data Warehouse), the Microsoft data appliance.
PolyBase addresses one of the main customer pain points in data warehousing: accessing distributed data sets. As discussed earlier, with increasing volumes of unstructured or semi-structured data, users are storing data sets in more cost-effective distributed and scalable systems, such as Hadoop and cloud environments (for example, Azure Storage). Originally Sqoop was used, but it actually moved the data from the Hadoop cluster into SQL Server for querying. With PolyBase it is possible to integrate data from two completely different file systems, providing the freedom to store the data in either place. People will no longer automatically equate retrieving data from Hadoop with MapReduce: with PolyBase, the SQL knowledge accumulated by millions of people becomes a tool for retrieving valuable information from Hadoop with SQL.

5 PolyBase
Use T-SQL to surface data from Hadoop or Azure as tables in SQL Server; knowledge of Hadoop or Azure is not required.
Pushes computation to where the data resides.
Exports relational data into Hadoop or Azure.
Based on statistics and the corresponding costs, SQL Server decides when to generate map jobs on the fly to be executed within Hadoop. This is transparent to the end user or application.
Ability to create a columnstore table on the fly via T-SQL to leverage SQL Server's columnstore technology.
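The export direction above can be sketched in T-SQL roughly as follows. This is a sketch, not the talk's demo: it assumes an external table (dbo.AgedSales_Ext) with its data source and file format already created, and the table and column names are illustrative.

```sql
-- Export from SQL Server into Hadoop/Azure must be enabled once per instance.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- INSERT into an external table writes files out to the table's LOCATION
-- in Hadoop or Azure blob storage (names below are placeholders).
INSERT INTO dbo.AgedSales_Ext
SELECT SalesKey, CustomerKey, Amount, SaleDate
FROM dbo.FactSales
WHERE SaleDate < '2010-01-01';   -- age cold data out to cheap storage
```

This is the mechanism behind the data-aging scenario: the exported rows remain queryable through the same external table.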

6 PolyBase - External Tables, Data Sources & File Formats
[Diagram: your apps (PowerPivot, Power View) and data scientists, BI users, and DB admins query SQL Server with PolyBase; social apps, sensor/RFID, mobile apps, and web apps feed Hadoop; an external table maps Hadoop data through an external data source and external file format; PolyBase split-based query processing spans the relational DW and Hadoop.]
Utilization
Query data stored in Azure blob storage. Azure blob storage is a convenient place to store data for use by Azure services. PolyBase makes it easy to access the data by using T-SQL.
Integrate with BI tools. Use PolyBase with Microsoft's business intelligence and analysis stack, or with any third-party tool that is compatible with SQL Server.
Import data from Hadoop or Azure blob storage. Leverage the speed of SQL Server's columnstore technology and analysis capabilities by importing data from Hadoop or Azure blob storage into relational tables. There is no need for a separate ETL or import tool.
Export data to Hadoop or Azure blob storage. Archive data to Hadoop or Azure blob storage for cost-effective storage while keeping it online for easy access.
Performance
To improve query performance, and similar to scaling Hadoop out across multiple compute nodes, you can use SQL Server PolyBase scale-out groups. This enables parallel data transfer between SQL Server instances and Hadoop nodes, and it adds compute resources for operating on the external data, allowing you to scale out as compute requires.

7 PolyBase Scenarios
Querying
Run T-SQL over HDFS
Combine data from different Hadoop clusters
Join relational with non-relational data
ETL
Store a subset of Hadoop data in columnar format
Enable data-aging scenarios onto more economical storage
Allows building multi-temperature DW platforms: SQL Server acts as the hot query engine processing the most recent data sets; aged data is immediately accessible via external tables; no need to groom data
Hybrid (Azure Integration)
Mash up on-premises and cloud apps
Bridge between on-premises and Azure
Querying: customer value
Ease of use and improved time to insight
Build the data lake without heavily investing in new resources, i.e. Java and MapReduce experts
Leverage familiar and mature T-SQL scripts and constructs
Seamless tool integration with PolyBase
ETL:
Avoids the need to maintain a separate import or export utility
Allows building multi-temperature DW platforms: PDW/APS acts as the hot query engine processing the most recent/relevant data sets; aged data is immediately accessible via external tables; no need to delete data anymore
Hybrid:
Indefinite storage and compute: Azure as an extension for your on-premises data assets
Cloud transition on your own terms: move only subsets of on-premises data, e.g. non-sensitive data
Leverage new Azure data services: reduced capex and availability of new emerging data services in Azure for on-premises-focused users

8 Requirements
Server:
Microsoft .NET Framework 4.5
Oracle Java SE Runtime Environment (JRE) version 7.51 or higher (64-bit)
Minimum memory: 4 GB
Minimum hard disk space: 2 GB
TCP/IP connectivity must be enabled
External data source:
Hadoop cluster: Hortonworks HDP 1.3 on Linux/Windows Server; Hortonworks HDP 2.0 - 2.3 on Linux/Windows Server; Cloudera CDH 4.3 on Linux; Cloudera CDH 5.1 - 5.5 on Linux
Azure blob storage account

9 Configuration
Install PolyBase (PolyBase Data Movement Service, PolyBase Engine)
Configure SQL Server and enable the option
Configure pushdown (not required)
Scale out (not required)
Create master key and database scoped credential
Create external data source
Create external file format
Create external table
Pushdown forces the query to use MapReduce on the Hadoop cluster to process the query. To improve query performance, enable pushdown computation to a Hadoop cluster:
1. Find the file yarn-site.xml in the installation path of SQL Server. Typically, the path is: C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf\
2. On the Hadoop machine, find the analogous file in the Hadoop configuration directory. In that file, find and copy the value of the configuration key yarn.application.classpath.
3. On the SQL Server machine, in yarn-site.xml, find the yarn.application.classpath property and paste the value from the Hadoop machine into the value element.
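The "enable the option" and credential steps above can be sketched in T-SQL as follows. The connectivity value, password, and credential names here are illustrative assumptions; the value that matches your Hadoop distribution comes from the documented 'hadoop connectivity' mapping, and a service restart is required after changing it.

```sql
-- Enable Hadoop connectivity. The value selects the Hadoop distribution;
-- 7 is an assumed example (check the documented mapping for your cluster).
-- Restart SQL Server and the PolyBase services afterward.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;

-- A database master key is required to protect the credential secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me$trongPassw0rd!';  -- placeholder

-- Credential used when the Hadoop cluster is secured (e.g. Kerberos);
-- the identity and secret below are placeholders.
CREATE DATABASE SCOPED CREDENTIAL HadoopCredential
WITH IDENTITY = 'hadoop_user', SECRET = 'hadoop_password';
```

An unsecured demo cluster, like the one on the next slides, can skip the credential and reference the external data source directly.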

10 Demo

11 PolyBase
Create external data source (Hadoop).
Create external file format (delimited text file).
Create external table pointing to a file stored in Hadoop.

CREATE EXTERNAL DATA SOURCE hdp2 WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.xxx.xx.xxx:xxxx',
    RESOURCE_MANAGER_LOCATION = '10.xxx.xx.xxx:xxxx');

CREATE EXTERNAL FILE FORMAT ff2 WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));

CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
    [SensorKey] int NOT NULL,
    [CustomerKey] int NOT NULL,
    [GeographyKey] int NULL,
    [Speed] float NOT NULL,
    [YearMeasured] int NOT NULL)
WITH (
    LOCATION = '/Demo/car_sensordata.tbl',
    DATA_SOURCE = hdp2,
    FILE_FORMAT = ff2,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0);

Create External Data Source
TYPE = [ HADOOP | SHARD_MAP_MANAGER | RDBMS | BLOB_STORAGE ]
For HADOOP, LOCATION specifies the Uniform Resource Indicator (URI) of a Hadoop cluster. For Azure blob storage with Hadoop, it specifies the URI for connecting to Azure blob storage via the fully qualified domain name (FQDN) of the Azure storage account.
For RDBMS, it specifies the logical server name of the remote database in Azure SQL Database.
For BLOB_STORAGE (bulk operations only), LOCATION must be a valid URL to an Azure Blob storage account and container.
Specify the RESOURCE_MANAGER_LOCATION option to enable pushdown computation to Hadoop for PolyBase queries.
Create External File Format
File formats supported: delimited text, Hive RCFile, Hive ORC, Parquet.
FIELD_TERMINATOR = field_terminator: Applies only to delimited text files. Specifies one or more characters that mark the end of each field (column) in the text-delimited file. The default is the pipe character '|'. For guaranteed support, we recommend using one or more ASCII characters.
STRING_DELIMITER = string_delimiter: Specifies the delimiter for data of type string in the text-delimited file. The string delimiter is one or more characters in length and is enclosed in single quotes. The default is the empty string "".
DATE_FORMAT = datetime_format: Specifies a custom format for all date and time data that might appear in a delimited text file.
USE_TYPE_DEFAULT = { TRUE | FALSE }: Specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file. TRUE: when retrieving data from the text file, store each missing value using the default value for the data type of the corresponding column in the external table definition. For example, replace a missing value with 0 if the column is defined as a numeric column, or the empty string "" if the column is a string column. FALSE: store all missing values as NULL. Any NULL values that are stored by using the word NULL in the delimited text file will be imported as the string 'NULL'.

12 PolyBase - Ad-Hoc Query joining relational with Hadoop data
Who drives faster than 35 mph? Joining structured customer data stored in SQL Server with sensor data stored in Hadoop:

SELECT DISTINCT Insured_Customers.FirstName, Insured_Customers.LastName,
       Insured_Customers.YearlyIncome, Insured_Customers.MaritalStatus
INTO Fast_Customers
FROM Insured_Customers
INNER JOIN (
    SELECT * FROM CarSensor_Data WHERE Speed > 35
) AS SensorD
    ON Insured_Customers.CustomerKey = SensorD.CustomerKey
ORDER BY YearlyIncome;

CREATE CLUSTERED COLUMNSTORE INDEX CCI_FastCustomers ON Fast_Customers;

Insured_Customers is in SQL Server. Fast_Customers will be a table stored in SQL Server. CarSensor_Data is located in Hadoop. CREATE CLUSTERED COLUMNSTORE INDEX makes Fast_Customers a columnstore table.

13 Predicate Pushdown
Use predicate pushdown to improve performance for a query that selects a subset of rows and columns from an external table. Pushed-down predicates are limited to the least compute-intense comparisons Hadoop can do.
Expressions and operators eligible for predicate pushdown:
Binary comparison operators ( <, >, =, !=, <>, >=, <= ) for numeric, date, and time values
Arithmetic operators ( +, -, *, /, % )
Logical operators (AND, OR)
Unary operators (NOT, IS NULL, IS NOT NULL)
Who knows what pushdown is? Pushdown is when SQL Server forces a MapReduce job onto the Hadoop cluster. In the documentation's example, SQL Server initiates a MapReduce job to pre-process the Hadoop delimited-text file so that only the data for the two columns, customer.name and customer.zip_code, is copied to SQL Server.

SELECT player, year, AtBat, strikeouts, walks, hits,
       doubles, triples, homeruns
FROM [dbo].[BattingAverages]
WHERE year > 1922;
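Pushdown can also be requested or suppressed per query with a hint, independent of the optimizer's cost estimate. A sketch against the slide's external table (column names as shown above; this assumes RESOURCE_MANAGER_LOCATION was set on the data source):

```sql
-- Force a MapReduce job on the Hadoop cluster for this query,
-- filtering rows on the cluster before they reach SQL Server.
SELECT player, homeruns
FROM [dbo].[BattingAverages]
WHERE year > 1922
OPTION (FORCE EXTERNALPUSHDOWN);

-- The opposite hint, OPTION (DISABLE EXTERNALPUSHDOWN), streams the
-- raw file to SQL Server and applies the predicate there.
```

Comparing runtimes with and without the hint is a quick way to see whether pushdown helps a given predicate.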

14 Scale Out
A standalone SQL Server instance with PolyBase can become a performance bottleneck when dealing with massive data sets in Hadoop or Azure Blob Storage. The PolyBase group feature allows you to create a cluster of SQL Server instances that process large data sets from external data sources, such as Hadoop or Azure Blob Storage, in a scale-out fashion for better query performance.
Head node: contains the SQL Server instance to which PolyBase queries are submitted. Each PolyBase group can have only one head node. A head node is a logical group of the SQL Server Database Engine, the PolyBase Engine, and the PolyBase Data Movement Service on the SQL Server instance.
Compute node: contains a SQL Server instance that assists with scale-out query processing on external data. A compute node is a logical group of SQL Server and the PolyBase Data Movement Service on the instance. A PolyBase group can have multiple compute nodes.
PolyBase queries are submitted to the SQL Server instance on the head node. The part of the query that refers to external tables is handed off to the PolyBase Engine, which parses the query on external data, generates the query plan, and distributes the work to the Data Movement Service on the compute nodes for execution. After the work completes, it receives the results from the compute nodes and submits them to SQL Server for processing and return to the client.
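Joining an instance to a scale-out group as a compute node can be sketched with the sp_polybase_join_group procedure. The machine and instance names below are placeholders; 16450 is the default Data Movement Service control channel port.

```sql
-- Run on each compute node, then restart the PolyBase Engine and
-- Data Movement services on that node.
-- Arguments: head node machine name, DMS control channel port,
--            head node SQL Server instance name (all placeholders here).
EXEC sp_polybase_join_group 'HEADNODE01', 16450, 'MSSQLSERVER';
```

A node can later be removed from the group with sp_polybase_leave_group.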

15 Query Troubleshooting
sys.dm_exec_distributed_requests
sys.dm_exec_distributed_request_steps
sys.dm_exec_distributed_sql_requests
sys.dm_exec_external_work
sys.dm_exec_distributed_requests: contains information about all current and recent PolyBase queries. A query involving regular SQL tables and external tables is decomposed into various statements/requests executed across the compute nodes.
sys.dm_exec_distributed_request_steps: holds information about all steps that compose a given PolyBase request or query.
sys.dm_exec_distributed_sql_requests: holds information about all SQL query distributions that are part of a SQL step in the query.
sys.dm_exec_external_work: returns information about the workload per worker on each compute node; identifies the work spun up to communicate with the external data source (e.g. Hadoop or an external SQL Server).
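A typical drill-down with these DMVs might look like the following sketch. The execution_id value 'QID1234' is a placeholder; real IDs come from the first query.

```sql
-- Find recent PolyBase queries and their overall status.
SELECT execution_id, status, total_elapsed_time, start_time
FROM sys.dm_exec_distributed_requests
ORDER BY start_time DESC;

-- Drill into the steps of one request to locate the slow or failed step.
SELECT step_index, operation_type, status, total_elapsed_time
FROM sys.dm_exec_distributed_request_steps
WHERE execution_id = 'QID1234'   -- placeholder ID taken from the query above
ORDER BY step_index;
```

The step with the largest elapsed time (or a failed status) points at where the distributed plan is spending its time, e.g. the Hadoop-side work versus the data movement into SQL Server.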

16 Demo

17
