1
SQL Server PolyBase and Dell EMC Isilon storage
Smith, Matt F | SQL Server PolyBase and Dell EMC Isilon storage
2
A little bit about me.
I work for the Dell EMC Big Data/IoT Solution Engineering Team as a Solutions Architect (consulting & delivery).
Started with SQL Server 7 back in the dot-com days doing application development: ASP 3.0, VB6 and SQL Server.
Lots of work with SQL Server over the years: Dev, DBA, BI.
Moved into the Big Data space a few years ago: Pivotal, Hortonworks, Cloudera.
3
Why I chose this topic.
Personal interest in creating a performant entry-level Big Data solution for small to mid-sized Hadoop File System data sets (1-5TB).
Reduce the common barriers to entry for Big Data: training and experience, financial investment, infrastructure.
Introduce Big Data into organizations through known & trusted products, leveraging existing skill sets.
Realize value in helping DBAs, developers and organizations start their big data journeys through low-risk pilot projects.
4
SQL Server PolyBase – a quick overview
Introduced as a component of SQL Server 2012 Parallel Data Warehouse (PDW), now known as Analytics Platform System (APS).
Added to the SQL Server product family in 2016.
Allows you to access semi-structured data (schema on read) located in HDFS-compliant file storage through SQL Server.
Connects to Dell EMC Isilon, Azure Blob Storage, or Hadoop (for example, Cloudera).
5
What is Isilon? Scale-out, Multi-protocol Network Attached Storage
Write with NFS, SMB3, FTP or HTTP and read immediately with another protocol.
7000+ customers; scales from 15TB to 68PB.
OneFS file system: the Isilon OS.
Each node includes processors, RAM, network and disk. Compute on every node.
Network: 10GbE, 40GbE.
A wide selection of disk choices for nodes allows tiering of hot data (flash), hybrid and archive storage.
Automated data movement across storage tiers.
Active Directory & Kerberos integration.
6
Integrated Isilon and PolyBase
[Diagram labels: Web click data, Decision Support Databases, OLAP, EDW; SMB, NFS, HTTP, FTP, HDFS; Isilon (name node, data node); PolyBase Cluster (Head, Compute)]
Step 1: Much or all of the data lives on the Isilon/Hadoop cluster.
Step 2: Jobs are run.
Getting data into Isilon in particular is very easy, with a wide variety of protocol support. So no matter what data source you’re contemplating, you can be assured that populating your Hadoop instance is going to be as simple as it possibly can be.
7
POC Architecture – MS SQL with Isilon HDFS
[Diagram labels: Clickstream, DSS, Sensor, OLAP, EDW; MS SQL 2016 Enterprise Edition PolyBase Scale-Out Group]
Isilon serves as HDFS and NameNodes.
Enables Microsoft T-SQL queries in the Hadoop environment.
Parallel operation on Hadoop and MS SQL in the same database.
Very efficient methods (SMB, FTP, NFS, HDFS) for data import by Dell EMC Isilon.
MS SQL integration:
1. Direct queries from SQL to Isilon HDFS
2. External pushdown
Multiple Hadoop applications can access the same dataset (Isilon) at the same time.
Integrating these with SQL PolyBase is very simple. You start with a Hadoop instance and populate it with data from whatever sources you’d like, whether it’s clickstream data, other OLTP sources, sensor data and so forth. From the perspective of PolyBase, the EMC-enhanced Hadoop instance is simply another Hadoop instance. You just start using it.
8
Configuring Isilon for PolyBase POC
Steps:
1. Create an Access Zone
2. Configure HDFS Service
3. Configure Network
4. Create pdw_user
5. Add Active Directory NS record & test
9
Configuring Isilon – POC
Verify Licenses
Create an Access Zone
10
Configuring Isilon – POC
Configure HDFS
11
Configuring Isilon – POC
Configure SmartConnect
12
Configuring Isilon – POC
Create pdw_user
13
Configuring Isilon – POC
Add DNS entry, test.
14
Configuring Isilon - Resources
EMC Isilon Best Practices Guide for Hadoop Data Storage - Dell EMC
OneFS with HDFS Reference Guide - Dell EMC
15
Install & Enable PolyBase on SQL Server
Install the Oracle Java SE Runtime Environment (JRE) 7.51 (x64) or 8. Do not install JRE 9!
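Once the JRE and the PolyBase feature are installed, the "enable" half of this slide comes down to a short T-SQL batch. The sketch below is an illustration, not the presenter's script: a small Python helper that prints that batch. `SERVERPROPERTY('IsPolyBaseInstalled')`, `sp_configure 'hadoop connectivity'` and `RECONFIGURE` are standard SQL Server features; the helper name `enable_polybase_sql` is hypothetical, and the value 7 is only a placeholder, since the slide does not say which connectivity value was used for Isilon.

```python
def enable_polybase_sql(connectivity_value: int) -> str:
    """Return a T-SQL batch that checks for PolyBase and sets 'hadoop connectivity'."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(connectivity_value, bool) or not isinstance(connectivity_value, int):
        raise ValueError("connectivity_value must be an integer")
    if connectivity_value < 0:
        raise ValueError("connectivity_value must not be negative")
    return "\n".join([
        "-- Returns 1 when the PolyBase feature is installed on this instance",
        "SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;",
        "-- ASSUMPTION: choose the value that matches your Hadoop-compatible target",
        "EXEC sp_configure @configname = 'hadoop connectivity', "
        f"@configvalue = {connectivity_value};",
        "RECONFIGURE;",
    ])


if __name__ == "__main__":
    # 7 is a placeholder value, not one taken from the presentation.
    print(enable_polybase_sql(7))
```

After changing this setting, plan on restarting the SQL Server and PolyBase services before testing a connection.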
16
PolyBase Data Sources
Data sources include Isilon, Azure, Hortonworks and Cloudera.
Add multiple data sources (pending support).
Hortonworks HDP 1.3, 2.0, 2.1, 2.2
Cloudera CDH 4.3, 5.1
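As a rough illustration of what defining one of these data sources looks like, the sketch below builds a `CREATE EXTERNAL DATA SOURCE` statement of type `HADOOP`. The data source name `IsilonHdfs`, the host `isilon-hdfs.example.com` and the helper name are hypothetical; port 8020 is the usual HDFS NameNode RPC port and is assumed here, not taken from the slides.

```python
def external_data_source_sql(name: str, namenode_host: str, port: int = 8020) -> str:
    """Build a CREATE EXTERNAL DATA SOURCE statement for an HDFS endpoint."""
    # Bracket-quote the identifier and escape any closing bracket.
    ident = "[" + name.replace("]", "]]") + "]"
    # Single-quote the location literal and escape embedded quotes.
    location = f"hdfs://{namenode_host}:{port}".replace("'", "''")
    return (
        f"CREATE EXTERNAL DATA SOURCE {ident}\n"
        f"WITH (TYPE = HADOOP, LOCATION = '{location}');"
    )


if __name__ == "__main__":
    # Hypothetical name and host; with Isilon this would typically be the
    # SmartConnect zone name that fronts the HDFS service.
    print(external_data_source_sql("IsilonHdfs", "isilon-hdfs.example.com"))
```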
17
PolyBase External File Format
File formats are required.
You create and define your own file formats.
Supported file formats include:
Text (delimited)
Hive ORC - Optimized Row Columnar
Hive RCFile - Record Columnar (key-value)
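The three formats listed above map onto the `FORMAT_TYPE` values `DELIMITEDTEXT`, `ORC` and `RCFILE` in `CREATE EXTERNAL FILE FORMAT`. The sketch below generates one statement per format. The format names and the pipe delimiter are hypothetical choices, and the SerDe shown for RCFile is one of the Hive SerDe classes PolyBase accepts, picked here only for illustration.

```python
# One of the Hive SerDe classes usable with RCFILE (chosen for illustration).
RCFILE_SERDE = "org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe"


def external_file_format_sql(name: str, format_type: str, field_terminator: str = "|") -> str:
    """Build a CREATE EXTERNAL FILE FORMAT statement for one supported format."""
    ident = "[" + name.replace("]", "]]") + "]"
    kind = format_type.upper()
    if kind == "DELIMITEDTEXT":
        terminator = field_terminator.replace("'", "''")
        options = (
            "FORMAT_TYPE = DELIMITEDTEXT, "
            f"FORMAT_OPTIONS (FIELD_TERMINATOR = '{terminator}')"
        )
    elif kind == "ORC":
        options = "FORMAT_TYPE = ORC"
    elif kind == "RCFILE":
        # RCFILE needs an explicit SerDe method.
        options = f"FORMAT_TYPE = RCFILE, SERDE_METHOD = '{RCFILE_SERDE}'"
    else:
        raise ValueError(f"unsupported format type: {format_type}")
    return f"CREATE EXTERNAL FILE FORMAT {ident}\nWITH ({options});"


if __name__ == "__main__":
    # Hypothetical format names.
    print(external_file_format_sql("PipeDelimitedText", "DELIMITEDTEXT"))
    print(external_file_format_sql("OrcFormat", "ORC"))
    print(external_file_format_sql("RcFileFormat", "RCFILE"))
```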
18
PolyBase External Tables
External Tables are schema on read.
You define external table columns.
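To make "schema on read" concrete: the column list lives only in SQL Server, while the files stay untouched in HDFS. The sketch below builds a `CREATE EXTERNAL TABLE` statement from a column list. The table name, columns, HDFS path, data source and file format names are all hypothetical, and the data source and file format are assumed to exist already.

```python
from typing import Sequence, Tuple


def external_table_sql(
    table: str,
    columns: Sequence[Tuple[str, str]],
    location: str,
    data_source: str,
    file_format: str,
    schema: str = "dbo",
) -> str:
    """Build a CREATE EXTERNAL TABLE statement over files in HDFS."""
    if not columns:
        raise ValueError("at least one column is required")

    def ident(name: str) -> str:
        return "[" + name.replace("]", "]]") + "]"

    # Column types are passed through verbatim, so they must be valid T-SQL types.
    column_lines = ",\n".join(f"    {ident(col)} {sql_type}" for col, sql_type in columns)
    hdfs_path = location.replace("'", "''")
    return (
        f"CREATE EXTERNAL TABLE {ident(schema)}.{ident(table)} (\n"
        f"{column_lines}\n"
        ")\n"
        "WITH (\n"
        f"    LOCATION = '{hdfs_path}',\n"
        f"    DATA_SOURCE = {ident(data_source)},\n"
        f"    FILE_FORMAT = {ident(file_format)}\n"
        ");"
    )


if __name__ == "__main__":
    # Hypothetical clickstream layout; adjust to the real file contents.
    print(external_table_sql(
        "ClickStream_Ext",
        [("event_time", "DATETIME2"), ("url", "VARCHAR(400)"), ("user_ip", "VARCHAR(45)")],
        "/clickstream/",
        "IsilonHdfs",
        "PipeDelimitedText",
    ))
```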
19
Table Statistics
Add statistics to optimize query performance
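Statistics on an external table are created with the ordinary `CREATE STATISTICS` statement. A minimal sketch follows, assuming a hypothetical external table `dbo.ClickStream_Ext` with an `event_time` column; the helper name and statistics name are also made up for the example.

```python
from typing import Sequence


def create_statistics_sql(
    stat_name: str,
    table: str,
    columns: Sequence[str],
    full_scan: bool = True,
    schema: str = "dbo",
) -> str:
    """Build a CREATE STATISTICS statement for one or more columns."""
    if not columns:
        raise ValueError("at least one column is required")

    def ident(name: str) -> str:
        return "[" + name.replace("]", "]]") + "]"

    column_list = ", ".join(ident(col) for col in columns)
    sql = f"CREATE STATISTICS {ident(stat_name)} ON {ident(schema)}.{ident(table)} ({column_list})"
    if full_scan:
        # FULLSCAN reads every row; on a large external table this can take a while.
        sql += " WITH FULLSCAN"
    return sql + ";"


if __name__ == "__main__":
    # Hypothetical table and column names.
    print(create_statistics_sql("st_ClickStream_event_time", "ClickStream_Ext", ["event_time"]))
```

Columns used in joins and filters are the usual candidates for statistics.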
20
Query Testing...
Used Adam Machanic’s sp_WhoIsActive so I could get a better idea of what was going on.
Observations:
Lots of temp tables!
Some data moving back and forth
Things seemed to take a long time
Data came back!
21
Insert into / Select from worked best.
Insert into / Select from to load the data you want to work with locally.
Create indexes (columnstore or other) and then join to local tables for best performance.
Joining from local tables to external tables just didn’t work well.
Nightly ETL processes that move relevant data from HDFS external tables into SQL Server make sense.
Opportunities for improvement:
Add additional scale-out compute nodes (I plan to test SQL Express!); also containers & PowerShell provisioning.
TempDB on flash or SSD.
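The load-then-index pattern described on this slide can be sketched as a two-statement batch: copy the rows of interest from the external table into a local table, then build a columnstore index on the local copy. The table names, columns and filter below are hypothetical, and the local table is assumed to exist already as a heap with matching columns.

```python
from typing import Optional, Sequence


def load_local_copy_sql(
    external_table: str,
    local_table: str,
    columns: Sequence[str],
    where: Optional[str] = None,
    schema: str = "dbo",
) -> str:
    """Build INSERT INTO ... SELECT plus a clustered columnstore index on the local table."""
    if not columns:
        raise ValueError("at least one column is required")

    def ident(name: str) -> str:
        return "[" + name.replace("]", "]]") + "]"

    column_list = ", ".join(ident(col) for col in columns)
    lines = [
        f"INSERT INTO {ident(schema)}.{ident(local_table)} ({column_list})",
        f"SELECT {column_list}",
        f"FROM {ident(schema)}.{ident(external_table)}",
    ]
    if where:
        # The predicate is passed through verbatim; only use trusted input here.
        lines.append(f"WHERE {where}")
    lines[-1] += ";"
    lines.append(
        f"CREATE CLUSTERED COLUMNSTORE INDEX {ident('cci_' + local_table)} "
        f"ON {ident(schema)}.{ident(local_table)};"
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical nightly load of one month of clickstream rows.
    print(load_local_copy_sql(
        "ClickStream_Ext",
        "ClickStream_Local",
        ["event_time", "url"],
        where="[event_time] >= '2017-01-01' AND [event_time] < '2017-02-01'",
    ))
```

Scheduling a batch like this nightly (for example from SQL Server Agent) is one way to implement the ETL approach the slide recommends.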
22
DMVs for troubleshooting and analysis
Many DMVs exist. The link at the bottom of this slide includes a great process on how to use them.
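As a starting point (not the process from the linked article, which is not reproduced here), two PolyBase DMVs are commonly used together: `sys.dm_exec_distributed_requests` to find the slowest distributed queries, and `sys.dm_exec_distributed_request_steps` to see where one of them spent its time. The sketch below emits both queries; the helper names and the execution id `QID1234` are hypothetical.

```python
def slowest_requests_sql(top_n: int = 10) -> str:
    """Query the longest-running PolyBase distributed requests."""
    if isinstance(top_n, bool) or not isinstance(top_n, int) or top_n < 1:
        raise ValueError("top_n must be a positive integer")
    return (
        f"SELECT TOP ({top_n}) execution_id, status, start_time, end_time, total_elapsed_time\n"
        "FROM sys.dm_exec_distributed_requests\n"
        "ORDER BY total_elapsed_time DESC;"
    )


def request_steps_sql(execution_id: str) -> str:
    """Query the individual steps of one distributed request."""
    # Escape single quotes so the id is safe inside a T-SQL string literal.
    literal = execution_id.replace("'", "''")
    return (
        "SELECT step_index, operation_type, location_type, status,\n"
        "       total_elapsed_time, row_count, command\n"
        "FROM sys.dm_exec_distributed_request_steps\n"
        f"WHERE execution_id = '{literal}'\n"
        "ORDER BY step_index;"
    )


if __name__ == "__main__":
    print(slowest_requests_sql())
    # 'QID1234' is a made-up id; use one returned by the first query.
    print(request_steps_sql("QID1234"))
```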
23
Planning for Growth
Data lakes tend to fill up.
24
HDFS: Standard Hadoop Cluster
[Diagram labels: Web click data, Decision Support Databases, OLAP, EDW; Landing Zone Servers (SMB/CIFS, NFS, HTTP, FTP); Hadoop cluster with name nodes and compute/data nodes running MapReduce, each file stored as three copies]
Step 1: Data is copied into the Landing Zone.
Step 2: Data is copied into the Cluster (3 times).
Step 3: Hadoop jobs are run.
25
The EMC Isilon Advantage for Analytics
1. Scale-Out Storage Platform: multiple applications & workflows
2. No Single Point of Failure: distributed NameNode
3. End-to-End Data Protection: SnapshotIQ, SyncIQ, NDMP backup
4. Industry-Leading Storage Efficiency: >80% storage utilization
5. Independent Scalability: add compute & storage separately
6. Multi-Protocol: industry-standard protocols (NFS, CIFS, FTP, HTTP, HDFS)
EMC Isilon has recently introduced a new scale-out NAS solution for Hadoop that is designed to readily support business analytics as well as other enterprise applications and workflows. (This eliminates the siloed infrastructure approach used in many initial Hadoop deployments.)
The new EMC solution also eliminates the "single point of failure" issue. We do this by enabling all nodes in an EMC Isilon storage cluster to become, in effect, NameNodes. This greatly improves the resiliency of your Hadoop environment.
The EMC solution for Hadoop also provides reliable, end-to-end data protection for Hadoop data, including snapshotting for backup and recovery and data replication (with SyncIQ) for disaster recovery.
Our new Hadoop solution also takes advantage of the outstanding efficiency of EMC Isilon storage systems. With our solutions, customers can achieve 80% or more storage utilization.
EMC Hadoop solutions can also scale easily and independently. This means if you need to add more storage capacity, you don’t need to add another server (and vice versa). With EMC Isilon, you also get the added benefit of linear increases in performance as the scale increases.
EMC also recently announced that we are the first vendor to integrate HDFS (Hadoop Distributed File System) into our storage solutions. This means that with EMC Isilon storage, you can readily use your Hadoop data with other enterprise applications and workloads while eliminating the need to manually move data around as you would with direct-attached storage.
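A quick back-of-the-envelope comparison shows what the storage-efficiency claim means for capacity planning. The sketch below compares the raw capacity needed when every file is stored three times (the standard Hadoop cluster described earlier in the deck) with the raw capacity needed at 80% utilization. Reading "80% utilization" as usable capacity divided by raw capacity is an assumption, and the 2.5 TB figure is simply the size of the data set used in this proof of concept.

```python
def raw_capacity_with_replication(data_tb: float, copies: int = 3) -> float:
    """Raw capacity needed when every file is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return data_tb * copies


def raw_capacity_with_utilization(data_tb: float, utilization: float = 0.80) -> float:
    """Raw capacity needed when `utilization` of raw capacity is usable.

    ASSUMPTION: utilization is interpreted as usable capacity / raw capacity.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return data_tb / utilization


if __name__ == "__main__":
    data_tb = 2.5  # size of the clickstream data set used in the POC
    print(f"3x replication:  {raw_capacity_with_replication(data_tb):.3f} TB raw")
    print(f"80% utilization: {raw_capacity_with_utilization(data_tb):.3f} TB raw")
```

Under these assumptions, the same 2.5 TB needs 7.5 TB of raw disk with triple replication versus roughly 3.1 TB at 80% utilization, before allowing any headroom for growth.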
26
Hadoop Architecture with Isilon
[Diagram labels: R (RHIPE), Mahout, Hive, HBase, Pig; NameNode, Job Tracker, Task Tracker, DataNode; compute nodes connected over Ethernet to Isilon nodes (name node, data node)]
By leveraging an EMC platform like ECS, Isilon, or even DSSD, the data node capabilities are separated from the compute nodes. This offers a number of advantages.
27
Can you really do big data with SQL Server 2016+ ?
Yes you can, to a point.
PolyBase and a Hadoop Distributed File System (HDFS) allow you to get started with Big Data.
Many organizations use Dell EMC Isilon storage, which supports HDFS.
This is a reasonable, workable solution, tested with 2.5TB of unstructured marketing analytics (clickstream) data.
28
Your Next Step: Big data pilot project
1. Acquire some space on your organization’s Isilon cluster. Alternatively, you could look into Azure Blob Storage and Azure Compute (SQL Server & Windows Server).
2. Load your data sets into your newly acquired storage.
3. Install SQL Server (Head) and one or two Compute Nodes.
4. Create your External Data Source, External File Format and External Tables, and start experimenting.
30
Special Thanks to: Dell EMC Big Data team - @DellEMCbigdata
Christian Scharrer - Dell EMC Senior Systems Engineer
Rob Sonders - Dell EMC Microsoft Specialist, SQL
Michael Wells - Dell EMC Senior Systems
Dell EMC - Denver, CO office - Isilon Engineering
31
Thank You Sponsors
[Sponsor logos by tier: Platinum, Gold, Silver, Bronze, Swag, Venue]