SQL Server Analytics Platform System (APS) vs SQL Server 201x: let's see the big differences
Henk van der Valk, Microsoft
Session level: Intermediate (/Advanced)
With thanks to our Sponsors
PASS SQL Saturday – Holland
Speaker Introduction
@HenkvanderValk
9 years active in the SQL PASS community
10 years at the Unisys EMEA Performance Center
2002: largest SQL Server DWH in the world (SQL 2000)
Project REAL (SQL 2005)
ETL world record: loading 1 TB within 30 minutes (SQL 2008)
Contributor to SQL Server performance whitepapers, perf tips & tricks
Schuberg Philis: 100% uptime for mission-critical apps
Since April 1st, 2011: Microsoft SQL PDW/APS
All info represents my own personal opinion (based on my own experience) and not that of Microsoft
What you will learn…
Why APS, and its scale factor over SQL 2014
APS to jumpstart your (upcoming) big data project
Leverage your existing SQL skills
Even 200 GB can be a big data challenge when time & complexity become the enemy
Combine internal, external, structured & unstructured data
Think about your gateway to the cloud
Speed, to lower time to insight
Agenda
Why a SQL scale-out architecture & the Analytics Platform System?
Live comparison demos:
Data loading / export
Building a clustered columnstore index
Query performance
PolyBase: query Twitter data stored in Hadoop from APS
Microsoft Cybercrime Center on APS demo
ROLAP
SQL Kitchen / Q&A
The traditional data warehouse
1. Increasing data volumes in the data warehouse, fed by ETL
2. BI and analytics: dashboards, reporting, real-time data
3. New data sources & types
4. Cloud-born data
Data sources: OLTP, ERP, CRM, LOB; non-relational data: devices, web, sensors, social
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
The Modern Data Platform
BI & analytics: self-service, collaboration, corporate, predictive, mobile
Data enrichment and federated query: extract-transform-load, single query model, data quality, master data management
Data management & processing: non-relational, relational, analytical, streaming; internal & external
Infrastructure
Data sources: OLTP, ERP, CRM, LOB; non-relational data: devices, web, sensors, social
Parallel or not Parallel
Scale up (SMP):
Built for a specific requirement
HA etc. must be built additionally
Maintain and tune (load/file distribution)
Unknown future workloads
Still a very good data mart solution in a hub-and-spoke architecture with SQL Server PDW
Scale out (MPP):
Resilient & predictable
Big data / DW best practices in a box
Deploy fast and drive value
Built-in HA
Scalable (start small, grow when needed) appliance
Example: a 160 GB fact table on SMP > a select on this table queries 1 x 160 GB on a single SQL instance. The same 160 GB fact table on PDW (MPP, distributed across 8 nodes and 64 distributions) > a select on this table queries 64 x 2.5 GB in parallel. Add a 2-node increment and the same select queries 80 distributions x 2 GB in parallel, and so on. Simply put, MPP breaks the work down and runs the parts in parallel: the more nodes, the more parallelism and the better the ability to hit the target time.
APS: querying 1 petabyte of data in 1 second
~294 billion rows
Integrate Relational + Non-Relational
PolyBase: a single query for structured and unstructured data, with seamless & high-speed access
Query and join Hadoop tables with relational tables in parallel
Use the SQL query language (existing SQL skillset, no IT intervention)
Leverages the power of MPP to enhance query execution performance
No need to duplicate Hadoop data into the DW, or vice versa
Works with all major Hadoop distributions
Predicate pushdown onto the Hadoop platform to minimize data transfer
Data compression
Save time and costs; analyze all data types
SQL Server Enterprise Edition and APS

SQL Server Enterprise Edition:
Form factors: software (SQL Server 2014), reference architecture (Fast Track), appliance (Quickstart)
Optimal capacity: from 0 through 50 terabytes
Unique characteristics and features: SQL Server SMP integration, which includes updateable clustered columnstore indexes and in-memory OLTP tables for optimized data loading

Microsoft Analytics Platform System:
Form factor: appliance (APS)
Optimal capacity: from 0 terabytes through 6 petabytes
Unique characteristics and features:
Joins structured and unstructured data together through big data integration with PolyBase
Includes an HDInsight Hadoop region sitting over the fabric, with shared metered resources for CPU, memory, and storage
Stores and retrieves data from Azure Storage, benefiting from on-premises data compression for fast data transfer
Scales out in a near-linear fashion as hardware is added for rapidly growing data requirements
Provides a massively parallel processing shared-nothing architecture for heavily parallelized performance requirements (for example, heavy concurrency or computation)
Demo gear: SQL 2014 EE on SMP vs SQL Server APS (MPP)
8-node APS (MPP):
256 GB RAM per node
280 spindles, with data spread across 64 separate distributions
Hortonworks Hadoop on Windows
Versus SMP: DL580 / 256 GB RAM
SAN: high-end flash drives + 100 SAS spindles
APS for Optimized Load Speed
1) Bulk inserting data into HEAP & clustered columnstore index
Bulk Insert / BCP / SSIS vs DWLoader
Load from GZIP + wildcard
SQL 2014 (SMP) Bulk Insert
Question: how long does it take to load a single 75 GB flat file with 600 million rows into a SQL 2014 SMP HEAP & CCI?
Answer: 1 hour 16 min (HEAP)
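A heap load like the one timed here can be sketched with a plain BULK INSERT; the table name, file path and options below are illustrative assumptions, not the exact demo script:

```sql
-- Hypothetical sketch of the 75 GB flat-file load into a heap.
-- TABLOCK allows a minimally logged, bulk-optimized load into a heap
-- (under the simple or bulk-logged recovery model); without it the
-- load is fully logged and considerably slower.
BULK INSERT dbo.LINEITEM_HEAP
FROM 'C:\load\lineitem_600M.tbl'      -- assumed path
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    TABLOCK
);
```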
SQL 2014 Bulk Insert into CCI
…or almost 3 hours for a direct load into a SQL 2014 clustered columnstore index.
SQL2014 – Direct load into CCI
ALTER RESOURCE GOVERNOR RESET STATISTICS See:
SQL 2014 EE – Enable Resource Governor
Enable RG and reset its statistics, or add more memory to the default resource group (~2x the dataset?):

ALTER WORKLOAD GROUP [default] WITH (request_max_memory_grant_percent = 95);
ALTER RESOURCE GOVERNOR RESET STATISTICS;
ALTER RESOURCE GOVERNOR RECONFIGURE;
How to load data as fast as possible into SQL 201x (or any DB)
Use a modulo function to write to separate tables in parallel
Use partition switching to consolidate
See:
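The two bullets above can be sketched as follows; all table and column names are assumptions. Each of N concurrent loaders keeps one core busy by inserting only the rows whose key matches its modulo slot, and partition switching then consolidates the staging tables as a metadata-only operation:

```sql
-- Hypothetical sketch: loader 0 of 4 (loaders 1..3 run concurrently
-- with "% 4 = 1", "% 4 = 2", "% 4 = 3").
INSERT INTO dbo.LINEITEM_STAGING_0 WITH (TABLOCK)
SELECT *
FROM   dbo.LINEITEM_SOURCE
WHERE  l_orderkey % 4 = 0;

-- Consolidate: switch each fully loaded staging table into the matching
-- partition of the target. A switch requires aligned schemas, indexes
-- and check constraints, but is effectively instant.
ALTER TABLE dbo.LINEITEM_STAGING_0
    SWITCH TO dbo.LINEITEM_TARGET PARTITION 1;
```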
APS – Direct Insert / Parallel Write
APS (AU1-2) supports loading gzip archive files directly
Use wildcards to load multiple files
(Diagram: dwloader.exe on the load server streams over the 10 / 56 Gb/s network to the control node CTL01 and to compute nodes CMP01 through CMP56 in parallel, each node on its own 1/56 Gb/s link.)
Summary: data loading in SQL 2012 / SQL 2014 / APS
Single 75 GB / 600-million-row LineItem flat file into a HEAP
Compared: SQL 2014 SMP vs SQL 2012 SMP vs SQL APS
APS loads data in parallel vs a single core
An SMP SQL 201x bulk insert uses a single core per task
A PDW bulk insert writes into all distributions in parallel
Data loading: conclusion
Single 75 GB / 600-million-row LineItem flat file (note: the load server is the bottleneck)
PDW/APS loads single-flat-file data 15-36x faster
On SQL 2014, loading into a HEAP and building the CCI afterwards is the fastest route to a CCI
APS admin console: direct load into a table with CCI (screenshot)
SSIS APS Destination Adapter
Insert data faster in parallel: single 75 GB / 600-million-row LineItem flat file
APS SSIS data load: 2.5x faster
SSIS APS Destination Adapter
APS data load options:
Append (query while loading)
Reload
Upsert
Fastappend
Optimize Data Export Speed
2) Exporting data out of SQL
BCP out vs Remote Table Copy (RTC)
Remote Table Copy to SMP
Creating a remote table on a regular SQL 201x SMP server utilizes all available cores; APS exports data from all distributions in parallel:

CREATE REMOTE TABLE SMP_DB.dbo.LineItem_test
AT ( 'Data Source = {Destination},1433 ; …' )
AS SELECT * FROM DemoDB_SQLSat.dbo.lineitem;
Ease of Use
3) Building an updateable clustered columnstore index
How to load & create a CCI on SQL 2014 as fast as possible
1) Load all data into a SQL 2014 HEAP first
2) Build the CCI separately on the heap table:

CREATE CLUSTERED COLUMNSTORE INDEX CCI ON [lineitem_600M];

Wait stats:
COLUMNSTORE_BUILD_THROTTLE = assign more RAM
SOS_PHYS_PAGE_CACHE = memory housekeeping
SQL 2014 EE – Create CCI test:

DROP INDEX LineItem_CCI ON demoDB_SMP_CCI.dbo.LINEITEM_HEAP;
CREATE CLUSTERED COLUMNSTORE INDEX LineItem_CCI ON dbo.LINEITEM_HEAP WITH (MAXDOP = xx);

MAXDOP | Duration      | K rows/sec | K rows/sec @ linear scale
     1 | 1 hr 35 min   |        104 |
     2 | 49 min 27 sec |        202 |   208
     3 | 34 min 8 sec  |        293 |   312
     4 | 26 min        |        382 |   416
     5 | 21 min        |        461 |   520
     6 | 18 min        |        539 |   624
     8 | 14 min 53 sec |        671 |   832
    14 | 10 min 7 sec  |        987 | 1,456
    64 | 8 min 48 sec  |      1,137 | 6,656
How to rebuild a partitioned CCI as fast as possible
Take NUMA into account:

@ECHO OFF
ECHO Rebuilding CCI on multiple partitions in parallel
ECHO START CCI rebuild of all partitions in parallel with MAXDOP 1:
FOR /L %%i IN (%1, 1, %2) DO (
    START /MIN "%%1" sqlcmd -E -S x\SQL2014EE -d DemoDB_SMP_CCI -Q "ALTER INDEX CIX ON LINEITEM_STAGING_%%i REBUILD PARTITION = %%i WITH (MAXDOP = 1)"
)
SQL APS – Ease of use: CCI has already been available in APS for 1+ years
The CCI is part of the CREATE TABLE statement:

CREATE TABLE [dbo].[lineitem_CCI_SQLSaturday] (
    [l_orderkey]      bigint NOT NULL,
    [l_partkey]       int NOT NULL,
    [l_suppkey]       int NOT NULL,
    [l_linenumber]    int NOT NULL,
    [l_quantity]      smallint NOT NULL,
    [l_extendedprice] float NOT NULL,
    [l_discount]      smallmoney NOT NULL,
    [l_tax]           smallmoney NOT NULL,
    [l_returnflag]    char(1) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
    [l_linestatus]    char(1) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
    [l_shipdate]      smalldatetime NOT NULL,
    [l_commitdate]    smalldatetime NOT NULL,
    [l_receiptdate]   smalldatetime NOT NULL,
    [l_shipinstruct]  char(25) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
    [l_shipmode]      char(10) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL,
    [l_comment]       varchar(44) COLLATE Latin1_General_100_CI_AS_KS_WS NOT NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([l_orderkey]));
CCI build – conclusion
SQL 2014 EE: load data into a heap and build the CCI separately; for the absolute fastest CCI build, take the number of cores per socket into account
APS: the SSIS data load is significantly faster, and a direct load into the CCI with DWLoader is fastest
4) Query Performance
Ad-hoc query / raw data scan speed
SMP vs MPP

Task                                          | SMP           | APS          | Improvement
Data loading, 600 million rows                | 1 hour 19 min | 4 min 30 sec | 17½x
Sales per product category (1-billion-row FactSales), current year vs previous:
  1 billion rows, no WHERE clause             | 15 sec        | 1 sec        | 15x
  2 billion rows, no WHERE clause             | 42 sec        | 1 sec        | 42x
  5 billion rows, no WHERE clause             | 1 min 26 sec  | 2 sec        | 43x
  …                                           | -             | 7 sec        |
Sales insights over a year, per state         | 3 min 18 sec  |              | 99x
5) Backup of a SQL Database
Maximize backup/restore speed
How to back up SQL 201x as fast as possible
Optimize SQL throughput with parameters: stripe the backup over multiple files to create the parallel writes manually yourself, and tune BUFFERCOUNT, BLOCKSIZE and MAXTRANSFERSIZE (the "magic"):

DBCC TRACEON (3605, -1);
DBCC TRACEON (3213, -1);
BACKUP DATABASE [TPCH_1TB]
TO   DISK = N'C:\DSI3400\LUN00\backup\TPCH_1TB-Full',
     DISK = N'C:\DSI3500\LUN00\backup\File2',
     DISK = N'C:\DSI3500\LUN00\backup\File3',
     DISK = N'C:\DSI3500\LUN00\backup\File4',
     DISK = N'C:\DSI3500\LUN00\backup\File5',
     DISK = N'C:\DSI3500\LUN00\backup\FileX'
WITH NOFORMAT, INIT, NAME = N'TPCH_1TB-Full Database Backup',
     SKIP, NOREWIND, NOUNLOAD, COMPRESSION, STATS = 10
     -- the magic:
     , BUFFERCOUNT = 2200
     , BLOCKSIZE =
     , MAXTRANSFERSIZE =
GO
DBCC TRACEOFF (3605, -1);
DBCC TRACEOFF (3213, -1);
APS Backup: parallel node backup to an external file share by default
Benefits from 56 Gbit InfiniBand network cards for best throughput
Backup sets are compressed automatically
Extra backup-server hardware can be ordered separately
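On APS the equivalent of the hand-tuned SMP striping script is a one-liner, since every compute node writes its own distributions to the backup share in parallel. A sketch, with the share path as an assumption:

```sql
-- Hypothetical PDW/APS backup: each node backs up its portion of the
-- database to the UNC share in parallel; backup sets are compressed
-- automatically.
BACKUP DATABASE DemoDB_SQLSat
TO DISK = '\\backupserver\share\DemoDB_SQLSat_Full';
```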
7) Hadoop Integration (APS only)
Parallel & seamless querying of (Twitter) data stored on Hadoop
APS Hadoop integration
Use external tables to represent HDFS data
Span both HDFS and PDW data in a single query
Connect to an internal or external Hadoop cluster, or the cloud
Parallel (compressed) data import and export from/to HDFS
The enhanced SQL APS query engine lets T-SQL query Hadoop from within APS by defining external tables: just imagine running a single query that joins structured data in the RDBMS with unstructured data in HDFS (social apps, sensor & RFID, mobile apps, web apps)…!
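A sketch of the external-table pattern described above. The exact DDL differs between APS appliance updates, and the namenode URI, paths, columns and the join are assumptions for illustration, not the demo's actual objects:

```sql
-- Hypothetical external table over delimited tweet files in HDFS.
CREATE EXTERNAL TABLE dbo.Tweets_HDFS (
    tweet_id   bigint       NOT NULL,
    user_name  varchar(100) NOT NULL,
    tweet_text varchar(280) NOT NULL
)
WITH (
    LOCATION = 'hdfs://namenode:8020/data/tweets/',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- One T-SQL statement spanning both worlds: join HDFS data with a
-- relational PDW table.
SELECT   c.c_name, COUNT(*) AS tweet_count
FROM     dbo.Tweets_HDFS AS t
JOIN     dbo.Customer    AS c ON c.c_twitter_handle = t.user_name
GROUP BY c.c_name;
```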
APS PolyBase – Parallel Data Transfers

EXEC sp_configure 'hadoop connectivity', 1;
RECONFIGURE;

Versus an SMP SQOOP connector (a modular design with a limited number of connections), PolyBase gives the PDW cluster's compute nodes bi-directional, direct parallel access to the HDFS blocks on the Hadoop cluster's data nodes; in APS this runs over the 56 Gbit/sec InfiniBand fabric.
Demo #3: PolyBase! Query & store Hadoop data, bi-directional, seamless & fast, from within APS
Time to Insights… APS Cybercrime Demo
Big Data: the data-lake syndrome
PolyBase provides a high-speed bridge for everyone
APS is the gateway to the cloud!
Avoid Hadoop becoming another island; don't strand your data on an island!
8) Cube ROLAP Mode Support
Reduce cube processing times
APS / SSAS ROLAP
Special SSAS ROLAP option for APS
EnableRolapDistinctCountOnDataSource
(Diagram: SSAS connects to APS over GB Ethernet or InfiniBand; MOLAP comparison)
Why APS When I Have SQL Server?
Manageable costs
Scale-out SQL MPP versus scale-up SMP
"Small, big & huge data" integration
Query performance
Appliance simplicity: hardware + software

For existing Microsoft SQL Server customers, there are plenty of reasons to be excited about SQL Server 2012 PDW. The top reasons to upgrade from SQL Server to SQL Server PDW are as follows:

Moving from scale up (an SMP architecture) to scale out (an MPP architecture). SQL Server is an SMP, scale-up architecture: queries run on a shared-everything architecture, meaning everything is processed on a single box that shares memory, disk and I/O. To get more horsepower out of an SMP box, you need to buy a brand-new hardware server every time, with diminishing returns up to a maximum scale limit. SQL Server PDW is a scale-out, shared-nothing architecture: there are multiple physical nodes, each running its own instance of SQL Server with dedicated CPU, memory and storage. As queries go through the system, they are broken up to run simultaneously over each physical node. The benefit is the ability to add hardware to your deployment and scale out linearly to petabytes of data, with no diminishing returns. This can take your SQL Server data warehouse beyond most large warehouse implementations.

"Big data" integration with PolyBase. PolyBase is a fundamental breakthrough in the data processing engine that enables integrated query across Hadoop and relational data. Without manual intervention, the PolyBase query processor can accept a standard SQL query and join tables from a relational source with tables from a Hadoop source, returning a combined result seamlessly to the user. This feature is only available in PDW and is not available in the SQL Server software (at this time).

New xVelocity updateable columnstore. Although SQL Server 2012 released with the xVelocity columnstore index, SQL Server 2012 PDW has the next version of the xVelocity columnstore, which is both clustered and updateable. This allows you to make the xVelocity columnstore the primary storage structure, saving roughly 70% of overall storage use by eliminating the row-store copy of the data entirely. Additionally, updates and direct bulk load are fully supported on the xVelocity columnstore, simplifying and speeding up data loading and enabling real-time data warehousing and trickle loading, all while maintaining interactive query responsiveness. This version of xVelocity will not be in SQL Server 2012 until the next major release.

Benefits of a hardware + software solution. Appliances are co-engineered by Microsoft and key hardware partners to deliver a fully integrated hardware and software solution, giving customers the fastest time to solution. Customers are shipped a fully integrated appliance with hardware and software pre-built in the factory: a true plug-and-play experience. Customers need not worry about putting together servers, storage arrays, network switches, cables, licenses, power distribution units or racks, nor about configuring and tuning the software; everything comes assembled and fully installed from the factory.

Expected query performance gains. Due to the architecture of PDW, and depending on the complexity of the queries, customers have seen improvements of 5x to 100x over the same query run in SQL Server. As an example, Hy-Vee saw one of their queries run 100x faster simply by running it in PDW.

Manageable costs. While other vendors may charge in excess of $1 million for their scale-out MPP appliance, Microsoft SQL Server 2012 PDW has the lowest price per terabyte of any vendor by a significant margin (~2x lower than the market). This puts a data warehouse solution within reach of every SQL Server customer running data warehouses.

Speed & scale!
…What would YOU do if you could query all your data, and more, within seconds?
Further Reading
www.microsoft.com/APS
SQL Server APS landing page:
Introduction to PolyBase:
September APS TCO comparison:
Henk.vanderValk@microsoft.com www.henkvandervalk.com
Q&A
With thanks to our Sponsors!
PASS SQL Saturday – Holland