Data Warehouse in the Cloud – Marketing or Reality? Alexei Khalyako Sr. Program Manager Windows Azure Customer Advisory Team
Data Warehouse we used to know: high-end workload; high-end hardware; special know-how.
Reality is: thousands of departmental-level DWs; relatively low performance SLA. *BeyeNetwork Big Data research
New BI demands: utilize external data sources; non-structured data; origin is in the cloud. *BeyeNetwork Big Data research
New opportunity: the platform is there. IaaS (SQL VM): “closer” to data. PaaS (SQL Azure DB): “closer” to data; less administrative overhead; lower initial cost and TCO.
SQL Server Data Warehousing in Windows Azure Virtual Machines: inspired by the Fast Track Reference Architecture guide; based on the High Memory images; up to 1 TB. MSDN: SQL Server Data Warehousing in Windows Azure Virtual Machines
High Memory VM in Azure
How to deploy: PowerShell script; Windows Azure Gallery
The Azure Data Warehouse under the hood
Data Warehouse Lifecycle: thoughts on the architecture; creating the DB; connectivity; populating the database (initial data loading OR backup/restore; incremental data loading); compression; query performance; architecture.
Thoughts on the architecture. Data loading: minimize log impact; scale the loading streams; do not reinvent the wheel, follow the Data Loading Performance Guide. Query performance: likewise, do not reinvent the wheel and follow the existing guidance.
Windows Azure VM Architecture: disks are implemented as a shared multi-tenant service; built-in triple redundancy, optional geo-redundancy; performance is less predictable than on-premises (host machines, storage services and network bandwidth are shared between subscribers; performance can depend on where and when the VM is provisioned; subject to maintenance operations); trade-off: granular control and configurability vs. cost, simplicity and out-of-the-box redundancy. Diagram: Storage Stamp (Stream Layer, Partition Layer, Front-ends), LB, intra-stamp replication, geo-replication, Storage Location Service. Note: to achieve the same level of redundancy in an on-premises deployment, you would need to set up multiple disk arrays in multiple locations and a synchronization mechanism, such as storage area network (SAN) replication.
Tweaks to improve the IO subsystem: database file initialization (GPEdit.msc); data file placement; SQL striping for user data and TempDB (aggregated throughput); set the size and data growth options wisely; plan maintenance. *You may do it differently, but then creating a 350 GB DB took ~3 hours. All the options are there, but you need to double-check that they are all set correctly, and this is why.
Scaling IO options: Windows Storage Spaces (SQL data files, log drive; the support story is not clear); spread the filegroup over all drives; separate the log and data disks. Notes: Windows 2012 Storage; mention block sizes.
Scaling IO options: data disk (read), log (write). SQLIO cumulative data, throughput metrics:
Configuration                        Block size   IOs/sec    MBs/sec
Single data disk                     256K         288        71.98
Windows Storage Spaces x3 disks      256K         640.87     160.21
SQL striping x3 disks                256K         599.91     149.97
Single data disk                     64K          1215.13    75.94
Windows Storage Spaces x3 disks      64K          2677.69    167.35
SQL striping x3 disks *              64K          2742.22    171.38
* But we can access one file at a time!
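The SQLIO figures above are internally consistent: throughput in MBs/sec should equal IOs/sec multiplied by the block size. The following is a minimal sketch of that check, not part of the original test setup; the function names, the tolerance, and the assumption that 1 MB = 1024 KB are illustrative choices.

```python
def throughput_mb_per_sec(iops: float, block_kb: int) -> float:
    """Throughput implied by an IO rate and a block size (assumes 1 MB = 1024 KB)."""
    return iops * block_kb / 1024


# (configuration, block size in KB, IOs/sec, MBs/sec as reported on the slide)
SQLIO_RESULTS = [
    ("single data disk", 256, 288.00, 71.98),
    ("storage spaces x3", 256, 640.87, 160.21),
    ("sql striping x3", 256, 599.91, 149.97),
    ("single data disk", 64, 1215.13, 75.94),
    ("storage spaces x3", 64, 2677.69, 167.35),
    ("sql striping x3", 64, 2742.22, 171.38),
]


def check_results(results=SQLIO_RESULTS, tolerance=0.05):
    """Return (label, block_kb, derived MB/s, reported MB/s, matches) per row.

    The tolerance is an assumption that allows for rounding in the reported values.
    """
    rows = []
    for label, block_kb, iops, reported in results:
        derived = throughput_mb_per_sec(iops, block_kb)
        matches = abs(derived - reported) <= tolerance
        rows.append((label, block_kb, round(derived, 2), reported, matches))
    return rows


if __name__ == "__main__":
    for row in check_results():
        print(row)
```

The same helper can be used to compare a single disk with a striped set: in both block sizes, three disks deliver a little more than twice the single-disk throughput.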
Connectivity options: Windows Azure VM endpoints; Point-to-Site / Site-to-Site. *Other options are also available (FTP)
What and how we tested: TPC-H, star schema; the workload is known, and we wanted to see how it behaves in the cloud; size 200 GB.
Getting initial data: copy the backup to the data disks; backup/restore to/from URL; ETL to the new DB.
URL is fast!
Target                          DB size   Time      Speed
Backup to the local data disk   244 GB    3 hours   22.978 MB/sec
Backup to the URL               244 GB    46 min    90.667 MB/sec
Note: add restore; have the throughput numbers.
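The reported backup speeds can be sanity-checked as database size divided by elapsed time. This is a rough sketch under two assumptions: 1 GB = 1024 MB, and the slide's speed figures are read as roughly 23 and 91 MB/sec. The function name is illustrative.

```python
def backup_speed_mb_per_sec(size_gb: float, minutes: float) -> float:
    """Average backup throughput in MB/sec (assumes 1 GB = 1024 MB)."""
    return size_gb * 1024 / (minutes * 60)


# Durations as rounded on the slide: 3 hours to local disk, 46 minutes to URL.
local_disk = backup_speed_mb_per_sec(244, 3 * 60)
to_url = backup_speed_mb_per_sec(244, 46)

if __name__ == "__main__":
    print(f"local data disk: {local_disk:.1f} MB/sec")
    print(f"URL:             {to_url:.1f} MB/sec")
    print(f"speed-up:        {to_url / local_disk:.1f}x")
```

The small differences from the slide's values are expected, since the elapsed times are rounded to whole hours and minutes.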
DB and data loading. Data loading: sizing; tools (BCP, SSIS, ...); time SLA. Query performance: indexing strategy; sizing; compression.
Loading data in Azure: smaller batches (10K-15K rows); retry logic; network latency is high; parallel loading! Start with: SSIS for Hybrid Data Movement; SSIS Performance and Operational guide. Note: contrast cloud vs. on-prem.
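The combination of small batches and retry logic can be sketched as follows. This is an illustration only, not the loader used in the tests: `send_batch` is a hypothetical callable standing in for whatever actually writes a batch (BCP, SSIS, a driver call), it is assumed to be all-or-nothing per batch, and `ConnectionError` stands in for whichever transient error the real client raises.

```python
import time
from itertools import islice


def batches(rows, size=10_000):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_with_retry(rows, send_batch, batch_size=10_000, max_attempts=5,
                    base_delay=1.0, sleep=time.sleep):
    """Send rows in small batches, retrying a failed batch with exponential backoff.

    Assumes send_batch(chunk) either commits the whole chunk or nothing,
    so resending a failed chunk does not duplicate rows.
    """
    loaded = 0
    for chunk in batches(rows, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                send_batch(chunk)
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                sleep(base_delay * 2 ** (attempt - 1))
        loaded += len(chunk)
    return loaded
```

Keeping batches small limits how much work is repeated when a transient network error hits, which matters more when latency is high.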
Baseline: understand the performance of the data sources. Flat file in an Azure VM: ~60 MB/sec reads. SQLIO shows the max throughput of the IO subsystem on the DB side; application performance can be different.
Parallel loading: flat files, max 60 MB/sec each; Mod(7) function; 8 destinations to keep all CPUs busy on the DW side.
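Splitting one source into parallel streams with a modulo function can be sketched like this. It is a minimal illustration, assuming an integer key in the first column of each row; the stream count is a parameter, because the slide mentions both Mod(7) and 8 destinations and does not spell out how they relate.

```python
from collections import defaultdict


def stream_for(key: int, n_streams: int) -> int:
    """Pick a destination stream for a row: key modulo the number of streams."""
    return key % n_streams


def split_into_streams(rows, n_streams, key=lambda row: row[0]):
    """Group rows by destination stream so each stream can be loaded in parallel."""
    streams = defaultdict(list)
    for row in rows:
        streams[stream_for(key(row), n_streams)].append(row)
    return dict(streams)


if __name__ == "__main__":
    # Hypothetical rows: (order key, payload)
    sample = [(i, f"row-{i}") for i in range(80)]
    for stream_id, chunk in sorted(split_into_streams(sample, 8).items()):
        print(stream_id, len(chunk))
```

With evenly distributed keys, each stream receives a similar share of rows, which is what keeps all destinations busy at the same time.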
Begin to load
Monitoring loading performance. Top waits you will keep seeing: ASYNC_NETWORK_IO, PAGEIOLATCH_EX, WRITELOG, PAGEIOLATCH_UP, SOS_SCHEDULER_YIELD, PAGEIOLATCH_SH, PAGELATCH_UP, PREEMPTIVE_OLEDBOPS. They point to network IO, disk IO and CPU.
Loading: table options. Heap: 780 772 573 rows, elapsed time 01:06:15.313. Heap compressed: 780 772 573 rows, elapsed time 05:12:06.094. Blob: finished 1:39:05 PM, elapsed time 01:06:15.313; finished 3:01:24 PM, elapsed time 01:08:00.156; finished 8:58:09 AM, elapsed time 01:01:56.641 (heap blob).
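The elapsed times above translate into load rates as rows divided by seconds. The sketch below assumes that the first elapsed time belongs to the plain heap and the second to the compressed heap, following the order on the slide; the function names are illustrative.

```python
def elapsed_seconds(hms: str) -> float:
    """Convert an 'HH:MM:SS.fff' elapsed time into seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def rows_per_second(rows: int, elapsed: str) -> float:
    """Average load rate for a run."""
    return rows / elapsed_seconds(elapsed)


ROWS = 780_772_573  # row count reported for both runs

heap_rate = rows_per_second(ROWS, "01:06:15.313")
compressed_rate = rows_per_second(ROWS, "05:12:06.094")
slowdown = elapsed_seconds("05:12:06.094") / elapsed_seconds("01:06:15.313")

if __name__ == "__main__":
    print(f"heap:            {heap_rate:,.0f} rows/sec")
    print(f"heap compressed: {compressed_rate:,.0f} rows/sec")
    print(f"slowdown:        {slowdown:.1f}x")
```

Expressing the runs as rows per second makes it easier to compare them with loads of a different size.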
Loading: table options. Heap: 780 772 573 rows, elapsed time 01:06:15.313. Clustered index (sort!): elapsed time 01:20:12.547.
Query performance: heap; primary key/clustered index; compression.
Query performance: results
Please welcome on stage SQL 2014
What’s new? Data files to BLOBs; updateable clustered columnstore index.
Loading data (load test): heap, 1 hour 1 min; clustered columnstore index, 2 hours 16 min.
SQL 2014
Query 19 Estimates vs Actual
And the winner is… SQL Server 2014!!
Summary: easy and fast deployment through the Gallery or PowerShell scripts; the Azure Data Warehouse is consistent with most of the best practices (query, loading); low initial investment and TCO.
THANK YOU! For attending this session and PASS SQLRally Nordic 2013, Stockholm