
1 Flash Storage 101 Matt Henderson

2 SAN Storage and Mixed Workloads
I/O workloads are unique and constantly changing:
- Per server
- Per application
- Per day/hour (weekday versus weekend)
- Ad hoc reporting
- Administration
Shared SAN storage is designed for peak load, which is not optimal for any single workload.

3 Legacy Technology: Seek Time
Legacy disk average latency: 3-5+ ms
(Diagram: 15K RPM HDD seek time)

4 Storage Technology, Yesterday, Today and Beyond
Legacy disk (15K RPM HDD) avg latency: 3-5+ ms
"Disk plus" (SSD) avg latency: 1-2 ms
Insanely fast (NAND flash) avg latency: 250 μs (μs = 1 millionth of a second), 20x faster than disk!

5 Flash Deployments
PCIe card (1st generation)
- Fastest latency configuration
- Not sharable
- Crashes the host on failure
- Read-mostly workloads
- Requires host-based RAID
- Write cliff
SSD (2nd generation)
- Controller adds latency
- Sharable within the chassis
- Won't crash the host
- Requires data segregation / RAID
- Write cliff
All-flash array (3rd generation)
- One block of storage, flash-specific RAID
- Sharable over the fabric
- Enterprise redundancy
- Massive parallelism
- No write cliff
Speaker notes:
PCIe cards (a flash board straight on the PCIe bus)
- Lowest possible latencies; relatively inexpensive
- High failure rate compared to enterprise-class hard drives; best suited to read-mostly workloads
- Can't scale past the number of x8 slots in the host chassis; not sharable
- A crashed card loses its data; the host can't reboot until the card is replaced; any flash-level error crashes the host
- Suffers from the write cliff; RAID over many cards multiplies the chance of the write cliff hitting a transaction
- Uses host CPUs (draining effectiveness from host applications) and host RAM
- The required OS-level RAID increases boot time, possibly increasing RTO (recovery time objective)
SSDs (a flash board with a controller in a hard drive form factor)
- Slower than PCIe; the controller adds latency but removes the failure-crash problem
- Requires RAID, and RAID multiplies the likelihood of hitting the write cliff
- The hard drive form factor means storage must be segregated (data vs logs vs tempdb)
- The chassis controller is not built for SSD speed; sharable with a hard-drive-based chassis means wasted space on the most expensive tier
- No more than 24 SSDs per 2U chassis; 24 units of 200GB models = 4.8TB before adding RAID and separating data from logs and tempdb
- Sharable storage
All-flash arrays (Violin)
- Built entirely for flash technology; each vendor has their own story
- VIMM chip as the foundation; 4k block device; 12 RAID groups of 5 VIMMs
- PCIe direct attach, or sharable as a SAN
- RAID (vRAID, a custom RAID 3) is already done inside the box; retains speed after a failure
- Hot-swappable and replaceable components, hot spares, no single point of failure
- Spreads data over the entire array; initial formatting randomizes the 4k blocks
- Host I/O is parsed into 4k chunks (one 64k I/O becomes sixteen 4k blocks) for massive parallelization
- Each 4k chunk is written as 1k each to 5 VIMMs (4k of data + 1k of parity); the 1k chunk sent to a VIMM is parsed and written at the bit level simultaneously
- Patented process to avoid the write cliff; better garbage collection; longer flash life span; sustainable performance
- Strategic partnership with Toshiba; the next two generations of flash are already in Violin engineers' hands
- Apple and Violin are the world's two largest consumers of flash (Violin is the second-highest consumer, behind Apple)
- TPC rules require failure tolerance in the storage layer; Violin is the only all-flash vendor with TPC-C and TPC-E results
- One block of storage; linear scaling; any number of LUNs, files, volumes, users, queries

6 New Servers; Old Storage
Many threads, many cores, many time slices, yet only 10-40% CPU utilization.
Why is the CPU not at 100%? I/O wait.
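One way to check the I/O-wait hypothesis from inside SQL Server is to compare signal waits (time runnable tasks spend waiting for a CPU) with resource waits (time spent waiting on I/O, locks, and other resources). A minimal T-SQL sketch against the standard sys.dm_os_wait_stats DMV:

    -- Signal waits  = time runnable tasks spent waiting for CPU.
    -- Resource waits = time spent waiting on I/O, locks, latches, network, etc.
    -- A low signal-wait percentage with large resource waits points at storage, not CPU.
    SELECT
        SUM(signal_wait_time_ms)                              AS signal_wait_ms,
        SUM(wait_time_ms - signal_wait_time_ms)               AS resource_wait_ms,
        CAST(100.0 * SUM(signal_wait_time_ms)
             / NULLIF(SUM(wait_time_ms), 0) AS DECIMAL(5,2))  AS signal_wait_pct
    FROM sys.dm_os_wait_stats;

If the signal-wait share is small while the totals are dominated by I/O-related wait types, the CPUs are being starved by storage rather than kept busy.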

7 SQL Server Process Management
One SQL scheduler per logical core
The SQL scheduler time-slices between users
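To see the one-scheduler-per-logical-core arrangement on a host, sys.dm_os_schedulers can be queried directly. A minimal sketch; the VISIBLE ONLINE filter simply excludes internal schedulers:

    -- One row per scheduler; VISIBLE ONLINE schedulers map to the logical cores SQL Server can use.
    SELECT
        scheduler_id,
        cpu_id,
        current_tasks_count,    -- tasks currently assigned to this scheduler
        runnable_tasks_count    -- tasks with resources ready, waiting for CPU time
    FROM sys.dm_os_schedulers
    WHERE status = 'VISIBLE ONLINE';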

8 Queues
Running: currently executing process
Waiting: waiting for a resource (I/O, network, locks, latches, etc.)
Runnable: resource ready, waiting to get on the CPU
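These three queue states map onto the status column of sys.dm_exec_requests (RUNNING, SUSPENDED = waiting, RUNNABLE). A minimal sketch that counts requests in each state and shows what the waiting ones are waiting on; the session_id > 50 filter is only a rough way to skip system sessions:

    -- RUNNING   = currently on a CPU
    -- SUSPENDED = waiting on a resource (wait_type says which)
    -- RUNNABLE  = resource ready, queued for CPU
    SELECT
        status,
        wait_type,
        COUNT(*) AS request_count
    FROM sys.dm_exec_requests
    WHERE session_id > 50
    GROUP BY status, wait_type
    ORDER BY request_count DESC;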

9 Bottleneck Triage
Waits: what is keeping the SQL engine from continuing?
- Application (SQL)
- Hardware
- Architecture
- Tuning
Production wait statistics sample (columns: WaitType, Wait_S, Resource_S, Signal_S, WaitCount, Percentage, AvgWait_S, AvgRes_S, AvgSig_S):
- BROKER_RECEIVE_WAITFOR: 661.36, 4, 44.6
- LCK_M_IS: 139.46, 139.35, 0.11, 489, 9.4, 0.2852, 0.285, 0.0002
- LCK_M_X: 96.86, 96.54, 0.32, 373, 6.53, 0.2597, 0.2588, 0.0009
- LCK_M_U: 83.93, 83.91, 0.02, 32, 5.66, 2.6227, 2.6221, 0.0006
- PAGEIOLATCH_SH: 83.92, 83.84, 0.08, 9835, 0.0085
- LCK_M_S: 82.44, 82.1, 0.33, 419, 5.56, 0.1967, 0.1959, 0.0008
- ASYNC_NETWORK_IO: 54.4, 53.61, 0.79, 33146, 3.67, 0.0016
- ASYNC_IO_COMPLETION: 43.1, 37, 2.91, 1.1649
- BACKUPIO: 42.22, 42.19, 0.03, 12607, 2.85, 0.0033
- BACKUPBUFFER: 36.64, 36.48, 0.15, 2175, 2.47, 0.0168, 0.0001
- LCK_M_IX: 30.88, 30.85, 130, 2.08, 0.2376, 0.2373, 0.0003
- IO_COMPLETION: 28.12, 28.11, 0.01, 2611, 1.9, 0.0108
- CXPACKET: 23.27, 21.6, 1.67, 3542, 1.57, 0.0066, 0.0061, 0.0005
- PREEMPTIVE_OS_CREATEFILE: 18.84, 247, 1.27, 0.0763
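A wait table like the one above is typically built by aggregating sys.dm_os_wait_stats. The sketch below is a simplified version of that kind of query; the short list of benign wait types filtered out is illustrative, not exhaustive:

    -- Top waits since the last service restart (or since the stats were cleared).
    SELECT TOP (15)
        wait_type,
        wait_time_ms / 1000.0                            AS wait_s,
        (wait_time_ms - signal_wait_time_ms) / 1000.0    AS resource_s,
        signal_wait_time_ms / 1000.0                     AS signal_s,
        waiting_tasks_count                              AS wait_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                            'SQLTRACE_BUFFER_FLUSH', 'BROKER_TASK_STOP')
    ORDER BY wait_time_ms DESC;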

10 Buffer / RAM
SQL buffer pool: MRU-LRU chain; pages move down the chain until they fall off the end
PLE: Page Life Expectancy
- How long a page lives in the chain before being cycled out
- Microsoft recommends over 300 (seconds)
Working set
- How much data do you want/NEED in RAM? Database size isn't relevant
- What's being worked on now, and what will be in the near future?
Workload profile
- What data are the users hitting?
- Is there any way to predict future data page hits?
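Page Life Expectancy is a performance counter, so it can be read from Perfmon or from inside SQL Server. A minimal sketch using sys.dm_os_performance_counters (on NUMA hosts there is also one value per node under the Buffer Node object):

    -- Current Page Life Expectancy in seconds (instance-level Buffer Manager counter).
    SELECT
        [object_name],
        counter_name,
        cntr_value AS ple_seconds
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page life expectancy'
      AND [object_name] LIKE '%Buffer Manager%';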

11 Visualizing Latency
HDD storage, 8 ms latency (0.008 s): 8 ms elapses between a block of data being requested and that block being returned. Total latency = seek time + rotational latency.
Flash storage, 250 μs latency (0.00025 s): the same round trip completes in a fraction of a millisecond.
(Diagram: timelines contrasting the HDD and flash round trips for I/O-bound apps.)
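To put rough numbers behind the HDD figure (round values assumed for illustration, not taken from the slides): a 15K RPM platter completes one rotation in 60 s / 15,000 = 4 ms, so the average rotational latency is about half a rotation, roughly 2 ms; add a typical 3-4 ms average seek and the drive alone costs 5-6 ms per random I/O before any queuing. Against the flash figure, the 8 ms example above is a 32x difference per round trip (0.008 / 0.00025 = 32).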

12 Storage Enables Applications

13 Accelerate Workloads & Save Money
(Diagram: before Violin, the CPUs spend much of their time in I/O wait; after Violin, the I/O is accelerated and the workload consolidates from 4 CPUs down to 1.)

14 Compounding Issues
Adding more cores increases licensing costs
Faster cores still have to wait on blocking items
More/faster CPUs are probably faster, but definitely more expensive
It is better to find the bottleneck and solve it than to buy more or faster CPUs

15 What makes Flash so great for SQL Server?
It's not just latency; it's also parallelism. Most customers can't fully exploit SQL Server parallelism because their storage is underpowered. Modern servers have 16+ cores, and flash allows massively parallel storage access, multiplying the performance improvement.

16 Is data reduction useful for SQL Server?
Gartner says 2-4x data reduction is common for databases
Each data block contains a unique header, so deduplication yield inside a single database is low, even with large block sizes
The bulk of the data reduction will come from compression (more on this next)
Array-based data reduction technologies increase storage latency
Data reduction requires CPU resources: will that be at the host or the array? Licensing costs may come into play if it is on the host
Compression in SQL Server is free in the Enterprise edition

17 Data Reduction for SQL Databases
Compression provides reduction value
Deduplication does not provide much value: all database blocks are unique
Not for OLTP; data reduction is good for reporting
Use SQL Server's internal compression (free with Enterprise)
Source: Gartner Data Center Conference, December 2014

18 SQL Compression vs Array Compression
OLTP is nearly as fast with SQL compression as with no compression
SQL page compression dedups per field, then compresses
Arrays deduplicate on the whole 4k block, then compress
CPU overhead for SQL compression is in the single-digit percent range
(Chart: transactions per second (k) versus thread count)
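To try SQL Server's built-in compression on a specific table, the savings can be estimated first and then page compression applied with a rebuild. A minimal sketch; dbo.Orders is a hypothetical table name, and (as noted above) data compression requires Enterprise edition on the SQL Server versions discussed here:

    -- Estimate how much space PAGE compression would save on a hypothetical table.
    EXEC sp_estimate_data_compression_savings
        @schema_name      = 'dbo',
        @object_name      = 'Orders',
        @index_id         = NULL,
        @partition_number = NULL,
        @data_compression = 'PAGE';

    -- Apply page compression (row, prefix, and dictionary compression per page).
    ALTER TABLE dbo.Orders
    REBUILD WITH (DATA_COMPRESSION = PAGE);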

19 Architecting for Windows & SQL Server: 1
One I/O thread in Windows per LUN
Need multiple LUNs for optimal performance
Want parallelization through the NTFS layer
Use this to explain to customers why many LUNs are good: Windows needs more than one I/O thread

20 NTFS Tuning – Want Multiple LUNs
One I/O thread per LUN in Windows
Administrators moving files: copies are faster with many LUNs
- One LUN: 2 file copies split the use of 1 I/O thread
- Two LUNs: 2 file copies each get their own I/O thread = twice as fast

21 Architecting for Windows & SQL Server: 2
Separate workloads by LUN
- Optimize threads
- Minimize latency for small packets
SQL workloads:
- Data: 70/30 read/write, 8k and 64k I/O
- Log: 100% write, 1-4k I/O
- Tempdb: 50/50 read/write, 64k I/O
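In practice this separation usually just means putting data, log, and tempdb files on different drive letters, each backed by its own LUN. A minimal sketch; the database name, paths, and sizes are hypothetical placeholders:

    -- Hypothetical layout: D:\ = data LUN, L:\ = log LUN, T:\ = tempdb LUN.
    CREATE DATABASE SalesDB
    ON PRIMARY
        (NAME = SalesDB_data, FILENAME = 'D:\SQLData\SalesDB.mdf', SIZE = 50GB)
    LOG ON
        (NAME = SalesDB_log,  FILENAME = 'L:\SQLLogs\SalesDB.ldf', SIZE = 10GB);

    -- Move the tempdb data file onto its own LUN (takes effect after a service restart).
    ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, FILENAME = 'T:\TempDB\tempdb.mdf');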

22 Architecting for Windows & SQL Server: 3
MPIO (multipathing)
Total paths = host ports times array ports
A typical client host has one dual-port card hitting four ports on the active gateway: 2 x 4 = 8 paths
With 2 switches, the path count goes down by half
Want 4 paths per 8Gb Fibre Channel and 10Gb iSCSI port
Want 8 paths per 16Gb Fibre Channel port
Use more LUNs if you cannot get more ports

23 Architecting for Windows & SQL Server: 4
Two switches
- Common in production
- Splits the host ports into two separate zones
- Divides the total number of paths in half: 1 host port times 2 array ports = 2 paths
- You will not be able to use the full performance of 8Gb or 16Gb Fibre Channel ports

24 Architecting for I/O
Rule of Many: at every layer, use many of each component to reduce bottlenecks
- Database files
- Virtual disks
- MPIO paths
- Physical ports
- LUNs
Parallelization: use many objects and many processes to increase parallel workloads
- Spread transactions over pages
- Use a switch (path multiplier)
- Use several LUNs
- Increase MAXDOP (and test; see the sketch below)
I/O latency is sacred: don't add anything to the I/O path that doesn't need to be there
- LVM (Logical Volume Manager)
- Virtualization
- Compression
- De-dup
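For the MAXDOP item above, the setting is a server-level configuration option. A minimal sketch; the value 8 is only a placeholder, and the right number depends on cores per NUMA node and on testing against the real workload:

    -- View and change max degree of parallelism (requires 'show advanced options').
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    EXEC sp_configure 'max degree of parallelism', 8;   -- placeholder value; test!
    RECONFIGURE;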

25 Perfmon (Performance Monitor)
Ships with all versions of Windows; type "perf" into the search window
(Screenshots: Windows Server 2012 and 2008 R2)

26 Live Stats Stats refresh live about once per second
Latency only goes down to the millisecond (0.000)

27 Recommended Stats Works for SQL or any generic Windows host.
We do not have Exchange, SharePoint or Hyper-V specific stats

28 Perfmon Recommendations
Sample interval: 5 seconds
Run duration: 24 hours minimum, 3 days better, 1 week best
Capture the heaviest workload days and times: beginning/end of month, backups, weekends versus weekdays
One collection per Windows/SQL host
Include all volumes and mount points
Get the customer to document what each volume/mount point hosts (database data, logs, etc.)
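As a complement to the Perfmon counters, SQL Server keeps cumulative per-file I/O statistics that yield average read and write latency per database file. A minimal sketch against sys.dm_io_virtual_file_stats; the numbers are cumulative since instance startup, so sample twice and take the difference for an interval measurement:

    -- Average read/write latency (ms) per database file since instance startup.
    SELECT
        DB_NAME(vfs.database_id)                             AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_writes,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
        ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
    ORDER BY avg_read_ms DESC;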

29 DiskSpd: New Tool Replaces SQLIO
Microsoft needed a new tool; SQLIO could not handle modern SAN benchmarking:
- Unique file data and unique blocks for writes (needed for deduplication-based arrays)
- Mixed read/write workloads (SQLIO was read-only or write-only)
- Higher performance and NUMA aware
- Advanced reporting (latency distribution, stats per file and per workload)
DiskSpd is a scriptable command-line tool
See the new white paper on DiskSpd and SAN benchmarking best practices
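For orientation, a DiskSpd invocation approximating the OLTP pattern on the next slide (8k blocks, 70/30 read/write, random) might look roughly like the sketch below; the flags shown are standard DiskSpd options, but the target path, sizes, and durations are placeholders, and the white paper mentioned above remains the authority on methodology:

    :: 8 KB random I/O, 30% writes, 8 threads, 8 outstanding I/Os per thread,
    :: 60-second run, latency statistics enabled, software and hardware caching disabled,
    :: against a 50 GB test file (path is a placeholder).
    diskspd -b8K -d60 -t8 -o8 -r -w30 -L -Sh -c50G D:\diskspd\testfile.dat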

30 Real Database Workloads / Benchmark Testing
Data access patterns:
- OLTP: 70/30 read/write ratio, 8k batches
- DW/BI: 70/30 read/write ratio, 64k/128k batches
- Tempdb: 50/50 read/write ratio, 64k batches
- Log: 100% writes, 4k batches
Beware of 4k, 100% read "hero" tests
Latency accelerates applications; IOPS is just how many apps can be hosted
Choice is a good thing; always-on dedup, not so much
Beware of insufficient hardware (CPU, FC or iSCSI ports, etc.)
Weigh SAN features against SQL Server or 3rd-party features

31 Questions ?

