Storage Systems CSE 598d, Spring 2007 Lecture ?: Rules of thumb in data engineering Paper by Jim Gray and Prashant Shenoy Feb 15, 2007
Contents
Examination of rules of thumb in data engineering
–Moore’s law
–Amdahl’s rules
–Gilder’s law
Technological trends, and how or whether existing rules of thumb need to be rethought
Moore’s Law
Circuit densities grow 4x every 3 years
–100x increase in a decade
–More generally: Ax every B years
–Originally meant for RAM
    Implies an extra bit of addressing every 18 months
    From 16 bits of addressing in the 1970s (1 MB) to 64-bit addressing these days (several GB)
–Extended to CPU and storage
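The arithmetic behind these figures can be checked with a short Python sketch. The function names are illustrative and not from the lecture; the only input taken from the slide is the 4x-every-3-years rate.

```python
import math

def growth_factor(a: float, b_years: float, years: float) -> float:
    """Total growth after `years` if density grows a-fold every b_years."""
    return a ** (years / b_years)

def years_per_address_bit(a: float, b_years: float) -> float:
    """Years for capacity to double, i.e. to need one more address bit."""
    return b_years * math.log(2) / math.log(a)

decade = growth_factor(4, 3, 10)        # close to the 100x quoted on the slide
per_bit = years_per_address_bit(4, 3)   # about 1.5 years, i.e. 18 months
print(f"{decade:.0f}x per decade, one address bit every {per_bit * 12:.0f} months")
```

Plugging in other (A, B) pairs gives the "Ax every B years" generalization for CPU or storage.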
Disk parameters over time
Moore’s law applied to HDD
Disk capacity has increased more than 100x in the last decade!
–Areal density up from 20 Mbpsi to 35 Gbpsi
However, data rate has increased only 30x
–Capacity / accesses per second growing 10x per decade
–Capacity / bandwidth growing 10x per decade
Implications:
–Disk accesses becoming more precious
–Disk data becoming “cooler”
Closer look at the implications
Discussion
–Does the increase in disk capacity mean applications are also using correspondingly large stores?
–Why are disk accesses per second going up?
    Recall that these have grown more slowly than areal density
    10 years ago: 30 kaps (kilobyte accesses per second) for 1 GB of data
    Today: 120 kaps for 80 GB of data
–That is, only 1.5 kaps per GB, down from 30 kaps per GB
–HDD data needs to be 20x cooler than it was 10 years ago
–Use large main memories (caching)
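The "cooler data" figure follows directly from the two data points on this slide. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def kaps_per_gb(kaps: float, capacity_gb: float) -> float:
    """Access rate available per GB of stored data."""
    return kaps / capacity_gb

then = kaps_per_gb(30, 1)     # 10 years ago: 30 kaps over 1 GB
now = kaps_per_gb(120, 80)    # today: 120 kaps over 80 GB
cooling = then / now          # how much less often each GB can be touched

print(f"then: {then} kaps/GB, now: {now} kaps/GB, data must be {cooling:.0f}x cooler")
```

Raw access rate went up 4x, but capacity went up 80x, so the access rate per stored GB fell.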
Costly disk accesses have led to...
Preferring a few large transfers over many small ones
Preferring sequential transfers
–Log-structured file systems
Mirroring rather than other forms of redundancy
Cost trends
Historically
–Tape:HDD:RAM cost ratio has been 1:10:1000
Calculation for a modern system gives 1:3:300
–Disk prices are approaching tape prices
    Disks are replacing tapes in several domains
–Cost/MB for RAM declines 100x in a decade
    What is economical to put on disk today may be economical to put in RAM in 10 years
–RAM is taking over much of the role of the HDD, and the HDD much of the role of tape
Storage management costs exceed device costs
Admins are required to manage more and more data
–Automation and self-manageability becoming crucial
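The "disk today, RAM in 10 years" claim can be sanity-checked from the ratios on this slide. The sketch below assumes RAM cost per MB falls at a steady exponential rate of 100x per decade and compares against today's disk price held fixed; the function name is hypothetical.

```python
import math

def years_until_ram_matches_disk(ram_to_disk_cost_ratio: float,
                                 ram_decline_per_decade: float = 100.0) -> float:
    """Years for RAM cost/MB to fall to today's disk cost/MB.

    Assumes a steady exponential decline (an idealization).
    """
    return 10.0 * math.log10(ram_to_disk_cost_ratio) / math.log10(ram_decline_per_decade)

historical = years_until_ram_matches_disk(1000 / 10)  # HDD:RAM = 10:1000
modern = years_until_ram_matches_disk(300 / 3)        # HDD:RAM = 3:300

print(f"historical ratio: {historical:.1f} years, modern ratio: {modern:.1f} years")
```

Both ratios put RAM 100x above disk, which is one decade of RAM price decline at the stated rate.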
Amdahl’s System Balance Rules
Parallelism law
–Expresses the maximum achievable speedup in terms of the parallelizable fraction of a computation
Balanced system law
–A system needs 1 bit of IO per second for each instruction per second
    IO rate (bits/sec) = IPS
Memory law
–MB/MIPS ratio in a balanced system is 1
IO law
–Programs do one IO per 50,000 instructions
How have these rules changed over time?
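The parallelism law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of processors. A minimal sketch, with illustrative function names that are not from the lecture:

```python
import math

def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup on `processors` units when `parallel_fraction` of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

def amdahl_limit(parallel_fraction: float) -> float:
    """Maximum achievable speedup as the processor count grows without bound."""
    serial = 1.0 - parallel_fraction
    return math.inf if serial == 0.0 else 1.0 / serial

for n in (2, 8, 64):
    print(n, round(amdahl_speedup(0.9, n), 2))
print("limit:", round(amdahl_limit(0.9), 2))
```

Even with 90% of the work parallelizable, the serial 10% caps the speedup near 10x regardless of processor count.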
Methodology
–Rely on well-regarded benchmarks
    TPC-C (random) and TPC-H (sequential)
Revisions to Amdahl’s laws
–Balanced system law: measure instruction rate and IO rate on the relevant workload
–Memory law: MB to MIPS ratio rising from 1 to 4
    Reiterates the growth in RAM as disk IOs become expensive
–IO law: workload dependent
    The original instructions-per-IO figure was geared toward random IO
    Increased sequentiality (discussed earlier) in disk accesses means more instructions per IO
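One way to apply the balanced system and memory laws is to compute the two ratios for a measured system and compare them with the rule values. The sketch below uses a made-up system profile; the class, the function names, and all numbers are assumptions for illustration, not benchmark results.

```python
from dataclasses import dataclass

@dataclass
class SystemProfile:
    """Measured characteristics of a hypothetical system under one workload."""
    instructions_per_sec: float
    io_bits_per_sec: float
    memory_mb: float

def io_bits_per_instruction(s: SystemProfile) -> float:
    """Balanced system law: Amdahl's rule puts this ratio at 1."""
    return s.io_bits_per_sec / s.instructions_per_sec

def mb_per_mips(s: SystemProfile) -> float:
    """Memory law: 1 in Amdahl's rule, rising toward 4 per the revision above."""
    return s.memory_mb / (s.instructions_per_sec / 1e6)

# Hypothetical machine: 500 MIPS, 400 Mbit/s of IO, 2000 MB of RAM.
example = SystemProfile(instructions_per_sec=500e6,
                        io_bits_per_sec=400e6,
                        memory_mb=2000)

print("IO bits per instruction:", io_bits_per_instruction(example))
print("MB per MIPS:", mb_per_mips(example))
```

Both ratios depend on the workload used for measurement, which is why the slide ties the methodology to specific benchmarks.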
Gilder’s Law
Network bandwidth would triple every year for the next 25 years (prediction made in 1995)
Link bandwidth triples every four years
Network messages used to cost more instructions per message and more instructions per byte than disk IOs
–Network protocol processing overheads
–These overheads have been reduced by smarter NICs
Cost comparison
–Moving data over a WAN is much more expensive than moving it from a local disk over a LAN
Related: the cost of shipping large disk arrays or entire computers is comparable to the cost of transferring the data over the Internet
–However, this price gap is likely to decline soon, and bandwidth should be plentiful within a decade
Implication: local disks could then be used as caches (or prefetch buffers), with the main data store being remote
–Saves on local storage management costs
–The managed data center model is already seen!
Caching
5 minute rule for random workloads
1 minute rule for sequential workloads
Web caches
–Cache everything!
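The five-minute rule comes from a break-even calculation: keep a page in RAM if it is re-referenced more often than the interval at which the cost of the RAM holding it equals the cost of the disk accesses saved. The slide does not give the formula; the sketch below uses the commonly cited form, and every input value is an illustrative placeholder rather than a figure from the lecture.

```python
def break_even_interval_sec(pages_per_mb_ram: float,
                            accesses_per_sec_per_disk: float,
                            price_per_disk: float,
                            price_per_mb_ram: float) -> float:
    """Reference interval (seconds) below which caching a page in RAM pays off."""
    technology_ratio = pages_per_mb_ram / accesses_per_sec_per_disk
    economic_ratio = price_per_disk / price_per_mb_ram
    return technology_ratio * economic_ratio

# Placeholder inputs (assumptions): 8 KB pages (128 per MB),
# 64 random accesses/sec per disk, $2000 per disk, $15 per MB of RAM.
interval = break_even_interval_sec(128, 64, 2000, 15)
print(f"break-even interval: {interval:.0f} s, about {interval / 60:.1f} minutes")
```

Larger pages or sequential access raise the effective access rate per disk, which shortens the break-even interval and is consistent with a shorter rule for sequential workloads.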