1 Designing for 20TB Disk Drives and Enterprise Storage. Jim Gray, Microsoft Research.

Presentation transcript:

1 Designing for 20TB Disk Drives and Enterprise Storage. Jim Gray, Microsoft Research

2 Disk Evolution. Capacity: 100x in 10 years. 1 TB 3.5" drive in TB? in 2012?! System on a chip. High-speed SAN. Disk replacing tape. Disk is a supercomputer! Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta.

3 Disks are becoming computers. Smart drives: camera with micro-drive, Replay / Tivo / Ultimate TV, phone with micro-drive, MP3 players, tablet, Xbox, many more… Disk controller + 1 GHz CPU + 1 GB RAM. Comm: Infiniband, Ethernet, radio… Applications: Web, DBMS, Files. OS.

4 Intermediate Step: Shared Logic. Brick with 8-12 disk drives. 200 mips/arm (or more). 2x Gbps Ethernet. General purpose OS. 10k$/TB to 100k$/TB. Shared: sheet metal, power, support/config, security, network ports. These bricks could run applications (e.g. SQL or Mail or ...). Examples: Snap ~1TB 12x80GB NAS; NetApp ~.5TB 8x70GB NAS; Maxtor ~2TB 12x160GB NAS; IBM TotalStorage ~360GB 10x36GB NAS.

5 Hardware. Homogeneous machines lead to quick response through reallocation. HP desktop machines, 320MB RAM, 3u high, 4 100GB IDE drives. $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB. 3 weeks from ordering to operational. Slide courtesy of Brewster Archive.org

6 Disk as Tape. Tape is unreliable, specialized, slow, low density, not improving fast, and expensive. Using removable hard drives to replace tape's function has been successful. When a tape is needed, the drive is put in a machine and it is online; there is no need to copy from tape before it is used. Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good. Slide courtesy of Brewster Archive.org

7 Disk As Tape: What format? Today I send NTFS/SQL disks, but that is not a good format for Linux. Solution: ship NFS/CIFS/ODBC servers (not disks). Plug disk into LAN: DHCP, then file or DB server via standard interface. Web Service in the long term.

8 State is Expensive. Stateless clones are easy to manage: app servers are middle tier; cost goes to zero with Moore's law; one admin per 1,000 clones; good story about scaleout. Stateful servers are expensive to manage: 1TB to 100TB per admin; storage cost is going to zero (2k$ to 200k$). Cost of storage is management cost.

9 Databases (== SQL). VLDB survey (Winter Corp): 10 TB to 100TB DBs. Size doubling yearly, riding disk Moore's law. 10,000 disks at 18GB is 100TB cooked. Mostly DSS and data warehouses. Some media managers.

10 Interesting facts. No DBMSs beyond 100TB. Most bytes are in files. The web is file centric. Science (and batch) is file centric. But… SQL performance is better than CIFS/NFS. CISC vs RISC.

11 BaBar: the biggest DB. 500 TB. Uses Objectivity. SLAC events. Linux cluster scans DB looking for patterns.

12 TB (cooked) Hotmail / Yahoo. Clone front ends; application servers. Hotmail: get mail box, get/put mail. Disk bound. ~30,000 disks. ~20 admins.

13 AOL (msn) (1PB?). 10 B transactions per day (10% of that). Huge storage. Huge traffic. Lots of eye candy. DB used for security/accounting. GUESS: AOL is a petabyte (40M x 10MB = 400 x 10^12 bytes).

14 Google. 1.5PB as of last spring. 8,000 no-name PCs, each 1/3U, 2 x 80 GB disk, 2 cpu, 256MB ram. 1.4 PB online. 2 TB ram online. 8 TeraOps. Slice-price is 1K$, so 8M$. 15 admins (!) (== 1/100TB).
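The aggregate figures on this slide can be cross-checked from the per-node numbers. The following is a minimal sketch, assuming decimal units (1 PB = 10^6 GB); the variable names are illustrative and not from the talk.

```python
# Per-node figures come from the slide; names and decimal units are assumptions.
NODES = 8_000
DISKS_PER_NODE = 2
DISK_GB = 80
RAM_MB_PER_NODE = 256
SLICE_PRICE_USD = 1_000
ADMINS = 15

raw_disk_pb = NODES * DISKS_PER_NODE * DISK_GB / 1e6  # GB -> PB
ram_tb = NODES * RAM_MB_PER_NODE / 1e6                # MB -> TB
hardware_cost_musd = NODES * SLICE_PRICE_USD / 1e6    # $ -> M$
tb_per_admin = raw_disk_pb * 1_000 / ADMINS           # TB managed per admin

print(f"raw disk:      {raw_disk_pb:.2f} PB")
print(f"RAM online:    {ram_tb:.2f} TB")
print(f"hardware cost: {hardware_cost_musd:.0f} M$")
print(f"storage/admin: {tb_per_admin:.0f} TB")
```

Under these assumptions the raw disk total comes to 1.28 PB, a little under the slide's 1.4 PB, so the per-node figures are presumably rounded. The RAM, cost, and roughly one-admin-per-100TB figures agree with the slide.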

15 Astronomy. I've been trying to apply DB to astronomy. Today they are at 10TB per data set, heading for petabytes. Using Objectivity. Trying SQL (talk to me offline).

16 Scale Out: Buy Computing by the Slice. 709,202 tpmC! == 1 billion transactions/day. Slice: 8 cpu, 8GB, 100 disks (=1.8TB). 20 ktpmC per slice, ~300k$/slice. Clients and 4 DTC nodes not shown.
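The "1 billion transactions/day" equivalence is simple unit conversion from transactions per minute. A small sketch, with hypothetical function and variable names:

```python
MINUTES_PER_DAY = 24 * 60


def transactions_per_day(tpmc: int) -> int:
    """Convert a transactions-per-minute rate into transactions per day."""
    return tpmc * MINUTES_PER_DAY


total_per_day = transactions_per_day(709_202)

# Number of 20 ktpmC slices implied by the total throughput.
slices_needed = 709_202 / 20_000

print(f"{total_per_day:,} transactions/day")
print(f"{slices_needed:.1f} slices at 20 ktpmC each")
```

The result is a little over one billion per day, and the throughput corresponds to roughly 35 slices at the slide's per-slice rate.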

17 ScaleUp: A Very Big System! UNISYS, Windows 2000 Data Center Limited Edition. 32 cpus on 32 GB of RAM and 1,061 disks (15.5 TB). Will be helped by 64-bit addressing. 24 fiber channel.

18 Hardware. One SQL database per rack (SQL\Inst1, SQL\Inst2, SQL\Inst3, Spare). Each rack contains 4.5 TB. 261 total drives / 13.7 TB total. Meta data stored on 101 GB fast, small disks (18 x 18.2 GB). Imagery data stored on GB slow, big disks (15 x 73.8 GB). To add GB disks in Feb 2001 to create 18 TB SAN. 8 Compaq DL360 Photon web servers. Fiber SAN switches. 4 Compaq ProLiant 8500 DB servers.

19 Amdahl's Balance Laws. Parallelism law: if a computation has a serial part S and a parallel component P, then the maximum speedup is (S+P)/S. Balanced system law: a system needs a bit of IO per second per instruction per second: about 8 MIPS per MBps. Memory law: α = 1: the MB/MIPS ratio (called alpha, α) in a balanced system is 1. IO law: programs do one IO per 50,000 instructions.
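The four laws are all simple ratios, so they can be written down as one-line functions. This is an illustrative sketch only; the function names and default parameters are assumptions chosen to mirror the constants on the slide.

```python
def max_speedup(serial: float, parallel: float) -> float:
    """Parallelism law: best possible speedup is (S + P) / S."""
    if serial <= 0:
        raise ValueError("serial part must be positive")
    return (serial + parallel) / serial


def balanced_io_mbps(mips: float, mips_per_mbps: float = 8.0) -> float:
    """Balanced system law: one bit of IO per instruction, i.e. 8 MIPS per MBps."""
    return mips / mips_per_mbps


def balanced_memory_mb(mips: float, alpha: float = 1.0) -> float:
    """Memory law: alpha is the MB/MIPS ratio, 1 in a balanced system."""
    return alpha * mips


def ios_per_second(mips: float, instructions_per_io: float = 50_000) -> float:
    """IO law: one IO per 50,000 instructions."""
    return mips * 1e6 / instructions_per_io


# Example: a 1,000 MIPS processor running a job that is 10% serial.
print(max_speedup(serial=1, parallel=9))  # 10.0
print(balanced_io_mbps(1_000))            # 125.0 MBps
print(balanced_memory_mb(1_000))          # 1000.0 MB
print(ios_per_second(1_000))              # 20000.0 IO/s
```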

20 Amdahl's Laws Valid 35 Years Later? Parallelism law is algebra: so SURE! Balanced system laws? Look at TPC results (tpcC, tpcH). Some imagination needed: What's an instruction (CPI varies from 1-3)? RISC, CISC, VLIW, … clocks per instruction, … What's an I/O?

21 Disks / cpu, TPC systems. Normalize for CPI (clocks per instruction). TPC-C has about 7 ins/byte of IO. TPC-H has 3 ins/byte of IO. TPC-H needs ½ as many disks, sequential vs random. Both use 9GB 10 krpm disks (need arms, not bytes). [Table: columns MHz/cpu, CPI, mips, KB/IO, IO/s/disk, Disks, MB/s/cpu, Ins/IO byte; rows Amdahl, TPC-C (random), TPC-H (sequential).]

22 TPC systems: What's alpha (= MB/MIPS)? Hard to say: Intel has 32-bit addressing (= 4GB limit), known CPI. IBM, HP, Sun have 64 GB limit, unknown CPI. Look at both, guess CPI for IBM, HP, Sun. Alpha is between 1 and 6.
            Mips                 Memory   Alpha
Amdahl      1                    1        1
tpcC Intel  8 x 262 = 2 Gips     4 GB     2
tpcH Intel  8 x 458 = 4 Gips     4 GB     1
tpcC IBM    24 cpus ?= 12 Gips   64 GB    6
tpcH HP     32 cpus ?= 16 Gips   32 GB    2
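Alpha is just memory divided by instruction rate, so the slide's values can be recomputed from its Gips and GB figures. A minimal sketch; the dictionary layout and names are assumptions, and the IBM and HP instruction rates are the slide's own guesses.

```python
def alpha(memory_gb: float, gips: float) -> float:
    """MB/MIPS ratio; GB per Gips is numerically the same thing."""
    return memory_gb / gips


# (memory in GB, instruction rate in Gips) as listed on the slide.
systems = {
    "tpcC Intel": (4, 2),
    "tpcH Intel": (4, 4),
    "tpcC IBM": (64, 12),  # CPI guessed on the slide
    "tpcH HP": (32, 16),   # CPI guessed on the slide
}

alphas = {name: alpha(gb, gips) for name, (gb, gips) in systems.items()}

for name, value in alphas.items():
    print(f"{name}: alpha = {value:.1f}")
```

The IBM entry works out to about 5.3, which the slide reports as 6; the other three match the slide exactly.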

23 Performance (on current SDSS data). Run times on 15k$ COMPAQ server (2 cpu, 1 GB, 8 disk): some take 10 minutes, some take 1 minute, median ~22 sec. GHz processors are fast! (10 mips/IO, 200 ins/byte). 2.5 m rec/s/cpu. ~1,000 IO/cpu sec. ~64 MB IO/cpu sec.

24 How much storage do we need? Soon everything can be recorded and indexed. Most bytes will never be seen by humans. Data summarization, trend detection, anomaly detection are key technologies. See Mike Lesk: How much information is there. See Lyman & Varian: How much information. [Figure: storage scale running Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, marking A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, Everything Recorded.] Small prefixes: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli.

25 Standard Storage Metrics. Capacity: RAM: MB and $/MB: today at 512MB and 200$/GB. Disk: GB and $/GB: today at 80GB and 70k$/TB. Tape: TB and $/TB: today at 40GB and 10k$/TB (nearline). Access time (latency): RAM: 100 ns; Disk: 15 ms; Tape: 30 second pick, 30 second position. Transfer rate: RAM: 1-10 GB/s; Disk: MB/s (arrays can go to 10GB/s); Tape: 5-15 MB/s (arrays can go to 1GB/s).

26 New Storage Metrics: Kaps, Maps, SCAN. Kaps: how many kilobyte objects served per second; the file server, transaction processing metric; this is the OLD metric. Maps: how many megabyte objects served per second; the multi-media metric. SCAN: how long to scan all the data; the data mining and utility metric. And Kaps/$, Maps/$, TBscan/$.
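One way to make the three metrics concrete is a very simple disk model in which each request costs one access latency plus the transfer time. This model is an assumption for illustration, not something the slides specify. The inputs are figures quoted elsewhere in the deck: 15 ms disk latency, an 80 GB drive, and 30 MB/s sequential transfer.

```python
def objects_per_second(object_bytes: float, access_s: float, transfer_bps: float) -> float:
    """Requests served per second when each costs one access plus its transfer time."""
    return 1.0 / (access_s + object_bytes / transfer_bps)


def scan_seconds(capacity_bytes: float, transfer_bps: float) -> float:
    """Time to read the whole device sequentially."""
    return capacity_bytes / transfer_bps


ACCESS_S = 0.015      # 15 ms disk latency
TRANSFER_BPS = 30e6   # 30 MB/s sequential rate
CAPACITY_BYTES = 80e9 # 80 GB drive

kaps = objects_per_second(1e3, ACCESS_S, TRANSFER_BPS)  # kilobyte objects/s
maps = objects_per_second(1e6, ACCESS_S, TRANSFER_BPS)  # megabyte objects/s
scan = scan_seconds(CAPACITY_BYTES, TRANSFER_BPS)

print(f"Kaps: {kaps:.1f}")
print(f"Maps: {maps:.1f}")
print(f"SCAN: {scan / 60:.0f} minutes")
```

In this model Kaps is dominated by the access latency and Maps by the transfer rate, which is why the two metrics respond differently to disk improvements.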

27 More Kaps and Kaps/$ but… Disk accesses got much less expensive: better disks, cheaper disks! But disk arms are expensive, the scarce resource. 1 hour scan vs 5 minutes in GB 30 MB/s

28 Data on Disk Can Move to RAM in 10 years. [Figure labels: 100:1, 10 years.]

29 The Absurd 10x (= 4 year) Disk. 1 TB, 100 MB/s, 200 Kaps. 2.5 hr scan time (poor sequential access). 1 aps / 5 GB (VERY cold data). It's a tape!
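The scan time and access density follow directly from the three headline numbers. A quick check, assuming decimal units and using illustrative names:

```python
CAPACITY_GB = 1_000   # 1 TB drive
TRANSFER_MBPS = 100   # sequential transfer rate, MB/s
KAPS = 200            # kilobyte accesses per second

# Full sequential scan: capacity in MB divided by MB/s, converted to hours.
scan_hours = CAPACITY_GB * 1_000 / TRANSFER_MBPS / 3_600

# Access density: how many GB share each access per second.
gb_per_aps = CAPACITY_GB / KAPS

print(f"scan time: {scan_hours:.1f} hours")
print(f"1 access/s per {gb_per_aps:.0f} GB")
```

With these inputs the scan takes about 2.8 hours, in line with the slide's rounded 2.5 hr, and there is one access per second for every 5 GB stored.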

30 It's Hard to Archive a Petabyte. It takes a LONG time to restore it: at 1GBps it takes 12 days! Store it in two (or more) places online (on disk?): a geo-plex. Scrub it continuously (look for errors). On failure, use the other copy until the failure is repaired, then refresh the lost copy from the safe copy. Can organize the two copies differently (e.g.: one by time, one by space).
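The 12-day figure is the petabyte divided by the restore bandwidth. A small sketch, assuming decimal units (1 PB = 10^15 bytes, 1 GBps = 10^9 bytes/s); the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 3_600


def restore_days(size_bytes: float, bandwidth_bps: float) -> float:
    """Days needed to stream size_bytes at a sustained bandwidth in bytes/s."""
    return size_bytes / bandwidth_bps / SECONDS_PER_DAY


petabyte_days = restore_days(1e15, 1e9)
print(f"{petabyte_days:.1f} days to restore 1 PB at 1 GBps")
```

This gives about 11.6 days, which the slide rounds to 12.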

31 Auto Manage Storage. 1980 rule of thumb: a DataAdmin per 10GB, SysAdmin per mips. 2000 rule of thumb: a DataAdmin per 5TB, SysAdmin per 100 clones (varies with app). Problem: 5TB is 50k$ today, 5k$ in a few years. Admin cost >> storage cost!!!! Challenge: automate ALL storage admin tasks.

32 How to cool disk data. Cache data in main memory (see 5 minute rule later in presentation). Fewer, larger transfers: larger pages (512 -> 8KB -> 256KB). Sequential rather than random access: random 8KB IO is 1.5 MBps, sequential IO is 30 MBps (20:1 ratio is growing). Raid1 (mirroring) rather than Raid5 (parity).
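The two throughput figures imply a fixed per-request overhead, and that overhead explains why larger pages help. The sketch below assumes each random IO costs a constant overhead plus transfer time at the sequential rate. That model, the decimal KB, and all names are assumptions for illustration.

```python
RANDOM_MBPS = 1.5  # random 8KB IO throughput from the slide
SEQ_MBPS = 30.0    # sequential throughput from the slide
PAGE_KB = 8

ios_per_sec = RANDOM_MBPS * 1_000 / PAGE_KB  # random IOs per second
ratio = SEQ_MBPS / RANDOM_MBPS               # sequential : random

# Per-request overhead (seek + rotation) implied by the two rates, in seconds.
page_mb = PAGE_KB / 1_000
overhead_s = page_mb / RANDOM_MBPS - page_mb / SEQ_MBPS


def random_throughput_mbps(page_kb: float) -> float:
    """Random-IO throughput for a given page size under the fixed-overhead model."""
    mb = page_kb / 1_000
    return mb / (overhead_s + mb / SEQ_MBPS)


print(f"{ios_per_sec:.1f} IO/s, ratio {ratio:.0f}:1, overhead {overhead_s * 1e3:.2f} ms")
for kb in (0.5, 8, 256):
    print(f"{kb:>5} KB pages: {random_throughput_mbps(kb):.2f} MBps")
```

Under this model, moving from 8KB to 256KB pages lifts random throughput from 1.5 MBps to nearly 19 MBps, most of the way to the sequential rate.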

33 Data delivery costs 1$/GB today. Rent for big customers: 300$/megabit per second per month. Improved 3x in last 6 years (!). That translates to 1$/GB at each end. You can mail a 160 GB disk for 20$. That's 16x cheaper. If overnight, it's 4 MBps. 3x160 GB ~ ½ TB.
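The 1$/GB figure can be derived from the monthly rent of a fully used link. A minimal sketch, assuming a 30-day month, 100% link utilization, decimal units, and that sender and receiver each pay the rent; the names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3_600


def gb_per_month(link_mbps: float) -> float:
    """GB moved in a month by a fully used link of link_mbps megabits/s."""
    return link_mbps * 1e6 / 8 * SECONDS_PER_MONTH / 1e9


def network_cost_per_gb(rent_per_mbps_month: float) -> float:
    """Dollars per GB at one end of the connection."""
    return rent_per_mbps_month / gb_per_month(1.0)


per_gb_one_end = network_cost_per_gb(300)   # 300$/Mbps/month rent
network_cost = 2 * per_gb_one_end * 160     # both ends, 160 GB
mail_cost = 20.0                            # mailing one 160 GB disk
ratio = network_cost / mail_cost

print(f"{gb_per_month(1.0):.0f} GB/month per Mbps")
print(f"{per_gb_one_end:.2f} $/GB at each end")
print(f"network {network_cost:.0f}$ vs mail {mail_cost:.0f}$: {ratio:.1f}x")
```

The unrounded ratio is just under 15x. Rounding the per-end cost up to 1$/GB, as the slide does, gives 2 x 160$ / 20$ = 16x.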