1 Storage Bricks. Jim Gray, Microsoft Research. FAST 2002, Monterey, CA, 29 Jan 2002. Acknowledgements: Dave Patterson explained this to me long ago. Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.
2 First Disk, 1956: IBM 305 RAMAC. 4 MB; 50 x 24" disks; 1200 rpm; 100 ms access; 35k$/y rent. Included computer & accounting software (tubes, not transistors).
3 10 years later 1.6 meters
4 Disk Evolution. Capacity: 100x in 10 years. 1 TB 3.5 drive in GB 1 micro-drive. System on a chip. High-speed SAN. Disk replacing tape. Disk is super computer! Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta.
5 Disks are becoming computers: Smart drives; Camera with micro-drive; Replay / TiVo / Ultimate TV; Phone with micro-drive; MP3 players; Tablet; Xbox; Many more… Disk Ctlr + 1 GHz CPU + 1 GB RAM. Comm: InfiniBand, Ethernet, radio… Applications: Web, DBMS, Files. OS.
6 Data Gravity: Processing Moves to Transducers (smart displays, microphones, printers, NICs, disks). Storage, Network, Display ASIC. Today: P = 50 mips, M = 2 MB. In a few years: P = 500 mips, M = 256 MB. Processing decentralized: moving to data sources, moving to power sources, moving to sheet metal? The end of computers?
7 It's Already True of Printers. Peripheral = CyberBrick. You buy a printer. You get: –several network interfaces –a PostScript engine: CPU, memory, software, a spooler (soon) –and… a print engine.
8 The Absurd Design? Segregate processing from storage: poor locality, much useless data movement. Amdahl's laws: bus: 10 B/ips, io: 1 b/ips. Diagram: Processors ~1 Tips, RAM ~1 TB, Disks ~100 TB; links of 100 GBps and 10 TBps.
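As a rough check of the figures on this slide, the two Amdahl ratios it quotes (10 bytes of bus traffic and 1 bit of I/O per instruction per second) can be applied to a ~1 Tips processor complex. This is only a sketch; the function name and the choice of bytes per second as the unit are illustrative assumptions, not from the talk.

```python
def amdahl_balance(ips):
    """Return (bus_bytes_per_s, io_bytes_per_s) for a processor rate in instructions/s.

    Uses only the two ratios quoted on the slide: bus = 10 B/ips, io = 1 b/ips.
    """
    bus_bytes_per_s = 10 * ips   # 10 bytes of bus traffic per instruction/s
    io_bytes_per_s = ips / 8     # 1 bit of I/O per instruction/s, converted to bytes
    return bus_bytes_per_s, io_bytes_per_s


bus, io = amdahl_balance(1e12)   # ~1 Tips, as on the slide
print(f"bus: {bus / 1e12:.0f} TBps, io: {io / 1e9:.0f} GBps")
```

The bus figure comes out at 10 TBps and the I/O figure at 125 GBps, which is the same order as the 10 TBps and 100 GBps labels on the slide.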
9 The Absurd Disk: 1 TB, 100 MB/s, 200 Kaps, 200$. 2.5 hr scan time (poor sequential access). 1 aps / 5 GB (VERY cold data). It's a tape! Optimizations: –Reduce management costs –Caching –Sequential 100x faster than random
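The two "absurd" ratios follow from the drive parameters on the slide (1 TB, 100 MB/s, 200 Kaps). The sketch below redoes that arithmetic. It assumes decimal units (1 TB = 10^12 bytes) and reads "200 Kaps" as 200 small accesses per second; both are assumptions, and the function names are illustrative.

```python
def scan_hours(capacity_bytes, bandwidth_bytes_per_s):
    """Hours needed to read the whole drive sequentially."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600


def gb_per_access_per_second(capacity_bytes, accesses_per_s):
    """Gigabytes of stored data behind each access/second the arm can deliver."""
    return capacity_bytes / 1e9 / accesses_per_s


CAPACITY = 1e12    # 1 TB (decimal, assumed)
BANDWIDTH = 100e6  # 100 MB/s
ACCESSES = 200     # 200 accesses/s (assumed reading of "200 Kaps")

print(f"scan time: {scan_hours(CAPACITY, BANDWIDTH):.1f} h")
print(f"data per aps: {gb_per_access_per_second(CAPACITY, ACCESSES):.0f} GB")
```

With these inputs the scan takes about 2.8 hours, in line with the slide's round 2.5 hr, and each access per second has 5 GB of data behind it.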
10 Disk = Node: magnetic storage (1TB); processor + RAM + LAN; Management interface (HTTP + SOAP); Application execution environment. Application –File –DB2/Oracle/SQL –Notes/Exchange/TeamServer –SAP/Siebel/… –QuickBooks/TiVo/PC… Stack: OS Kernel, LAN driver, Disk driver, File System, RPC, ..., Services, DBMS, Applications
11 Implications. Conventional: Offload device handling to NIC/HBA; higher level protocols: I2O, NASD, VIA, IP, TCP…; SMP and Cluster parallelism is important. Radical: Move app to NIC/device controller; higher-higher level protocols: SOAP/DCOM/RMI..; Cluster parallelism is VERY important. Diagram labels: Central Processor & Memory; Terabyte/s Backplane.
12 Intermediate Step: Shared Logic. Brick with 8-12 disk drives; 200 mips/arm (or more); 2 x Gbps Ethernet; General purpose OS; 10k$/TB to 50k$/TB. Shared: –Sheet metal –Power –Support/Config –Security –Network ports. These bricks could run applications (e.g. SQL or Mail or ...). Examples: Snap ~1TB 12x80GB NAS; NetApp ~.5TB 8x70GB NAS; Maxtor ~2TB 12x160GB NAS
13 Example: Homogeneous machines lead to quick response through reallocation. HP desktop machines, 320MB RAM, 3u high, 4 100GB IDE drives. $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB. JIT storage & processing: 3 weeks from order to deploy. Slide courtesy of Brewster, Archive.org
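The per-terabyte ratios on this slide can be recovered from the machine description (4 x 100GB drives and 320MB RAM per desktop). A small sketch, assuming one processor per machine and decimal units; neither assumption is stated on the slide, and the function name is illustrative.

```python
def per_terabyte(drives_per_machine, drive_gb, ram_mb_per_machine):
    """Return (machines_per_tb, ram_gb_per_tb) for a farm of identical machines."""
    machines_per_tb = 1000 / (drives_per_machine * drive_gb)
    ram_gb_per_tb = machines_per_tb * ram_mb_per_machine / 1000
    return machines_per_tb, ram_gb_per_tb


machines, ram_gb = per_terabyte(drives_per_machine=4, drive_gb=100, ram_mb_per_machine=320)
print(f"{machines} machines/TB, {ram_gb} GB RAM/TB")
```

This gives 2.5 machines (hence 2.5 processors) per terabyte and 0.8 GB of RAM per terabyte, which the slide rounds to 1GB RAM/TB.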
14 What if Disk Replaces Tape? How does it work? Backup/Restore –RAID (among the federation) –Snapshot copies (in most OSs) –remote replicas (standard in DBMS and FS) Archive –Use cold 95% of disk space Interchange –Send computers not disks.
15 It's Hard to Archive a Petabyte. It takes a LONG time to restore it: at 1GBps it takes 12 days! Store it in two (or more) places online: a geo-plex. Scrub it continuously (look for errors). On failure, –use other copy until failure repaired, –refresh lost copy from safe copy. Can organize the two copies differently (e.g.: one by time, one by space)
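The restore-time figure and the scrub-and-refresh loop on this slide can be illustrated with a short sketch. It is a simplification, not the design from the talk: it assumes both copies use the same block layout (the slide notes they may be organized differently), it assumes a checksum was recorded for every block at write time, and every name in it is hypothetical.

```python
import hashlib


def restore_days(size_bytes, rate_bytes_per_s):
    """Days needed to stream a full restore at a constant rate."""
    return size_bytes / rate_bytes_per_s / 86400


def checksum(block):
    return hashlib.sha256(block).hexdigest()


def scrub(copy_a, copy_b, expected):
    """Check every block of both copies against its recorded checksum.

    A damaged block is refreshed from the other (safe) copy.
    Returns a list of (copy_name, block_index) pairs that were repaired.
    """
    repaired = []
    for i, want in enumerate(expected):
        ok_a = checksum(copy_a[i]) == want
        ok_b = checksum(copy_b[i]) == want
        if ok_a and not ok_b:
            copy_b[i] = copy_a[i]
            repaired.append(("b", i))
        elif ok_b and not ok_a:
            copy_a[i] = copy_b[i]
            repaired.append(("a", i))
        elif not ok_a and not ok_b:
            raise RuntimeError(f"block {i} is damaged in both copies")
    return repaired


blocks = [b"alpha", b"beta", b"gamma"]
expected = [checksum(b) for b in blocks]
site_a, site_b = list(blocks), list(blocks)
site_b[1] = b"bxta"                       # simulate silent corruption at one site
print(scrub(site_a, site_b, expected))    # the damaged block at site b is refreshed
print(f"{restore_days(1e15, 1e9):.1f} days to restore 1 PB at 1 GBps")
```

At 1 GBps a petabyte takes about 11.6 days to stream, which is the slide's 12 days.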
16 Archive to Disk: 100TB for 0.5M$ free petabytes. If you have 100 TB active you need 10,000 mirrored disk arms (see TPC-C). So you have 1.6 PB of (mirrored) storage (160GB drives). Use the empty 95% for archive storage. No extra space or extra power cost. Very fast access (milliseconds vs hours). Snapshot is read-only (software enforced). Makes Admin easy (saves people costs)
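The capacity arithmetic behind this slide can be sketched as follows. It counts the 100 TB of active data once against the raw capacity of all arms; how mirroring should be counted is not spelled out on the slide, so that choice is an assumption, as are the function and parameter names.

```python
def archive_headroom(arms, drive_gb, active_tb):
    """Return (total_tb, free_fraction) for a farm sized by arm count, not capacity."""
    total_tb = arms * drive_gb / 1000
    free_fraction = 1 - active_tb / total_tb
    return total_tb, free_fraction


total_tb, free = archive_headroom(arms=10_000, drive_gb=160, active_tb=100)
print(f"{total_tb / 1000} PB total, {free:.1%} free for archive")
```

Ten thousand 160GB drives give 1.6 PB, of which roughly 94% is left over, close to the slide's 95%.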
17 Disk as Tape Archive. Tape is unreliable, specialized, slow, low density, not improving fast, and expensive. Using removable hard drives to replace tape's function has been successful. When a tape is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used. Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good. Slide courtesy of Brewster, Archive.org
18 Disk as Tape Interchange Tape interchange is frustrating (often unreadable) Beyond 1-10 GB send media not data –FTP takes too long (hour/GB) –Bandwidth still very expensive (1$/GB) Writing DVD not much faster than Internet New technology could change this –100 GB 10MBps would be competitive. Write 1TB disk in 2.5 hrs (at 100MBps) But, how does interchange work?
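To make the "send media not data" argument concrete, the sketch below compares moving data over the network (using the slide's hour/GB and 1$/GB figures) with writing a disk at 100MBps and shipping it. The 24-hour courier time is a purely hypothetical parameter, not from the talk, and the function names are illustrative.

```python
def network_hours(size_gb, hours_per_gb=1.0):
    """Transfer time over the network at the slide's FTP rate (hour/GB)."""
    return size_gb * hours_per_gb


def network_cost_dollars(size_gb, dollars_per_gb=1.0):
    """Bandwidth cost at the slide's 1$/GB."""
    return size_gb * dollars_per_gb


def disk_write_hours(size_gb, mb_per_s=100.0):
    """Time to write the data to a disk at a sequential rate."""
    return size_gb * 1000 / mb_per_s / 3600


def ship_hours(size_gb, courier_hours=24.0, mb_per_s=100.0):
    """Write the disk, then ship it. courier_hours is a hypothetical value."""
    return disk_write_hours(size_gb, mb_per_s) + courier_hours


for size_gb in (1, 10, 100, 1000):
    print(size_gb, "GB:",
          f"network {network_hours(size_gb):.0f} h,",
          f"ship {ship_hours(size_gb):.1f} h,",
          f"bandwidth cost ${network_cost_dollars(size_gb):.0f}")
```

Under these assumptions a terabyte takes about 1000 hours over FTP but about 27 hours by disk. Where the break-even size falls depends almost entirely on the assumed courier time.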
19 Disk As Tape Interchange: What format? Today I send 160GB NTFS/SQL disks. But that is not a good format for Linux/DB2 users. Solution: Ship NFS/CIFS/ODBC servers (not disks) Plug disk into LAN. –DHCP then file or DB server via standard interface. –pull data from server.
20 Some Questions What is the product? How do I manage 10,000 nodes (disks)? How do I program 10,000 nodes (disks)? How does RAID work? How do I backup a PB? How do I restore a PB?
21 What is the Product? Concept: Plug it in and it works! Music/Video/Photo appliance (home) Game appliance PC File server appliance Data archive/interchange appliance Web server appliance DB server appliance Application appliance power network
22 How Does Scale Out Work? Files: well known designs: –rooted tree partitioned across nodes –Automatic cooling (migration) –Mirrors or Chained declustering –Snapshots for backup/archive Databases: well known designs –Partitioning, remote replication similar to files –distributed query processing. Applications: (hypothetical) –Must be designed as mobile objects –Middleware provides object migration system Objects externalize methods to migrate ( == backup/restore/archive) Web services seem to have key ideas (xml representation) –Example: object is mailbox
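Chained declustering, one of the mirroring schemes named on this slide, places the primary copy of fragment i on node i and its backup on the next node in the chain. A minimal sketch of that placement rule follows. The function names are illustrative, and real systems also rebalance read load along the chain after a failure, which is omitted here.

```python
def chained_placement(fragment, n_nodes):
    """Return (primary_node, backup_node) for a fragment under chained declustering."""
    primary = fragment % n_nodes
    backup = (primary + 1) % n_nodes   # backup lives on the next node in the chain
    return primary, backup


def readable_from(fragment, n_nodes, failed):
    """Nodes that can still serve the fragment, given a set of failed nodes."""
    primary, backup = chained_placement(fragment, n_nodes)
    return [node for node in (primary, backup) if node not in failed]


print(chained_placement(3, 4))          # last fragment wraps around to node 0
print(readable_from(3, 4, failed={3}))  # one failure: the backup copy still serves reads
print(readable_from(1, 4, failed={1, 2}))  # two adjacent failures lose fragment 1
```

Any single node failure leaves every fragment readable; data becomes unavailable only when two adjacent nodes in the chain fail together.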
23 Auto Manage Storage 1980 rule of thumb: –A DataAdmin per 10GB, SysAdmin per mips 2000 rule of thumb –A DataAdmin per 5TB –SysAdmin per 100 clones (varies with app). Problem: –5TB is 50k$ today, 5k$ in a few years. –Admin cost >> storage cost !!!! Challenge: –Automate ALL storage admin tasks
24 Admin: TB and guessed $/TB (does not include cost of application, overhead, not substance). Google: 1 : 100TB, 5k$/TB/y; Yahoo!: 1 : 50TB, 20k$/TB/y; DB: 1 : 5TB, 60k$/TB/y; Wall St.: 1 : 1TB, 400k$/TB/y (reported). Hardware is the dominant cost at Google. How can we waste hardware to save people cost?
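To see what these ratios mean at scale, the sketch below applies the slide's guessed figures to a store of a given size. The dictionary simply transcribes the slide's numbers; the function name and the 100 TB example are illustrative.

```python
import math

# (TB managed per admin, guessed cost in k$/TB/year), transcribed from the slide
ADMIN_RATIOS = {
    "Google": (100, 5),
    "Yahoo!": (50, 20),
    "DB": (5, 60),
    "Wall St.": (1, 400),
}


def admin_bill(site, total_tb):
    """Return (admins_needed, yearly_cost_k_dollars) for total_tb at a site's ratio."""
    tb_per_admin, k_per_tb_year = ADMIN_RATIOS[site]
    admins = math.ceil(total_tb / tb_per_admin)
    return admins, total_tb * k_per_tb_year


for site in ADMIN_RATIOS:
    admins, cost_k = admin_bill(site, 100)
    print(f"{site}: {admins} admin(s), {cost_k} k$/y for 100 TB")
```

For the same 100 TB the guessed yearly bill ranges from 500 k$ at the Google ratio to 40,000 k$ at the Wall St. ratio, which is the gap the slide's closing question is about.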
25 How do I manage 10,000 nodes? You can't manage 10,000 x (for any x). They manage themselves. –You manage exceptional exceptions. Auto Manage –Plug & Play hardware –Auto-load balance & placement storage & processing –Simple parallel programming model –Fault masking
26 How do I program 10,000 nodes? You can't program 10,000 x (for any x). They program themselves. –You write embarrassingly parallel programs –Examples: SQL, Web, Google, Inktomi, HotMail,…. –PVM and MPI prove it must be automatic (unless you have a PhD)! Auto Parallelism is ESSENTIAL
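An embarrassingly parallel program has the shape sketched below: the same function runs independently on every partition and the partial results are combined at the end. Here a thread pool stands in for the nodes, purely to show the structure; in CPython threads do not give CPU parallelism, and the partitioning, names, and word-count task are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(partition, word):
    """Work done locally on one node: count occurrences of word in its lines."""
    return sum(line.split().count(word) for line in partition)


def parallel_count(partitions, word, workers=4):
    """Run count_matches on every partition independently, then sum the partials."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_matches, partitions, [word] * len(partitions))
        return sum(partials)


partitions = [
    ["disk arm disk", "tape"],   # data held by node 0
    ["brick disk"],              # data held by node 1
    ["archive tape"],            # data held by node 2
]
print(parallel_count(partitions, "disk"))
```

No partition needs anything from another, so adding nodes only adds more independent calls; the programmer writes the per-partition function and the combining step, not the distribution.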
27 Summary. Disks will become supercomputers, so: –Lots of computing to optimize the arm –Can put app close to the data (better modularity, locality) –Storage appliances (self-organizing). The arm/capacity tradeoff: waste space to save access. –Compression (saves bandwidth) –Mirrors –Online backup/restore –Online archive (vault to other drives or geoplex if possible). Disks do not replace tapes: storage appliances replace tapes. Self-organizing storage servers (file systems) (prototypes of this software exist)