Data Centric Computing Yotta Zetta Exa Peta Tera Giga Mega Kilo Jim Gray Microsoft Research Research.Microsoft.com/~Gray/talks FAST 2002 Monterey, CA, 14 Oct 1999
Sub-Title: Put Everything in Future (Disk) Controllers (it’s not “if”, it’s “when?”) Jim Gray Microsoft Research http://Research.Microsoft.com/~Gray/talks FAST 2002 Monterey, CA, 14 Oct 1999
Acknowledgements: Dave Patterson explained this to me long ago. Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.
BARC started in 1995 with Jim Gray and Gordon Bell. We are part of Microsoft Research with a focus on Scaleable Servers (Gray, Barrera, Barclay, Slutz, VanIngen) and Telepresence (Bell, Gemmell). In 1996 we grew to a staff of 6 and moved to our current location in downtown San Francisco (at the east end of Silicon Gulch). We have close ties to the SQL, MTS, NT, PowerPoint, and NetMeeting groups. We also collaborate with UC Berkeley, Cornell, and Wisconsin on Scaleable computing, and with UC Berkeley and U. Virginia on Telepresence. Each summer we host two interns. Our web site is http://www.research.Microsoft.com/BARC BARC is located at 301 Howard St, #830, San Francisco, CA 94105. Humor: our next door neighbor is the Justice Department (Environmental Division). So the sign in the lobby reads: Microsoft 830 <= Justice Department 870 =>
First Disk 1956 IBM 305 RAMAC 4 MB 50x24” disks 1200 rpm 100 ms access 35k$/y rent Included computer & accounting software (tubes not transistors)
10 years later 1.6 meters
Disk Evolution Kilo Mega Giga Tera Peta Exa Zetta Yotta Capacity:100x in 10 years 1 TB 3.5” drive in 2005 20 GB as 1” micro-drive System on a chip High-speed SAN Disk replacing tape Disk is super computer!
Disks are becoming computers Smart drives Camera with micro-drive Replay / Tivo / Ultimate TV Phone with micro-drive MP3 players Tablet Xbox Many more… Applications Web, DBMS, Files OS Disk Ctlr + 1Ghz cpu+ 1GB RAM Comm: Infiniband, Ethernet, radio…
Data Gravity Processing Moves to Transducers smart displays, microphones, printers, NICs, disks Processing decentralized Moving to data sources Moving to power sources Moving to sheet metal ? The end of computers ? ASIC Today: P=50 mips M= 2 MB In a few years P= 500 mips M= 256 MB Storage Network Display
It’s Already True of Printers: Peripheral = CyberBrick. You buy a printer. You get several network interfaces, a Postscript engine (cpu, memory, software, a spooler (soon)), and… a print engine.
The (absurd?) consequences of Moore’s Law 256 way nUMA? Huge main memories: now: 500MB - 64GB memories then: 10GB - 1TB memories Huge disks now: 20-200 GB 3.5” disks then: .1 - 1 TB disks Petabyte storage farms (that you can’t back up or restore). Disks >> tapes “Small” disks: One platter one inch 10GB SAN convergence 1 GBps point to point is easy 1 GB RAM chips MAD at 200 Gbpsi Drives shrink one quantum 10 GBps SANs are ubiquitous 1 bips cpus for 10$ 10 bips cpus at high end
The Absurd Design? Further segregate processing from storage Poor locality Much useless data movement Amdahl’s laws: bus: 10 B/ips io: 1 b/ips Disks RAM ~ 1 TB Processors 100 GBps 10 TBps ~ 1 Tips ~ 100TB
What’s a Balanced System? (40+ disk arms / cpu) System Bus PCI Bus
Amdahl’s Balance Laws Revised. Laws right, just need “interpretation” (imagination?). Balanced System Law: A system needs 8 MIPS/MBpsIO, but instruction rate must be measured on the workload. Sequential workloads have low CPI (clocks per instruction); random workloads tend to have higher CPI. Alpha (the MB/MIPS ratio) is rising from 1 to 6. This trend will likely continue. One random IO per 50k instructions. Sequential IOs are larger: one sequential IO per 200k instructions.
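The balance laws on this slide reduce to two divisions. The sketch below assumes that "8 MIPS/MBpsIO" means 8 instructions per byte of IO and uses a 1 bips (1000 mips) processor as the example; the function names are illustrative and do not come from the talk.

```python
# Sketch of the balance arithmetic; names are hypothetical, not from the talk.

def balanced_io_mbps(mips: float, ins_per_io_byte: float = 8.0) -> float:
    """IO bandwidth (MB/s) that keeps a processor of `mips` million ins/s busy."""
    return mips / ins_per_io_byte


def ios_per_second(mips: float, ins_per_io: float) -> float:
    """IO rate implied by one IO per `ins_per_io` instructions."""
    return mips * 1e6 / ins_per_io


if __name__ == "__main__":
    mips = 1000.0  # assumed example: a 1 bips processor
    print(balanced_io_mbps(mips))         # MB/s of IO at 8 ins/byte
    print(ios_per_second(mips, 50_000))   # random IOs per second
    print(ios_per_second(mips, 200_000))  # sequential IOs per second
```

Under these assumptions a 1 bips processor is balanced by roughly 125 MB/s of IO, which is far more than one disk arm delivers.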
Observations re TPC C, H systems More than ½ the hardware cost is in disks Most of the mips are in the disk controllers 20 mips/arm is enough for tpcC 50 mips/arm is enough for tpcH Need 128MB to 256MB/arm Ref: Gray& Shenoy: “Rules of Thumb…” Keeton, Riedel, Uysal, PhD thesis. ? The end of computers ?
TPC systems
Normalize for CPI (clocks per instruction)
TPC-C has about 7 ins/byte of IO
TPC-H has 3 ins/byte of IO
TPC-H needs ½ as many disks, sequential vs random
Both use 9GB 10 krpm disks (need arms, not bytes)

                     MHz/cpu  CPI  mips  KB/IO  IO/s/disk  Disks  Disks/cpu  MB/s/cpu  Ins/IO Byte
Amdahl                     1    1     1      6                                                   8
TPC-C = random           550  2.1   262      8        100    397         50        40            7
TPC-H = sequential       550  1.2   458     64        100    176         22       141            3
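The derived columns of the TPC table follow from the measured ones. This is a minimal sketch of that arithmetic, assuming decimal units (1 MB = 1000 KB) and that mips = MHz / CPI; the helper names are illustrative.

```python
# Recompute the derived columns of the TPC table from the measured ones.
# Helper names are hypothetical; units are assumed decimal (1 MB = 1000 KB).

def mips(mhz: float, cpi: float) -> float:
    """Million instructions per second for a clock rate and clocks per instruction."""
    return mhz / cpi


def mb_per_s_per_cpu(disks_per_cpu: int, ios_per_s_per_disk: int, kb_per_io: int) -> float:
    """IO bandwidth per cpu in MB/s."""
    return disks_per_cpu * ios_per_s_per_disk * kb_per_io / 1000.0


def ins_per_io_byte(mips_value: float, mb_per_s: float) -> float:
    """Instructions executed per byte of IO."""
    return mips_value / mb_per_s


if __name__ == "__main__":
    for name, cpi, kb, disks in (("TPC-C", 2.1, 8, 50), ("TPC-H", 1.2, 64, 22)):
        m = mips(550, cpi)
        bw = mb_per_s_per_cpu(disks, 100, kb)
        print(name, round(m), round(bw), round(ins_per_io_byte(m, bw)))
```

The rounded results line up with the mips, MB/s/cpu, and Ins/IO Byte columns for both benchmarks.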
TPC systems: What’s alpha (= MB/MIPS)?
Hard to say: Intel 32 bit addressing (= 4GB limit). Known CPI. IBM, HP, Sun have 64 GB limit. Unknown CPI. Look at both, guess CPI for IBM, HP, Sun.
Alpha is between 1 and 6.
Mips / Memory / Alpha:
Amdahl: 1
tpcC Intel: 8x262 = 2Gips, 4GB, 2
tpcH Intel: 8x458 = 4Gips
tpcC IBM: 24 cpus ?= 12 Gips, 64GB, 6
tpcH HP: 32 cpus ?= 16 Gips, 32 GB
When each disk has 1bips, no need for ‘cpu’
Implications
Conventional: Offload device handling to NIC/HBA. Higher level protocols: I2O, NASD, VIA, IP, TCP… SMP and Cluster parallelism is important.
Radical: Move app to NIC/device controller. Higher-higher level protocols: CORBA / COM+. Cluster parallelism is VERY important.
Figure labels: Central Processor & Memory; Terabyte/s Backplane
Interim Step: Shared Logic. Brick with 8-12 disk drives, 200 mips/arm (or more), 2xGbpsEthernet, General purpose OS (except NetApp), 10k$/TB to 50k$/TB. Shared: sheet metal, power, support/config, security, network ports. Examples: Snap™ ~1TB 12x80GB NAS; NetApp™ ~.5TB 8x70GB NAS; Maxtor™ ~2TB 12x160GB NAS
Next step in the Evolution Disks become supercomputers Controller will have 1bips, 1 GB ram, 1 GBps net And a disk arm. Disks will run full-blown app/web/db/os stack Distributed computing Processors migrate to transducers.
Gordon Bell’s Seven Price Tiers 10$: wrist watch computers 100$: pocket/ palm computers 1,000$: portable computers 10,000$: personal computers (desktop) 100,000$: departmental computers (closet) 1,000,000$: site computers (glass house) 10,000,000$: regional computers (glass castle) Super-Server: Costs more than 100,000 $ “Mainframe” Costs more than 1M$ Must be an array of processors, disks, tapes comm ports
Bell’s Evolution of Computer Classes. Technology enables two evolutionary paths: 1. constant performance, decreasing cost 2. constant price, increasing performance. [Figure: Log Price vs Time for Mainframes (central), Minis (dep’t.), WSs, PCs (personals)] 1.26 = 2x/3 yrs -- 10x/decade; 1/1.26 = .8; 1.6 = 4x/3 yrs -- 100x/decade; 1/1.6 = .62
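The two annual factors on this slide can be checked by compounding. This is a small sketch; the function name is illustrative.

```python
# Compound an annual improvement factor over a number of years.
# `growth_over` is a hypothetical helper name.

def growth_over(annual_factor: float, years: int) -> float:
    """Total improvement after `years` at `annual_factor` per year."""
    return annual_factor ** years


if __name__ == "__main__":
    for factor in (1.26, 1.6):
        print(factor,
              round(growth_over(factor, 3), 2),   # improvement in 3 years
              round(growth_over(factor, 10), 1),  # improvement in a decade
              round(1 / factor, 2))               # yearly price at constant performance
```

A factor of 1.26 per year gives about 2x in three years and 10x per decade; 1.6 per year gives about 4x in three years and roughly 100x per decade.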
NAS vs SAN: High level Interfaces are better
Network Attached Storage: File servers, Database servers, Application servers (it’s a slippery slope: as Novell showed)
Storage Area Network: A lower life form. Block server: get block / put block. Wrong abstraction level (too low level). Security is VERY hard to understand (who can read that disk block?). SCSI and iSCSI are popular.
How Do They Talk to Each Other? Each node has an OS Each node has local resources: A federation. Each node does not completely trust the others. Nodes use RPC to talk to each other WebServices/SOAP? CORBA? COM+? RMI? One or all of the above. Huge leverage in high-level interfaces. Same old distributed system story. Applications Applications datagrams streams RPC ? ? RPC streams datagrams SIO SIO SAN
Basic Argument for x-Disks Future disk controller is a super-computer. 1 bips processor 256 MB dram 1 TB disk plus one arm Connects to SAN via high-level protocols RPC, HTTP, SOAP, COM+, Kerberos, Directory Services,…. Commands are RPCs management, security,…. Services file/web/db/… requests Managed by general-purpose OS with good dev environment Move apps to disk to save data movement need programming environment in “controller”
The Slippery Slope: Nothing = Sector Server; Something = Fixed App Server; Everything = App Server. If you add function to server, then you add more function to server. Function gravitates to data.
Why Not a Sector Server? (let’s get physical!) Good idea, that’s what we have today. But: cache added for performance; sector remap added for fault tolerance; error reporting and diagnostics added; SCSI commands (reserve, ...) are growing; sharing problematic (space mgmt, security,…). Slipping down the slope to a 1-D block server.
Why Not a 1-D Block Server? Put A LITTLE on the Disk Server Tried and true design HSC - VAX cluster EMC IBM Sysplex (3980?) But look inside Has a cache Has space management Has error reporting & management Has RAID 0, 1, 2, 3, 4, 5, 10, 50,… Has locking Has remote replication Has an OS Security is problematic Low-level interface moves too many bytes
Why Not a 2-D Block Server? Put A LITTLE on the Disk Server Tried and true design Cedar -> NFS file server, cache, space,.. Open file is many fewer msgs Grows to have Directories + Naming Authentication + access control RAID 0, 1, 2, 3, 4, 5, 10, 50,… Locking Backup/restore/admin Cooperative caching with client
Why Not a File Server? Put a Little on the 2-D Block Server Tried and true design NetWare, Windows, Linux, NetApp, Cobalt, SNAP,... WebDav Yes, but look at NetWare File interface grew Became an app server Mail, DB, Web,…. Netware had a primitive OS Hard to program, so optimized wrong thing
Why Not Everything? Allow Everything on Disk Server (thin clients) Tried and true design Mainframes, Minis, ... Web servers,… Encapsulates data Minimizes data moves Scaleable It is where everyone ends up. All the arguments against are short-term.
Disk = Node has magnetic storage (1TB?) has processor & DRAM has SAN attachment has execution environment Applications Services DBMS RPC, ... File System SAN driver Disk driver OS Kernel
Hardware: Homogeneous machines lead to quick response through reallocation. HP desktop machines, 320MB RAM, 3u high, 4 x 100GB IDE Drives. $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB. 3 weeks from ordering to operational. Slide courtesy of Brewster Kahle, @ Archive.org
Disk as Tape Tape is unreliable, specialized, slow, low density, not improving fast, and expensive Using removable hard drives to replace tape’s function has been successful When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used. Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good. Slide courtesy of Brewster Kahle, @ Archive.org
Disk As Tape: What format? Today I send NTFS/SQL disks. But that is not a good format for Linux. Solution: Ship NFS/CIFS/ODBC servers (not disks) Plug “disk” into LAN. DHCP then file or DB server via standard interface. Web Service in long term
Some Questions Will the disk folks deliver? What is the product? How do I manage 1,000 nodes (disks)? How do I program 1,000 nodes (disks)? How does RAID work? How do I backup a PB? How do I restore a PB?
Will the disk folks deliver? Maybe! Hard Drive Unit Shipments Source: DiskTrend/IDC Not a pretty picture (lately)
Most Disks are Personal 85% of disks are desktop/mobile (not SCSI) Personal media is AT LEAST 50% of the problem. How to manage your shoebox of: Documents Voicemail Photos Music Videos
What is the Product? (see next section on media management) Concept: Plug it in and it works! Music/Video/Photo appliance (home) Game appliance “PC” File server appliance Data archive/interchange appliance Web appliance Email appliance Application appliance Router appliance network power
Auto Manage Storage Admin cost >> storage cost !!!! 1980 rule of thumb: A DataAdmin per 10GB, SysAdmin per mips 2000 rule of thumb A DataAdmin per 5TB SysAdmin per 100 clones (varies with app). Problem: 5TB is 50k$ today, 5k$ in a few years. Admin cost >> storage cost !!!! Challenge: Automate ALL storage admin tasks
How do I manage 1,000 nodes? You can’t manage 1,000 x (for any x). They manage themselves. You manage exceptional exceptions. Auto Manage Plug & Play hardware Auto-load balance & placement storage & processing Simple parallel programming model Fault masking Some positive signs: Few admins at Google 10k nodes 2 PB , Yahoo! ? nodes, 0.3 PB, Hotmail 10k nodes, 0.3 PB
How do I program 1,000 nodes? You can’t program 1,000 x (for any x). They program themselves. You write embarrassingly parallel programs Examples: SQL, Web, Google, Inktomi, HotMail,…. PVM and MPI prove it must be automatic (unless you have a PhD)! Auto Parallelism is ESSENTIAL
Plug & Play Software RPC is standardizing: (SOAP/HTTP, COM+, RMI/IIOP) Gives huge TOOL LEVERAGE Solves the hard problems: naming, security, directory service, operations,... Commoditized programming environments FreeBSD, Linux, Solaris,…+ tools NetWare + tools WinCE, WinNT,…+ tools JavaOS + tools Apps gravitate to data. General purpose OS on dedicated ctlr can run apps.
It’s Hard to Archive a Petabyte It takes a LONG time to restore it. At 1GBps it takes 12 days! Store it in two (or more) places online (on disk?). A geo-plex Scrub it continuously (look for errors) On failure, use other copy until failure repaired, refresh lost copy from safe copy. Can organize the two copies differently (e.g.: one by time, one by space)
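The 12-day figure is a single division. The sketch below assumes decimal units (1 PB = 10^15 bytes, 1 GBps = 10^9 bytes/s) and a transfer that runs at the full rate the whole time; the names are illustrative.

```python
# Time to move a data set at a sustained rate; names are hypothetical.

PETABYTE = 10**15          # bytes, decimal units assumed
GBPS = 10**9               # bytes per second
SECONDS_PER_DAY = 86_400


def transfer_days(total_bytes: int, bytes_per_second: int) -> float:
    """Days needed to move `total_bytes` at `bytes_per_second`."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(transfer_days(PETABYTE, GBPS), 1))  # days to restore 1 PB at 1 GBps
```

The result is a little under 12 days, and any slowdown or retry during the restore only makes it longer, which is the argument for keeping a second online copy.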
Disk vs Tape (Guesstimates)
Disk: 160 GB; 25 MBps; 5 ms seek time; 3 ms rotate latency; 2$/GB for drive; 1$/GB for ctlrs/cabinet; 4 TB/rack
Tape: 100 GB; 10 MBps; 30 sec pick time; many minute seek time; 5$/GB for media; 10$/GB for drive+library; 10 TB/rack
Cern: 200 TB 3480 tapes; 2 col = 50GB; Rack = 1 TB = 20 drives
The price advantage of tape is narrowing, and the performance advantage of disk is growing
I’m a disk bigot. I hate tape, tape hates me. Tape: unreliable hardware, unreliable software, poor human factors, terrible latency, bandwidth. Disk: much easier to use, much faster, cheaper! But needs new concepts.
Disk as Tape Challenges Offline disk (safe from virus) Trivialize Backup/Restore software Things never change Just object versions Snapshot for continuous change (databases) RAID in a SAN (cross-disk journaling) Massive replication (a la Farsite)
Summary Disks will become supercomputers Compete in Linux appliance space Build best NAS software (compete with NetApp, ..) Auto-manage huge storage farms FarSite, SQL autoAdmin++,… Build world’s best disk-based backup system Including Geoplex (compete with Veritas,..) Push faster on 64-bit
Storage capacity beating Moore’s law 2 k$/TB today (raw disk) 1k$/TB by end of 2002
Trends: Magnetic Storage Densities Amazing progress Ratios have changed Capacity grows 60%/y Access speed grows 10x more slowly
Trends: Density Limits. The end is near! Products: 23 Gbpsi. Lab: 50 Gbpsi. “Limit”: 60 Gbpsi. But limit keeps rising & there are alternatives (?: NEMS, Fluorescent? Holographic, DNA?)
[Figure: Bit Density vs Time, 1990 to 2008, in b/µm2 and Gb/in2, showing CD, DVD, and ODD against the Wavelength Limit and the Superparamagnetic Limit]
Figure adapted from Franco Vitaliano, “The NEW new media: the growing attraction of nonmagnetic storage”, Data Storage, Feb 2000, pp 21-32, www.datastorage.com
CyberBricks Disks are becoming supercomputers. Each disk will be a file server then SOAP server Multi-disk bricks are transitional Long-term brick will have OS per disk. Systems will be built from bricks. There will also be Network Bricks Display Bricks Camera Bricks ….
Communications Excitement!!
                 Point-to-Point         Broadcast
Immediate        conversation, money    lecture, concert     (Net Work + DB)
Time Shifted     mail                   book, newspaper      (Data Base)
It’s ALL going electronic. Information is being stored for analysis (so ALL database). Analysis & Automatic Processing are being added. Slide borrowed from Craig Mundie
Information Excitement! But comm just carries information Real value added is information capture & render speech, vision, graphics, animation, … Information storage retrieval, Information analysis
Information At Your Fingertips All information will be in an online database (somewhere) You might record everything you read: 10MB/day, 400 GB/lifetime (5 disks today) hear: 400MB/day, 16 TB/lifetime (2 disks/year today) see: 1MB/s, 40GB/day, 1.6 PB/lifetime (150 disks/year maybe someday) Data storage, organization, and analysis is challenge. text, speech, sound, vision, graphics, spatial, time… Information at Your Fingertips Make it easy to capture Make it easy to store & organize & analyze Make it easy to present & access
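The three lifetime totals on this slide share one multiplier. The sketch below assumes a lifetime of 40,000 days (about 110 years); that value is inferred from the slide's own ratios and is not stated in the talk. The names are illustrative.

```python
# Lifetime data volumes from daily capture rates.
# LIFETIME_DAYS is an assumption inferred from the slide's ratios (400 GB / 10 MB per day).

MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15
LIFETIME_DAYS = 40_000  # assumed; roughly 110 years


def lifetime_bytes(bytes_per_day: int) -> int:
    """Total bytes captured over the assumed lifetime."""
    return bytes_per_day * LIFETIME_DAYS


if __name__ == "__main__":
    print(lifetime_bytes(10 * MB) / GB)   # read: GB per lifetime
    print(lifetime_bytes(400 * MB) / TB)  # hear: TB per lifetime
    print(lifetime_bytes(40 * GB) / PB)   # see: PB per lifetime
    # 1 MB/s yielding 40 GB/day implies this many hours of capture per day:
    print(round(40 * GB / MB / 3600, 1))
```

On this reading, the "see" rate of 1 MB/s producing 40 GB/day corresponds to about 11 hours of recording per day.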
How much information is there? Soon everything can be recorded and indexed. Most bytes will never be seen by humans. Data summarization, trend detection, anomaly detection are key technologies. See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html See Lyman & Varian: How much information http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: scale from Kilo to Yotta (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta) marking A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, Everything! Recorded]
24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
Why Put Everything in Cyberspace? Low rent min $/byte Shrinks time now or later Shrinks space here or there Automate processing knowbots Point-to-Point OR Broadcast Immediate OR Time Delayed Locate Process Analyze Summarize
Disk Storage Cheaper than Paper
File Cabinet: cabinet (4 drawer) 250$; paper (24,000 sheets) 250$; space (2x3 @ 10$/ft2) 180$; total 700$ = 3 ¢/sheet
Disk: disk (160 GB) 300$. ASCII: 100 m pages, 0.0003 ¢/sheet (10,000x cheaper). Image: 1 m photos, 0.03 ¢/sheet (100x cheaper)
Store everything on disk
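The per-sheet costs can be recomputed from the slide's inputs. In this sketch the function name is illustrative; taking a 300$ disk at face value, the ASCII case comes to about 0.0003 ¢/sheet, which is consistent with the "10,000x cheaper" ratio.

```python
# Cost per sheet for paper versus disk, from the slide's inputs.
# `cents_per_sheet` is a hypothetical helper name.

def cents_per_sheet(total_dollars: float, sheets: int) -> float:
    """Cost in cents of storing one sheet."""
    return 100.0 * total_dollars / sheets


paper = cents_per_sheet(250 + 250 + 180, 24_000)  # cabinet + paper + floor space
ascii_text = cents_per_sheet(300, 100_000_000)    # 100 m ASCII pages on one disk
image = cents_per_sheet(300, 1_000_000)           # 1 m photos on one disk

if __name__ == "__main__":
    print(round(paper, 2), ascii_text, image)
    print(round(paper / ascii_text), round(paper / image))  # how many times cheaper
```

The ratios come out near 10,000x for text and near 100x for images, matching the slide's rounded claims.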
Gordon Bell’s MainBrain™ Digitize Everything. A BIG shoebox? Scans: 20 k “pages” tiff @ 300 dpi, 1 GB. Music: 2 k “tracks”, 7 GB. Photos: 13 k images, 2 GB. Video: 10 hrs, 3 GB. Docs: 3 k (ppt, word,..), 2 GB. Mail: 50 k messages, 1 GB. Total: 16 GB
Gary Starkweather Scan EVERYTHING 400 dpi TIFF 70k “pages” ~ 14GB OCR all scans (98% recognition ocr accuracy) All indexed (5 second access to anything) All on his laptop.
Q: What happens when the personal terabyte arrives? A: Things will run SLOWLY…. unless we add good software
Summary Disks will morph to appliances Main barriers to this happening Lack of Cool Apps Cost of Information management
1 TB The “Absurd” Disk 2.5 hr scan time (poor sequential access) 1 aps / 5 GB (VERY cold data) It’s a tape! 1 TB 100 MB/s 200 Kaps
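Both numbers on this slide follow from the drive's capacity, bandwidth, and access rate. The sketch below reads "200 Kaps" as 200 kilobyte-sized accesses per second, which is an interpretation of the abbreviation; the names are illustrative.

```python
# Scan time and access density for the 1 TB drive on this slide.
# Assumes "200 Kaps" means 200 small accesses per second; names are hypothetical.

def scan_hours(capacity_bytes: float, bytes_per_second: float) -> float:
    """Hours to read the whole drive sequentially."""
    return capacity_bytes / bytes_per_second / 3600


def gb_per_access_per_second(capacity_bytes: float, accesses_per_second: float) -> float:
    """GB of stored data behind each access/second the arm can deliver."""
    return capacity_bytes / accesses_per_second / 1e9


if __name__ == "__main__":
    print(round(scan_hours(1e12, 100e6), 2))      # hours for a full scan
    print(gb_per_access_per_second(1e12, 200))    # GB per access/second
```

The plain division gives a scan time of about 2.8 hours, in the neighbourhood of the slide's 2.5 hr, and one access per second for every 5 GB stored.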
Crazy Disk Ideas Disk Farm on a card: surface mount disks Disk (magnetic store) on a chip: (micro machines in Silicon) Full Apps (e.g. SAP, Exchange/Notes,..) in the disk controller (a processor with 128 MB dram) ASIC The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail Clayton M. Christensen .ISBN: 0875845851
The Disk Farm On a Card The 500GB disc card An array of discs Can be used as 100 discs 1 striped disc 50 Fault Tolerant discs ....etc LOTS of accesses/second bandwidth 14"
Trends: promises NEMS (Nano Electro Mechanical Systems) (http://www.nanochip.com/) also Cornell, IBM, CMU,… 250 Gbpsi by using tunneling electron microscope. Disk replacement. Capacity: 180 GB now, 1.4 TB in 2 years. Transfer rate: 100 MB/sec R&W. Latency: 0.5msec. Power: 23W active, .05W Standby. 10k$/TB now, 2k$/TB in 2004
Trends: Gilder’s Law: 3x bandwidth/year for 25 more years Today: 40 Gbps per channel (λ) 12 channels per fiber (wdm): 500 Gbps 32 fibers/bundle = 16 Tbps/bundle In lab 3 Tbps/fiber (400 x WDM) In theory 25 Tbps per fiber 1 Tbps = USA 1996 WAN bisection bandwidth Aggregate bandwidth doubles every 8 months! 1 fiber = 25 Tbps
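The doubling time and the per-fiber and per-bundle totals can be checked directly. This is a sketch with illustrative names; it assumes growth is smooth and exponential.

```python
import math

# Doubling time for a yearly growth factor, and aggregate fiber bandwidth.
# Function names are hypothetical.

def doubling_months(annual_factor: float) -> float:
    """Months for bandwidth to double at `annual_factor` growth per year."""
    return 12 * math.log(2) / math.log(annual_factor)


def aggregate_gbps(channel_gbps: float, channels: int, fibers: int = 1) -> float:
    """Total bandwidth of `fibers` fibers carrying `channels` wavelengths each."""
    return channel_gbps * channels * fibers


if __name__ == "__main__":
    print(round(doubling_months(3), 1))   # months to double at 3x/year
    print(aggregate_gbps(40, 12))         # Gbps per fiber
    print(aggregate_gbps(40, 12, 32))     # Gbps per 32-fiber bundle
```

Tripling every year corresponds to doubling about every 7.6 months, and 12 channels of 40 Gbps give 480 Gbps per fiber, which the slide rounds to 500 Gbps and 16 Tbps per bundle.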
Technology Drivers: What if Networking Was as Cheap As Disk IO? TCP/IP Unix/NT 100% cpu @ 40MBps Disk Unix/NT 8% cpu @ 40MBps Why the Difference? Host Bus Adapter does SCSI packetizing, checksum,… flow control DMA Host does TCP/IP packetizing, small buffers
SAN: Standard Interconnect
Gbps Ethernet: 110 MBps; PCI: 70 MBps; UW Scsi: 40 MBps; FW scsi: 20 MBps; scsi: 5 MBps
LAN faster than memory bus? 1 GBps links in lab. 100$ port cost soon. Port is computer.
RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?
Building a Petabyte Store EMC ~ 500k$/TB = 500M$/PB plus FC switches plus… 800M$/PB TPC-C SANs (Dell 18GB/…) 62 M$/PB Dell local SCSI, 3ware 20M$/PB Do it yourself: 5M$/PB
The Cost of Storage (heading for 1K$/TB soon) 12/1/1999 9/1/2000 9/1/2001
Cheap Storage or Balanced System Low cost storage (2 x 1.5k$ servers) 6K$ TB 2x (1K$ system + 8x80GB disks + 100MbEthernet) Balanced server (7k$/.5 TB) 2x800Mhz (2k$) 256 MB (400$) 8 x 80 GB drives (2K$) Gbps Ethernet + switch (1k$) 11k$ TB, 22K$/RAIDED TB 2x800 Mhz 256 MB
320 GB, 2k$ (now): 4x80 GB IDE (2 hot pluggable) (1,000$); SCSI-IDE bridge 200$; Box (500 Mhz cpu, 256 MB SRAM, fan, power, Enet) 700$. Or 8 disks/box: 640 GB for ~3K$ (or 300 GB RAID)
Hot Swap Drives for Archive or Data Interchange 25 MBps write (so can write N x 160 GB in 3 hours) 160 GB/overnite = ~N x 4 MB/second @ 19.95$/nite
Data delivery costs 1$/GB today Rent for “big” customers: 300$/megabit per second per month Improved 3x in last 6 years (!). That translates to 1$/GB at each end. You can mail a 160 GB disk for 20$. That’s 16x cheaper If overnight it’s 3 MBps. 3x160 GB ~ ½ TB
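The network and mail costs compare as follows. The sketch assumes a 30-day month, a link that is fully used, and decimal units; the names are illustrative.

```python
# Cost per GB over a rented link versus mailing a disk.
# Assumes a 30-day month and a fully utilized link; names are hypothetical.

SECONDS_PER_MONTH = 30 * 86_400


def network_dollars_per_gb(dollars_per_mbps_month: float) -> float:
    """Cost per GB at one end of a link rented by the megabit per second."""
    gb_per_month = 1e6 / 8 * SECONDS_PER_MONTH / 1e9  # bytes moved by 1 Mbps, in GB
    return dollars_per_mbps_month / gb_per_month


def mail_dollars_per_gb(shipping_dollars: float, disk_gb: float) -> float:
    """Cost per GB of mailing one disk."""
    return shipping_dollars / disk_gb


if __name__ == "__main__":
    net = network_dollars_per_gb(300)
    mail = mail_dollars_per_gb(20, 160)
    print(round(net, 2), mail)
    print(2 * 1.0 / mail)  # both ends at the slide's rounded 1$/GB, versus mail
```

One Mbps moves about 324 GB in a month, so 300$ buys roughly 1$/GB at each end; against 2$/GB for both ends, mailing the disk is 16x cheaper.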
Data on Disk Can Move to RAM in 8 years 30:1 6 years
Storage Latency: How Far Away is the Data?
Relative latency   Storage level          Analogy
10^9               Tape/Optical Robot     Andromeda, 2,000 Years
10^6               Disk                   Pluto, 2 Years
100                Memory                 Springfield, 1.5 hr
10                 On Board Cache         This Campus, 10 min
2                  On Chip Cache          This Room
1                  Registers              My Head, 1 min
More Kaps and Kaps/$ but…. Disk accesses got much less expensive Better disks Cheaper disks! But: disk arms are expensive the scarce resource 1 hour Scan vs 5 minutes in 1990 100 GB 30 MB/s
Backup: 3 scenarios Disaster Recovery: Preservation through Replication Hardware Faults: different solutions for different situations Clusters, load balancing, replication, tolerate machine/disk outages (Avoided RAID and expensive, low volume solutions) Programmer Error: versioned duplicates (no deletes)
Online Data Can build 1PB of NAS disk for 5M$ today Can SCAN (read or write) entire PB in 3 hours. Operate it as a data pump: continuous sequential scan Can deliver 1PB for 1M$ over Internet Access charge is 300$/Mbps bulk rate Need to Geoplex data (store it in two places). Need to filter/process data near the source, To minimize network costs.
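The delivery cost and scan time for a petabyte store can be sketched as below. The disk size and speed (160 GB at 25 MBps) are taken from the Disk vs Tape slide and applied here as an assumption, as is the idea that all disks scan in parallel; the names are illustrative.

```python
import math

# Petabyte store: Internet delivery cost and parallel scan time.
# Assumes 160 GB, 25 MBps disks scanned in parallel; names are hypothetical.

def delivery_cost_dollars(total_gb: float, dollars_per_gb: float = 1.0) -> float:
    """Cost to ship `total_gb` over the network at a flat rate per GB."""
    return total_gb * dollars_per_gb


def disks_needed(total_gb: float, disk_gb: float) -> int:
    """Number of disks required to hold `total_gb`."""
    return math.ceil(total_gb / disk_gb)


def parallel_scan_hours(disk_gb: float, disk_mbps: float) -> float:
    """Hours to scan the store when every disk is read at once."""
    return disk_gb * 1e9 / (disk_mbps * 1e6) / 3600


if __name__ == "__main__":
    petabyte_gb = 1_000_000
    print(delivery_cost_dollars(petabyte_gb))     # dollars at 1$/GB
    print(disks_needed(petabyte_gb, 160))         # disks in the farm
    print(round(parallel_scan_hours(160, 25), 2)) # hours per full scan
```

With every arm streaming at once, the scan time is set by a single disk, which is why the whole petabyte fits inside the 3-hour figure.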