Computers for the Post-PC Era
David Patterson, Katherine Yelick
University of California at Berkeley
UC Berkeley IRAM Group and UC Berkeley ISTORE Group
February 2000
Perspective on Post-PC Era
The Post-PC era will be driven by 2 technologies:
1) "Gadgets": tiny embedded or mobile devices, ubiquitous, in everything, e.g., successors to the PDA, the cell phone, wearable computers
2) Infrastructure to support such devices, e.g., successors to big fat web servers and database servers
Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
   - AME: Availability, Maintainability, Evolutionary growth
   - ISTORE's research principles
   - Proposed techniques for achieving AME
   - Benchmarks for AME
   - Conclusions and future work
Intelligent RAM: IRAM
Microprocessor & DRAM on a single chip:
- 10X the capacity vs. on-chip SRAM
- on-chip memory latency improved 5-10X, bandwidth improved as well
- energy efficiency improved 2X-4X (no off-chip bus)
- serial I/O 5-10X vs. buses
- smaller board area/volume
IRAM advantages extend to:
- a single-chip system
- a building block for larger systems
[Diagram: today, logic and DRAM come from separate fabs (billions of dollars for separate fab lines), with processor, caches, bus, DRAM, and I/O on separate chips; a single-chip IRAM puts either the processor in a DRAM fab or the memory in a logic fab.]
New Architecture Directions
"…media processing will become the dominant force in computer architecture and microprocessor design."
"…new media-rich applications … involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, and 32-bit integer and floating-point data."
Needs include: real-time response, continuous-media data types (no temporal locality), fine-grain parallelism, coarse-grain parallelism, memory bandwidth. (A small data-parallel kernel illustrating this is sketched below.)
Source: "How Multimedia Workloads Will Change Processor Design", Diefendorff & Dubey, IEEE Computer (9/97)
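To make "vectors of packed 16-bit integers" concrete, here is a minimal, hypothetical media kernel in Python/NumPy (not from the talk): a brightness adjustment over 16-bit samples. Every element is processed independently, so the same operation maps naturally onto the packed vector operations the quote describes; the function name, gain, and sizes are illustrative only.

```python
import numpy as np

def brighten(samples: np.ndarray, gain: float, offset: int) -> np.ndarray:
    """Scale and offset 16-bit media samples, saturating to the int16 range.

    Each element is independent, so a vector unit could apply the same
    operation to many packed 16-bit values per cycle (fine-grain parallelism).
    """
    wide = samples.astype(np.int32)                   # widen to avoid overflow
    out = (wide * gain).astype(np.int32) + offset
    return np.clip(out, -32768, 32767).astype(np.int16)

# Example: one frame's worth of 16-bit samples (illustrative size)
frame = np.random.randint(-20000, 20000, size=640 * 480, dtype=np.int16)
print(brighten(frame, gain=1.2, offset=500)[:8])
```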
Revive Vector Architecture
Old concerns about vector machines, and the IRAM-era answers:
- Cost: $1M each? -> Single-chip CMOS MPU/IRAM
- Low latency, high-BW memory system? -> IRAM
- Code density? -> Much smaller than VLIW
- Compilers? -> For sale, mature (>20 years); we retarget Cray compilers
- Performance? -> Easy to scale speed with technology
- Power/Energy? -> Parallel to save energy, keep performance
- Limited to scientific applications? -> Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
- Supercomputer industry dead? -> Very attractive to scale; new class of applications
Before, vector machines had a lousy scalar processor; a modest scalar CPU will do well on many programs, and the vector unit will do great on others.
V-IRAM1: Low Power v. High Perf.
[Block diagram: a 2-way superscalar scalar processor (16 KB I-cache, 16 KB D-cache) feeds a vector instruction queue, vector registers, and vector arithmetic and load/store units operating on 4 x 64-bit, 8 x 32-bit, or 16 x 16-bit elements; a memory crossbar switch connects the units to many DRAM banks and to serial I/O. Built in 1-Gbit DRAM technology; to put it in perspective, roughly 10X a Cray T90 today.]
VIRAM-1: System on a Chip
Prototype scheduled for tape-out in mid-2001
- 0.18 um EDL process
- 16 MB DRAM, 8 banks
- MIPS scalar core at 200 MHz; 4 64-bit vector unit pipes/lanes at 200 MHz
- 4 100-MB/s parallel I/O lines
- 17 x 17 mm, 2 Watts
- 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
- 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit) (checked below)
[Floorplan: CPU + caches and crossbar in the middle, 4 vector pipes/lanes, I/O, and two 64-Mbit / 8-MByte memory halves.]
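A back-of-the-envelope check of the headline numbers, under two assumptions not stated on the slide: each 64-bit lane completes one multiply-add (2 ops) per cycle, and the 25.6 GB/s total comes from 2 directions x 2 crossbars at 6.4 GB/s each.

```python
MHZ = 200e6          # clock
LANES = 4            # 64-bit vector pipes/lanes
OPS_PER_LANE = 2     # assumed: one fused multiply-add per lane per cycle

peak_fp64 = MHZ * LANES * OPS_PER_LANE                  # 1.6e9  -> 1.6 GFLOPS (64-bit)
peak_int16 = MHZ * LANES * (64 // 16) * OPS_PER_LANE    # 6.4e9  -> 6.4 GOPS (16-bit)

# 6.4 GB/s per direction per crossbar, x 2 directions x 2 crossbars (assumed)
xbar_bw = 6.4 * 2 * 2                                   # 25.6 GB/s

print(peak_fp64 / 1e9, peak_int16 / 1e9, xbar_bw)       # 1.6  6.4  25.6
```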
Media Kernel Performance
IRAM Chip Challenges
- Merged logic-DRAM process
  - Cost: cost of the wafer, impact on yield, testing cost of logic and DRAM
  - Price: on-chip DRAM vs. separate DRAM chips?
  - Delay in transistor speeds and memory cell sizes in a merged process vs. logic-only or DRAM-only processes
  - DRAM block: flexibility via a DRAM "compiler" (vary size, width, number of subbanks) vs. a fixed block
- Apps: do advantages in memory bandwidth, energy, and system size offset the challenges? Or the speed, area, power, and yield of DRAM in a logic process? Can one portion slow down and the whole still be attractive?
- Testing: time much worse, or better due to BIST?
- DRAM operating at 1 watt: every 10-degree increase in operating temperature doubles the refresh rate; what to do? (a small sketch of this scaling follows)
- IRAM roles: acts as an MP, acts as a cache for real memory, acts as the low part of the physical address space + OS?
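The refresh concern can be made concrete with the rule of thumb quoted above (refresh rate doubles for every 10 degrees of temperature rise, i.e., the refresh period halves). The base period and temperatures below are illustrative assumptions, not VIRAM-1 measurements.

```python
def refresh_period_ms(temp_c: float,
                      base_period_ms: float = 64.0,
                      base_temp_c: float = 45.0) -> float:
    """DRAM refresh period under the 'doubles every 10 C' rule of thumb.

    A doubled refresh *rate* means a halved refresh *period*.
    base_period_ms and base_temp_c are illustrative, not measured values.
    """
    return base_period_ms / (2 ** ((temp_c - base_temp_c) / 10.0))

for t in (45, 55, 65, 85):
    print(t, "C ->", refresh_period_ms(t), "ms between refreshes")
```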
Other examples: IBM “Blue Gene”
- 1 PetaFLOPS in 2003 for $100M?
- Application: protein folding
- Blue Gene chip: 25-32 multithreaded RISC processors + 0.5 MB embedded DRAM per processor + high-speed network interface on a 20 x 20 mm chip; 1 GFLOPS per processor
- 2' x 2' board = 64 chips (1.6K-2K CPUs)
- Rack = 8 boards (512 chips, 13K-16K CPUs)
- System = 512 boards (32-40K chips); total ~1 million processors, 1 MW, in just 2000 sq. ft.
- Since it runs a single app, the system is unbalanced to save money (the arithmetic is checked below)
  - traditional ratios: 1 MIPS : 1 MB memory : 1 Mbit/s I/O
  - Blue Gene ratios: 1 MIPS : well under 1 MB : 0.2 Mbit/s I/O
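A quick sanity check of how the counts fit together, derived only from the numbers on the slide (the per-chip processor count is the stated 25-32 range; everything else is arithmetic).

```python
chips_per_board = 64
boards_per_rack = 8
boards_total = 512
cpus_per_chip = (25, 32)          # range given on the slide
gflops_per_cpu = 1

racks = boards_total // boards_per_rack          # 64 racks
chips = boards_total * chips_per_board           # 32,768 chips (the "32-40K" low end)
cpus = tuple(chips * c for c in cpus_per_chip)   # ~0.8M - 1.0M processors
pflops = tuple(n * gflops_per_cpu / 1e6 for n in cpus)

print(racks, chips, cpus, pflops)                # ~1 PetaFLOPS at the top of the range
```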
Other examples: Sony Playstation 2
- Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
- Superscalar MIPS core + vector coprocessor + graphics/DRAM
- Claim: "Toy Story" realism brought to games
Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
   - AME: Availability, Maintainability, Evolutionary growth
   - ISTORE's research principles
   - Proposed techniques for achieving AME
   - Benchmarks for AME
   - Conclusions and future work
The problem space: big data
- Big demand for enormous amounts of data today
  - high-end enterprise and Internet applications
  - enterprise decision-support, data-mining databases
  - online applications: e-commerce, mail, web, archives
- Future: infrastructure services, richer data
  - computational and storage back-ends for mobile devices
  - more multimedia content
  - more use of historical data to provide better services
- Today's SMP server designs can't easily scale
- Bigger scaling problems than performance!
Lampson: Systems Challenges
- Systems that work
  - meeting their specs
  - always available
  - adapting to a changing environment
  - evolving while they run
  - made from unreliable components
  - growing without practical limit
- Credible simulations or analysis
- Writing good specs
- Testing
- Performance
- Understanding when it doesn't matter
"Computer Systems Research - Past and Future", keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft
Hennessy: What Should the “New World” Focus Be?
- Availability: both appliance and service
- Maintainability: two functions: enhancing availability by preventing failure, and ease of SW and HW upgrades
- Scalability: especially of the service
- Cost: per device and per service transaction
- Performance: remains important, but it's not SPECint
"Back to the Future: Time to Return to Longstanding Problems in Computer Systems?", keynote address, FCRC, May 1999. John Hennessy, Stanford
ISTORE as Storage System of the Future
- Availability, Maintainability, and Evolutionary growth are the key challenges for storage systems
  - maintenance cost is 10X to 100X the purchase cost, so even 2X the purchase cost for 1/2 the maintenance cost wins (worked out below)
  - AME improvement enables even larger systems
- ISTORE has cost-performance advantages
  - better space and power/cooling costs (e.g., at a co-location site)
  - more MIPS, cheaper MIPS, no bus bottlenecks
  - compression reduces network cost, encryption protects data
  - single interconnect, supports evolution of technology
- Match to future software storage services
  - future storage-service software targets clusters
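A minimal cost comparison illustrating the "2X purchase cost for 1/2 maintenance cost wins" claim, using the low end (10X) of the maintenance rule of thumb; the purchase price is an arbitrary placeholder unit.

```python
purchase = 1.0                      # arbitrary unit of purchase cost
maintenance_ratio = 10              # low end of the 10X-100X rule of thumb

conventional = purchase + maintenance_ratio * purchase            # 11.0 total
istore_like = 2 * purchase + (maintenance_ratio / 2) * purchase   # 7.0 total

print(conventional, istore_like, 1 - istore_like / conventional)  # ~36% cheaper overall
```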
Is Maintenance the Key?
Rule of thumb: maintenance cost is 10X to 100X the HW cost.
- VAX crash data from '85 and '93 [Murp95], extrapolated to '01
- System management: N crashes per problem; sysadmin actions
  - actions: parameters set badly, bad configuration, bad application install
- HW/OS went from 70% of crashes in '85 to 28% in '93; in '01, 10%?
[Murp95] Murphy, B. and Gent, T. "Measuring system and software reliability using an automated data collection process." Quality and Reliability Engineering International, vol. 11, no. 5, Sept.-Oct. 1995.
ISTORE-1 hardware platform
- 80-node x86-based cluster, 1.4 TB storage
  - cluster nodes are plug-and-play, intelligent, network-attached storage "bricks"
    - a single field-replaceable unit, to simplify maintenance
  - each node is a full x86 PC with 256 MB DRAM and an 18 GB disk
  - more CPU than NAS; fewer disks per node than a cluster
- Intelligent Disk "Brick": portable-PC CPU (Pentium II/266) + DRAM, half-height canister disk, redundant NICs (4 100 Mb/s links), diagnostic processor
- ISTORE chassis: 80 nodes, 8 per tray; 2 levels of switches (100 Mbit/s and 2 x 1 Gbit/s); environment monitoring: UPS, redundant power supplies, fans, heat and vibration sensors...
ISTORE-1 Brick
- Webster's Dictionary: "brick: a handy-sized unit of building or paving material typically being rectangular and about 2 1/4 x 3 3/4 x 8 inches"
- ISTORE-1 brick: 2 x 4 x 11 inches (about 1.3x the volume; checked below)
- Single physical form factor, fixed cooling requirement, and compatible network interface simplify physical maintenance and scaling over time
- Contents should evolve over time: contains the most cost-effective MPU, DRAM, disk, and a compatible NI
- If useful, there could be special bricks (e.g., DRAM-rich)
- Suggests a network that will last and evolve: Ethernet
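The "1.3x" follows from the two sets of dimensions; a quick check, using nothing beyond the quoted sizes (and reading 1.3x as a volume ratio).

```python
webster_brick = 2.25 * 3.75 * 8.0    # cubic inches, ~67.5
istore_brick = 2.0 * 4.0 * 11.0      # cubic inches, 88
print(istore_brick / webster_brick)  # ~1.30
```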
A glimpse into the future?
- System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
- ISTORE HW in 5-7 years, a 2006 brick:
  - System-on-a-chip integrated with a MicroDrive
  - 9 GB disk, 50 MB/sec from disk
  - connected via crossbar switch
- If low power, 10,000 nodes fit into one rack!
- O(10,000) scale is our ultimate design point
ISTORE-2 Deltas from ISTORE-1
- Upgraded storage brick
  - Pentium III 650 MHz processor
  - two Gb Ethernet copper ports per brick
  - one 2.5" ATA disk (32 GB, 5400 RPM)
  - 2X the DRAM memory
- Geographically dispersed nodes, larger system
  - O(1000) nodes at Almaden, O(1000) at Berkeley
  - or halve into O(500) nodes at each site to simplify the problem of finding space, and show that it works?
- User-supplied UPS support
ISTORE-2 Improvements (1): Operator Aids
- Every field-replaceable unit (FRU) has a machine-readable unique identifier (UID)
  => introspective software can determine whether the storage system is wired properly, both initially and as it evolves
  - Can a switch failure disconnect both copies of some data?
  - Can a power-supply failure disable mirrored disks?
  - The computer checks for wiring errors and informs the operator, vs. management blaming the operator after a failure (a sketch of such a check follows)
  - Leverage IBM Vital Product Data (VPD) technology?
- External status lights per brick: disk active, Ethernet port active, redundant HW active, HW failure, software hiccup, ...
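A minimal sketch of the kind of introspective wiring check described above, assuming a hypothetical inventory that records which switch and power supply each replica of a mirrored pair depends on. The data model and names are illustrative, not ISTORE APIs.

```python
from collections import namedtuple

# Hypothetical FRU inventory entry: which switch and power supply a disk depends on.
Disk = namedtuple("Disk", ["uid", "switch_uid", "psu_uid"])

def single_points_of_failure(mirror_pairs):
    """Report mirrored pairs whose two replicas share a switch or power supply.

    mirror_pairs: iterable of (Disk, Disk) tuples built from machine-readable UIDs.
    """
    problems = []
    for a, b in mirror_pairs:
        if a.switch_uid == b.switch_uid:
            problems.append((a.uid, b.uid, "same switch " + a.switch_uid))
        if a.psu_uid == b.psu_uid:
            problems.append((a.uid, b.uid, "same power supply " + a.psu_uid))
    return problems

pairs = [(Disk("d01", "sw1", "psu1"), Disk("d02", "sw1", "psu2")),   # shares a switch: flagged
         (Disk("d03", "sw1", "psu1"), Disk("d04", "sw2", "psu2"))]   # properly separated
print(single_points_of_failure(pairs))
```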
ISTORE-2 Improvements (2): RAIN
- In ISTORE-1, switches are 1/3 of the space, power, and cost, and that is for just 80 nodes!
- Redundant Array of Inexpensive Disks (RAID): replace large, expensive disks with many small, inexpensive disks, saving volume, power, cost
- Redundant Array of Inexpensive Network switches (RAIN): replace large, expensive switches with many small, inexpensive switches, saving volume, power, cost?
- ISTORE-1 example: replace 2 16-port 1-Gbit switches with a fat tree of 8 8-port switches, or 24 4-port switches?
ISTORE-2 Improvements (3): System Management Language
- Define a high-level, intuitive, non-abstract system management language
  - Goal: large systems managed by part-time operators!
- Language is interpretive for observation, but compiled and error-checked for configuration changes
- Examples of tasks that should be made easy (a sketch of the first one follows):
  - set an alarm if any disk is more than 70% full
  - back up all data in the Philippines site to the Colorado site
  - split the system into protected subregions
  - discover and display the present routing topology
  - show the correlation between brick temperatures and crashes
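No such language exists yet, so this is only a hypothetical sketch of the first example rule, written as Python that a management runtime might compile and error-check; `cluster.disks()` and `alarm()` are invented placeholders, not ISTORE APIs.

```python
FULL_THRESHOLD = 0.70   # "more than 70% full"

def check_disk_capacity(cluster, alarm):
    """Observation-side rule: raise an alarm for every over-full disk.

    cluster.disks() is assumed to yield objects with .brick_uid, .used_bytes,
    and .capacity_bytes; alarm() is an assumed notification hook.
    """
    for disk in cluster.disks():
        utilization = disk.used_bytes / disk.capacity_bytes
        if utilization > FULL_THRESHOLD:
            alarm(f"disk on brick {disk.brick_uid} is {utilization:.0%} full")
```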
ISTORE-2 Improvements (4): Options to Investigate
- TCP/IP hardware accelerator
  - Class 4: hardware state machine
  - ~10 microsecond latency, full Gbit bandwidth, yet full TCP/IP functionality and TCP/IP APIs
- Ethernet sourced in the memory controller (North Bridge)
- A shelf of bricks on researchers' desktops?
- SCSI-over-TCP support
- Integrated UPS
Why is ISTORE-2 a big machine?
- ISTORE is all about managing truly large systems; one needs a large system to discover the real issues and opportunities
  - target: 1K nodes in UCB CS, 1K nodes at IBM ARC
- Large systems attract real applications
  - without real applications, CS research runs open-loop
- The geographical separation of the ISTORE-2 sub-clusters exposes many important issues
  - the network is NOT transparent
  - networked systems fail differently, often insidiously
A Case for Intelligent Storage
Advantages:
- cost of bandwidth
- cost of space
- cost of the storage system vs. cost of the disks
- physical repair, number of spare parts
- cost of processor complexity
- cluster advantages: dependability, scalability
- 1 vs. 2 networks
Cost of Space, Power, Bandwidth
- Co-location sites (e.g., Exodus) offer space, expandable bandwidth, and stable power
- Charge ~$1000/month per rack (~10 sq. ft.)
  - includes one 20-amp circuit per rack; ~$100/month per extra 20-amp circuit per rack
- Bandwidth cost: ~$500 per Mbit/sec per month
Cost of Bandwidth, Safety
- Network bandwidth cost is significant
  - 1000 Mbit/sec, year-round => $6,000,000/year (see the arithmetic below)
- Security will increase in importance for storage service providers
=> Storage systems of the future need greater computing ability
  - compress to reduce the cost of network bandwidth: 3X compression, save ~$4M/year?
  - encrypt to protect information in transit for B2B
=> Increasing processing per disk for future storage applications
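The $6M/year and ~$4M/year savings follow directly from the co-location price quoted two slides earlier (~$500 per Mbit/sec per month); only the 3X compression ratio comes from this slide.

```python
price_per_mbit_month = 500        # dollars, from the co-location pricing slide
bandwidth_mbit = 1000
months = 12

yearly = price_per_mbit_month * bandwidth_mbit * months      # $6,000,000 per year
with_3x_compression = yearly / 3                              # $2,000,000 per year
print(yearly, yearly - with_3x_compression)                   # saves ~$4,000,000 per year
```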
Cost of Space, Power: Sun Enterprise server/array (64 CPUs / 60 disks)
- Sun 10K server (64 CPUs): 70 x 50 x 39 in.
- Sun A3500 array (60 disks): 74 x 24 x 36 in.
- 2 Symmetra UPS (11 KW): 2 x 52 x 24 x 27 in.
ISTORE-1: 2X savings in space
- 1 rack of (big) switches, 1 rack of (old) UPSs, 1 rack for 80 CPUs/disks (3/8 VME rack unit per brick)
ISTORE-2: 8X-16X savings in space?
Space and power cost per year for 1000 disks: Sun $924K, ISTORE-1 $484K, ISTORE-2 $50K
Cost of Storage System v. Disks
- Examples show the cost of the way we build current systems (2 networks, many buses, CPUs, ...):

  System        Date    Cost    Maint.   Disks   Disks/CPU   Disks/IObus
  NCR WM        10/97   $8.3M
  Sun 10k        3/98   $5.2M
  Sun 10k        9/99   $6.2M   $2.1M
  IBM Netinf.    7/00   $7.8M   $1.8M

=> Too complicated, too heterogeneous
- And databases are often CPU- or bus-bound!
- ISTORE disks per CPU:
- ISTORE disks per I/O bus:
Disk Limit: Bus Hierarchy
[Diagram: server CPU and memory connect over the memory bus and an internal I/O bus (PCI) to a RAID controller, which talks over an external I/O bus (SCSI, 15 disks/bus) and a storage area network (FC-AL) to the disk array.]
- Data rate vs. disk rate
  - SCSI: Ultra3 (80 MHz), Wide (16 bit): 160 MByte/s
  - FC-AL: 1 Gbit/s = 125 MByte/s
- Use only ~50% of a bus
  - command overhead (~20%)
  - queuing theory (<70%)
(the per-disk arithmetic is sketched below)
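Putting the numbers on this slide together shows why the bus hierarchy, not the media, becomes the limit. The only assumption beyond the slide is spreading the usable bandwidth evenly over the 15 disks sharing the bus.

```python
scsi_peak = 160.0          # MByte/s, Ultra3 Wide from the slide
usable_fraction = 0.5      # command overhead + queuing keep usage to ~50%
disks_per_bus = 15

usable = scsi_peak * usable_fraction   # ~80 MB/s for the whole bus
per_disk = usable / disks_per_bus      # ~5.3 MB/s per disk
print(usable, per_disk)                # far below a disk's 20-40 MB/s media rate
```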
Physical Repair, Spare Parts
- ISTORE: compatible modules based on a hot-pluggable interconnect (LAN), with few field-replaceable units (FRUs): node, power supplies, switches, network cables
  - replace a node (disk, CPU, memory, NI) if any part of it fails
- Conventional: heterogeneous system with many server modules (CPU, backplane, memory cards, ...) and disk-array modules (controllers, disks, array controllers, power supplies, ...)
  - all components must be stocked somewhere as FRUs
  - Sun Enterprise 10k has ~100 types of spare parts
  - Sun A3500 array has ~12 types of spare parts
ISTORE: Complexity v. Perf
Complexity increase:
- HP PA-8500: issues 4 instructions per clock cycle, 56-instruction out-of-order window, 4-Kbit branch predictor, 9-stage pipeline, 512 KB I-cache, 1024 KB D-cache (>80M transistors just in the caches)
- Intel SA-110: 16 KB I$, 16 KB D$, 1 instruction per cycle, in-order execution, no branch prediction, 5-stage pipeline
Complexity costs in development time, development power, die size, and cost:
- 550 MHz HP PA-8500: 0.25 micron/4M process, $330, 60 Watts
- 233 MHz Intel SA-110: 0.35 micron/3M process, $18, 0.4 Watts
ISTORE: Cluster Advantages
- Architecture that tolerates partial failure
- Automatic hardware redundancy
  - transparent to application programs
- Truly scalable architecture
  - the limits on size today are maintenance costs and floor-space cost, generally NOT capital costs
- As a result, it is THE target architecture for new software apps for the Internet
ISTORE: 1 vs. 2 networks
- Current systems all have a LAN plus a disk interconnect (SCSI, FC-AL)
  - the LAN is improving fastest: most investment, most features
  - SCSI and FC-AL have poor network features, improve slowly, and are relatively expensive in switches and bandwidth
  - FC-AL switches don't interoperate
  - two sets of cables, wiring?
- Why not a single network based on the best HW/SW technology?
  - note: there can still be 2 instances of the network (e.g., external and internal), but only one technology
Initial Applications
- ISTORE is not one super-system that demonstrates all these techniques!
- Initially provide middleware and a library to support AME
- Initial application targets:
  - information retrieval for multimedia data (a home video server via XML storage?): self-scrubbing data structures, structuring performance-robust distributed computation
  - another service: self-scrubbing data structures, online self-testing, statistical identification of normal behavior
(a sketch of a self-scrubbing structure follows)
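"Self-scrubbing data structures" are not specified further in the transcript; the following is only one plausible reading, a minimal sketch of a checksummed record store that periodically re-verifies itself and repairs from a mirror anything that no longer matches its checksum. All names are illustrative.

```python
import zlib

class ScrubbedStore:
    """Toy key-value store that keeps a CRC per record and can scrub itself."""

    def __init__(self):
        self._data = {}      # key -> bytes
        self._crc = {}       # key -> checksum recorded at write time

    def put(self, key, value: bytes):
        self._data[key] = value
        self._crc[key] = zlib.crc32(value)

    def scrub(self, mirror=None):
        """Re-verify every record; repair from a mirror copy if one is given."""
        corrupted = []
        for key, value in self._data.items():
            if zlib.crc32(value) != self._crc[key]:
                corrupted.append(key)
                if mirror is not None and key in mirror._data:
                    self.put(key, mirror._data[key])   # repair from the replica
        return corrupted    # report what the scrub pass found
```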
UCB ISTORE Continued Funding
- New NSF Information Technology Research program, larger funding (>$500K/yr)
  - 1400 letters
  - 920 preproposals
  - 134 full proposals encouraged
  - 240 full proposals submitted
  - 60 funded
- We are 1 of the 60; the grant starts Sept. 2000
NSF ITR Collaboration with Mills
- Mills: a small undergraduate liberal arts college for women, 8 miles south of Berkeley
  - Mills students can take 1 course per semester at Berkeley; hourly shuttle between campuses
  - Mills also has a re-entry MS program for older students
- To increase the number of women in Computer Science (especially African-American women):
  - offer an undergraduate research seminar at Mills
  - a Mills professor leads; Berkeley faculty and grad students help
  - the Mills professor goes to Berkeley for meetings and sabbatical
- Goal: 2X-3X increase in Mills CS students and alumnae going to grad school
- IBM people want to help?
Conclusion: ISTORE as Storage System of the Future
- Availability, Maintainability, and Evolutionary growth are the key challenges for storage systems
  - cost of maintenance is ~10X the cost of purchase, so even 2X the purchase cost for 1/2 the maintenance cost is a good trade
  - AME improvement enables even larger systems
- ISTORE has cost-performance advantages
  - better space and power/cooling costs (e.g., at a co-location site)
  - more MIPS, cheaper MIPS, no bus bottlenecks
  - compression reduces network cost, encryption protects data
  - single interconnect, supports evolution of technology
- Match to the future software service architecture
  - future storage-service software targets clusters
Conclusions (1): ISTORE
- Availability, Maintainability, and Evolutionary growth are key challenges for server systems
  - more important even than performance
- ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
  - via clusters of network-attached, computationally-enhanced storage nodes running distributed code
  - via hardware and software introspection
  - we are currently performing application studies to investigate and compare techniques
- Availability benchmarks: a powerful tool?
  - they revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000
Conclusions (2)
- IRAM is attractive for two Post-PC applications because of its low power, small size, and high memory bandwidth
  - Gadgets: embedded/mobile devices
  - Infrastructure: intelligent storage and networks
- Post-PC infrastructure requires
  - new goals: Availability, Maintainability, Evolution
  - new principles: introspection, performance robustness
  - new techniques: isolation/fault insertion, software scrubbing
  - new benchmarks: measure and compare AME metrics
- [Still just a vision:] the things I've been talking about have not yet been implemented.
Berkeley Future Work
- IRAM: fab and test the chip
- ISTORE:
  - implement AME-enhancing techniques in a variety of Internet, enterprise, and information-retrieval applications
  - select the best techniques and integrate them into a generic runtime system with an "AME API"
  - add maintainability benchmarks: can we quantify the administrative work needed to maintain a certain level of availability?
  - perhaps look at data security via encryption? Even consider denial of service?
The UC Berkeley IRAM/ISTORE Projects: Computers for the PostPC Era
For more information:
Backup Slides (mostly in the area of benchmarking)
Case study: software RAID-5 plus web server
- Linux/Apache vs. Windows 2000/IIS
Why software RAID?
- well-defined availability guarantees
  - a RAID-5 volume should tolerate a single disk failure
  - reduced performance (degraded mode) after a failure
  - may automatically rebuild redundancy onto a spare disk
- simple system
- easy to inject storage faults
Why a web server?
- an application with measurable QoS metrics that depend on RAID availability and performance
Benchmark environment: metrics
- QoS metrics measured
  - hits per second: roughly tracks response time in our experiments
  - degree of fault tolerance in the storage system
- Workload generator and data collector
  - SpecWeb99 web benchmark
    - simulates a realistic high-volume user load
    - mostly static, read-only workload; some dynamic content
  - modified to run continuously and to measure average hits per second over each 2-minute interval (sketched below)
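A minimal sketch of the measurement in the last bullet: bucket request timestamps into 2-minute windows and report average hits per second for each window. The log source in the comment is hypothetical; this is not the actual harness.

```python
from collections import Counter

WINDOW_SECONDS = 120

def hits_per_second(request_times):
    """Average hits/sec for each 2-minute interval of the run.

    request_times: iterable of request timestamps in seconds since the run started.
    Returns a list of (window_index, avg_hits_per_second) pairs.
    """
    counts = Counter(int(t) // WINDOW_SECONDS for t in request_times)
    return [(w, counts[w] / WINDOW_SECONDS) for w in sorted(counts)]

# e.g. hits_per_second(parse_web_log("access.log"))  # parse_web_log is a hypothetical helper
```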
Benchmark environment: faults
- Focus on faults in the storage system (disks)
- How do disks fail? According to the Tertiary Disk project, failures include:
  - recovered media errors
  - uncorrectable write failures
  - hardware errors (e.g., diagnostic failures)
  - SCSI timeouts
  - SCSI parity errors
  - note: no head crashes, no fail-stop failures
Disk fault injection technique
- To inject reproducible failures, we replaced one disk in the RAID with an emulated disk
  - a PC that appears as a disk on the SCSI bus
  - I/O requests are processed in software and reflected to a local disk
  - fault injection is performed by altering SCSI command processing in the emulation software (see the sketch below)
- Types of emulated faults:
  - media errors (transient, correctable, uncorrectable)
  - hardware errors (firmware, mechanical)
  - parity errors
  - power failures
  - disk hangs/timeouts
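A simplified sketch of how "altering SCSI command processing" in such an emulator might look. The command structure, status strings, and the `forward_to_backing_disk` helper are hypothetical stand-ins, not the ASC VirtualSCSI API used in the real setup.

```python
import random

# Hypothetical fault campaign: which fault to inject, and with what probability.
active_fault = {"type": "unrecovered_read_error", "probability": 0.05}

def handle_scsi_command(cmd, forward_to_backing_disk):
    """Process one emulated SCSI command, possibly injecting the active fault.

    cmd is assumed to have .opcode ("READ", "WRITE", ...); forward_to_backing_disk
    performs the real I/O against the emulator's local disk and returns (status, data).
    """
    if active_fault and random.random() < active_fault["probability"]:
        fault = active_fault["type"]
        if fault == "unrecovered_read_error" and cmd.opcode == "READ":
            return ("CHECK_CONDITION", b"")   # report a medium error instead of data
        if fault == "timeout":
            return None                        # never respond; the host sees a SCSI timeout
    return forward_to_backing_disk(cmd)        # normal path: reflect I/O to the local disk
```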
System configuration
- RAID-5 volume: 3 GB capacity, 1 GB used per disk
  - 3 physical disks, 1 emulated disk, 1 emulated spare disk
- Server: AMD K-series PC with DRAM, Linux or Windows 2000, IDE system disk; Adaptec 2940 Fast/Wide SCSI bus (20 MB/sec); RAID data disks: IBM 18 GB, 10K RPM SCSI
- Disk emulator: AMD K-series PC, Windows NT 4.0, ASC VirtualSCSI library, Adaptec 2940, emulator backing disk (NTFS), AdvStor ASC-U2W UltraSCSI; presents the emulated disk and the emulated spare disk
- 2 web clients connected via 100 Mb switched Ethernet
Results: single-fault experiments
- One experiment for each type of fault (15 total)
  - only one fault injected per experiment
  - no human intervention
  - the system is allowed to continue until it stabilizes or crashes
- Four distinct system behaviors observed:
  (A) no effect: the system ignores the fault
  (B) the RAID system enters degraded mode
  (C) the RAID system begins reconstruction onto the spare disk
  (D) system failure (hang or crash)
State of the Art: Ultrastar 72ZX
- 73.4 GB, 3.5-inch disk; 2¢/MB
- 16 MB track buffer
- 11 platters, 22 surfaces; 15,110 cylinders
- 7 Gbit/sq. in. areal density
- 17 watts (idle)
- 0.1 ms controller time
- 5.3 ms average seek (a 1-track seek takes 0.6 ms)
- 3 ms = 1/2 rotation
- 37 to 22 MB/s to the media
Latency = Queuing Time + Controller Time + Seek Time + Rotation Time + Size / Bandwidth
(the first four terms are paid per access, the last per byte; a worked example follows)
[Diagram: embedded processor, track buffer, arm, head, platter, with track, sector, and cylinder labeled.]
(source: 2/14/00)
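A worked example of the latency formula using only the numbers above, for a hypothetical 64 KB read on an otherwise idle disk; queuing time taken as 0 and the transfer rate taken at a 30 MB/s midpoint of the 22-37 MB/s range, both of which are assumptions.

```python
KB = 1024

controller_ms = 0.1
avg_seek_ms = 5.3
half_rotation_ms = 3.0
media_rate_mb_s = 30.0        # assumed midpoint of the 22-37 MB/s media rate
size_bytes = 64 * KB          # hypothetical request size

transfer_ms = size_bytes / (media_rate_mb_s * 1e6) * 1e3
latency_ms = 0.0 + controller_ms + avg_seek_ms + half_rotation_ms + transfer_ms
print(round(transfer_ms, 2), round(latency_ms, 2))   # ~2.18 ms transfer, ~10.6 ms total
```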