Computers for the Post-PC Era


1 Computers for the Post-PC Era
David Patterson, University of California at Berkeley. UC Berkeley IRAM Group and UC Berkeley ISTORE Group. February 2000.

2 Perspective on Post-PC Era
The Post-PC era will be driven by two technologies:
1) "Gadgets": tiny embedded or mobile devices
   - ubiquitous: in everything
   - e.g., successors to the PDA, cell phone, wearable computers
2) Infrastructure to support such devices
   - e.g., successors to big fat web servers, database servers

3 Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
   - AME: Availability, Maintainability, Evolutionary growth
   - ISTORE's research principles
   - Proposed techniques for achieving AME
   - Benchmarks for AME
   - Conclusions and future work

4 Intelligent RAM: IRAM
Microprocessor & DRAM on a single chip:
- 10X capacity vs. SRAM on-chip memory
- on-chip memory latency 5-10X better, much higher bandwidth
- improve energy efficiency 2X-4X (no off-chip bus)
- serial I/O 5-10X vs. buses
- smaller board area/volume
IRAM advantages extend to:
- a single-chip system
- a building block for larger systems
[Diagram: today, a processor with caches is built in a logic fab and DRAM chips in a DRAM fab, connected by a bus and I/O; billions of dollars are spent on separate fab lines for logic and memory. Single chip: either the processor moves into the DRAM process or the memory moves into the logic fab.]

5 Revive Vector Architecture
Objection vs. IRAM answer:
- Cost: $1M each? -> Single-chip CMOS MPU/IRAM
- Low latency, high BW memory system? -> IRAM
- Code density? -> Much smaller than VLIW
- Compilers? -> For sale, mature (>20 years); we retarget Cray compilers
- Performance? -> Easy to scale speed with technology
- Power/Energy? -> Parallel to save energy, keep performance
- Limited to scientific applications? -> Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
- Supercomputer industry dead? -> Very attractive to scale; a new class of applications
[Speaker note: vector machines before had a lousy scalar processor; a modest CPU will do well on many programs, and the vector unit does great on others.]

6 V-IRAM1: Low Power v. High Perf.
[Block diagram: 2-way superscalar vector instruction processor with 16K I-cache, 16K D-cache, vector registers, vector instruction queue, load/store unit, and arithmetic pipes configurable as 4 x 64-bit, 8 x 32-bit, or 16 x 16-bit operations; a memory crossbar switch connects the processor, serial I/O, and parallel I/O to on-chip DRAM banks in 1 Gbit technology. Speaker note: put in perspective, roughly 10X of a Cray T90 today.]

7 VIRAM-1: System on a Chip
- Prototype scheduled for tape-out mid-2000
- 0.18 um EDL process
- 16 MB DRAM, 8 banks
- MIPS scalar core and caches at 200 MHz
- 4 64-bit vector unit lanes at 200 MHz
- 4 100 MB/s parallel I/O lines
- 17x17 mm, 2 Watts
- 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
- 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit)
(A back-of-the-envelope check of these peak figures follows below.)
[Floorplan: CPU + caches and 4 vector pipes/lanes in the center with the crossbar and I/O, flanked by two memory halves of 64 Mbits / 8 MBytes each.]
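As a rough sanity check on those peak numbers, here is a back-of-the-envelope calculation (my own, not from the slides); the assumptions that each lane completes one multiply-add per cycle and that the crossbar has two halves moving 6.4 GB/s per direction are inferred from the figures quoted above.

```python
# Back-of-the-envelope check of the VIRAM-1 peak numbers quoted above.
# Assumptions (mine, not stated on the slide): each 64-bit lane issues one
# fused multiply-add per cycle, and the crossbar has two halves, each
# moving 6.4 GB/s in each direction.

clock_hz = 200e6          # 200 MHz vector unit
lanes_64b = 4             # 4 x 64-bit vector lanes

flops_64b = lanes_64b * 2 * clock_hz          # multiply-add counts as 2 flops
print(flops_64b / 1e9)                        # -> 1.6 GFLOPS (64-bit)

subwords_16b = lanes_64b * 4                  # each lane splits into 4 x 16-bit
ops_16b = subwords_16b * 2 * clock_hz
print(ops_16b / 1e9)                          # -> 6.4 GOPS (16-bit)

xbar_halves, directions, per_dir_gbs = 2, 2, 6.4
print(xbar_halves * directions * per_dir_gbs) # -> 25.6 GB/s memory bandwidth
```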

8 Media Kernel Performance

9 Base-line system comparison
All numbers are in cycles/pixel; the MMX and VIS results assume all data resides in the L1 cache.

10 IRAM Chip Challenges
Merged logic-DRAM process:
- Cost: cost of wafer, impact on yield, testing cost of logic and DRAM
- Price: on-chip DRAM vs. separate DRAM chips?
- Delay in transistor speeds and memory cell sizes in a merged process vs. logic-only or DRAM-only processes
- DRAM block: flexibility via a DRAM "compiler" (vary size, width, number of subbanks) vs. a fixed block
- Apps: are the advantages in memory bandwidth, energy, and system size enough to offset the challenges? Or: speed, area, power, and yield of DRAM in a logic process. Can one portion slow down and the chip still be attractive?
- Testing: is test time much worse, or better thanks to BIST?
- DRAM operating at 1 watt: every 10-degree increase in operating temperature doubles the refresh rate; what to do? (see the sketch after this list)
- IRAM roles: acts as an MP, acts as a cache to real memory, acts as the low part of the physical address space + OS?
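A minimal sketch of the refresh-rate rule of thumb mentioned in the list above; the doubling-per-10-degrees factor is from the slide, while the baseline refresh interval is an illustrative number I chose.

```python
# Refresh-rate scaling under the rule of thumb quoted above:
# refresh rate doubles for every 10 C rise in operating temperature.
# The baseline value below is illustrative, not from the slide.

def refresh_multiplier(delta_t_c: float) -> float:
    """Factor by which the refresh rate grows for a temperature rise of delta_t_c."""
    return 2 ** (delta_t_c / 10.0)

base_refresh_interval_ms = 64.0   # hypothetical refresh period at the baseline temperature
for rise in (0, 10, 20, 30):
    interval = base_refresh_interval_ms / refresh_multiplier(rise)
    print(f"+{rise:2d} C: refresh every {interval:.1f} ms "
          f"({refresh_multiplier(rise):.0f}x the baseline rate)")
```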

11 Other examples: IBM “Blue Gene”
1 PetaFLOPS in 2005 for $100M? Application: protein folding.
Blue Gene chip: 32 multithreaded RISC processors + ??MB embedded DRAM + high-speed network interface on a single 20 x 20 mm chip; 1 GFLOPS per processor.
- 2' x 2' board = 64 chips (2K CPUs)
- Rack = 8 boards (512 chips, 16K CPUs)
- System = 64 racks (512 boards, 32K chips, 1M CPUs)
Total: 1 million processors in just 2000 sq. ft. (see the arithmetic sketch below)
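The slide's scaling numbers can be checked with a few multiplications; this small script (mine, using only the figures quoted above) reproduces the ~1M CPUs and ~1 PetaFLOPS totals.

```python
# Checking the Blue Gene scaling arithmetic quoted above (my own sanity check).
cpus_per_chip = 32
chips_per_board = 64
boards_per_rack = 8
racks = 64
gflops_per_cpu = 1

cpus = cpus_per_chip * chips_per_board * boards_per_rack * racks
print(cpus)                      # 1,048,576 CPUs, i.e. ~1M processors
print(cpus * gflops_per_cpu)     # ~1,048,576 GFLOPS, i.e. ~1 PetaFLOPS
```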

12 Other examples: Sony Playstation 2
Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5).
Superscalar MIPS core + vector coprocessor + graphics/DRAM.
Claim: "Toy Story" realism brought to games.

13 Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
   - AME: Availability, Maintainability, Evolutionary growth
   - ISTORE's research principles
   - Proposed techniques for achieving AME
   - Benchmarks for AME
   - Conclusions and future work

14 The problem space: big data
Big demand for enormous amounts of data today:
- high-end enterprise and Internet applications
  - enterprise decision-support, data-mining databases
  - online applications: e-commerce, mail, web, archives
- future: infrastructure services, richer data
  - computational & storage back-ends for mobile devices
  - more multimedia content
  - more use of historical data to provide better services
Today's SMP server designs can't easily scale to meet these huge demands.

15 One approach: traditional NAS
Network-attached storage makes storage devices first-class citizens on the network:
- network file server appliances (NetApp, SNAP, ...)
- storage-area networks (CMU NASD, NSIC OOD, ...)
- active disks (CMU, UCSB, Berkeley IDISK)
These approaches primarily target performance scalability:
- scalable networks remove bus bandwidth limitations
- migration of layout functionality to storage devices removes the overhead of intermediate servers
But there are bigger scaling problems than scalable performance!

16 The real scalability problems: AME
Availability: systems should continue to meet quality-of-service goals despite hardware and software failures.
Maintainability: systems should require only minimal ongoing human administration, regardless of scale or complexity.
Evolutionary growth: systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded.
These are problems at today's scales, and will only get worse as systems grow.

17 The ISTORE project vision
Our goal: develop principles and investigate hardware/software techniques for building storage-based server systems that:
- are highly available
- require minimal maintenance
- robustly handle evolutionary growth
- are scalable to O(10000) nodes

18 Principles for achieving AME (1)
No single points of failure; redundancy everywhere.
Performance robustness is more important than peak performance:
- "performance robustness" implies that real-world performance is comparable to best-case performance
Performance can be sacrificed for improvements in AME:
- resources should be dedicated to AME
- compare: biological systems spend > 50% of resources on maintenance
- performance can be made up by scaling the system

19 Principles for achieving AME (2)
Introspection:
- reactive techniques to detect and adapt to failures, workload variations, and system evolution
- proactive techniques to anticipate and avert problems before they happen

20 Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
   - AME: Availability, Maintainability, Evolutionary growth
   - ISTORE's research principles
   - Proposed techniques for achieving AME
   - Benchmarks for AME
   - Conclusions and future work

21 Hardware techniques
Fully shared-nothing cluster organization:
- truly scalable architecture
- architecture that tolerates partial failure
- automatic hardware redundancy
No central processor unit: distribute processing with storage
- if AME is important, resources must be provided to be used for AME
- nodes are responsible for the health of their storage
- serial lines and switches are also growing with Moore's Law; there is less need today to centralize vs. bus-oriented systems

22 Hardware techniques (2)
Heavily instrumented hardware:
- sensors for temperature, vibration, humidity, power, intrusion
- helps detect environmental problems before they can affect system integrity
Independent diagnostic processor on each node:
- provides remote control of power, remote console access to the node, and selection of node boot code
- collects, stores, and processes environmental data to spot abnormalities
- non-volatile "flight recorder" functionality
- all diagnostic processors connected via an independent diagnostic network

23 Hardware techniques (3)
On-demand network partitioning/isolation:
- allows testing and repair of an online system
- managed by the diagnostic processor and network switches via the diagnostic network
Built-in fault injection capabilities:
- power control to individual node components
- injectable glitches into I/O and memory busses
- managed by the diagnostic processor
- used for proactive hardware introspection
  - automated detection of flaky components
  - controlled testing of error-recovery mechanisms
- important for AME benchmarking

24 “Hardware” techniques (4)
Benchmarking:
- one reason for the 1000X improvement in processor performance was the ability to measure (vs. debate) which design is better
  - e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed?
- we need AME benchmarks
  - "what gets measured gets done"
  - "benchmarks shape a field"
  - "quantification brings rigor"

25 ISTORE-1 hardware platform
80-node x86-based cluster, 1.4 TB storage:
- cluster nodes are plug-and-play, intelligent, network-attached storage "bricks"
  - a single field-replaceable unit to simplify maintenance
- each node is a full x86 PC with 256 MB DRAM and an 18 GB disk
- more CPU than NAS; fewer disks per node than a cluster
Intelligent Disk "Brick": portable PC CPU (Pentium II/266) + DRAM, redundant NICs (4 100 Mb/s links), diagnostic processor, half-height canister disk.
ISTORE chassis: 80 nodes, 8 per tray; 2 levels of switches (100 Mb/s and 1 Gb/s); environment monitoring: UPS, redundant power supplies, fans, heat and vibration sensors...

26 ISTORE Brick Block Diagram
[Block diagram: Mobile Pentium II module (CPU, North Bridge, 256 MB DRAM, South Bridge, Super I/O, BIOS) with a SCSI-attached 18 GB disk and four 100 Mb/s Ethernets on PCI; a separate diagnostic processor with dual UART, Flash, RAM, and RTC connects to the diagnostic net, monitors sensors for heat and vibration, and has control over power to individual nodes.]

27 A glimpse into the future?
System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk.
ISTORE HW in 5-7 years:
- building block: a 2006 MicroDrive integrated with IRAM
  - 9 GB disk, 50 MB/sec from disk
  - connected via a crossbar switch
- 10,000 nodes fit into one rack!
O(10,000) scale is our ultimate design point.

28 Software techniques
Fully distributed, shared-nothing code:
- centralization breaks as systems scale up to O(10000)
- avoids single-point-of-failure front ends
Redundant data storage:
- required for high availability; simplifies self-testing
- replication at the level of application objects
  - the application can control consistency policy
  - more opportunity for data placement optimization

29 Software techniques (2)
"River" storage interfaces:
- NOW Sort experience: performance heterogeneity is the norm
  - e.g., disks: outer vs. inner track (1.5X), fragmentation
  - e.g., processors: load (1.5-5X)
- so: demand-driven delivery of data to apps via distributed queues and graduated declustering (see the sketch after this list)
  - for apps that can handle unordered data delivery
  - automatically adapts to variations in the performance of producers and consumers
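As a loose illustration of demand-driven delivery (not the Berkeley River implementation, and omitting graduated declustering), the sketch below uses a shared queue so that a fast consumer naturally pulls more records than a slow one.

```python
# A minimal sketch of demand-driven data delivery through a shared queue,
# in the spirit of the "River" idea above. This is my own illustration, not
# the Berkeley River code: fast consumers drain more records than slow
# ones, so throughput adapts to performance heterogeneity.
import queue
import threading
import time

work = queue.Queue(maxsize=64)          # the distributed queue, modeled locally

def producer(n_records: int) -> None:
    for i in range(n_records):
        work.put(i)                     # blocks if consumers are falling behind
    work.put(None)                      # one sentinel per consumer (2 here)
    work.put(None)

def consumer(name: str, delay_s: float, counts: dict) -> None:
    while True:
        item = work.get()
        if item is None:                # sentinel: no more data
            break
        time.sleep(delay_s)             # emulate a fast vs. slow node
        counts[name] = counts.get(name, 0) + 1

counts: dict = {}
threads = [threading.Thread(target=producer, args=(200,)),
           threading.Thread(target=consumer, args=("fast", 0.001, counts)),
           threading.Thread(target=consumer, args=("slow", 0.005, counts))]
for t in threads: t.start()
for t in threads: t.join()
print(counts)   # the fast consumer ends up with most of the records
```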

30 Software techniques (3)
Reactive introspection:
- use statistical techniques to identify normal behavior and detect deviations from it (a toy sketch follows below)
- policy-driven automatic adaptation to abnormal behavior once detected
  - initially, rely on a human administrator to specify policy
  - eventually, the system learns to solve problems on its own by experimenting on isolated subsets of the nodes
  - one candidate: reinforcement learning
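A toy sketch of the statistical idea, assuming a simple rolling-window, 3-sigma deviation test; the window size and threshold are my assumptions, not something specified by ISTORE.

```python
# Learn what "normal" looks like from a window of recent measurements,
# then flag deviations. Thresholds and the 3-sigma rule are assumptions.
import statistics
from collections import deque

class DeviationDetector:
    def __init__(self, window: int = 100, n_sigmas: float = 3.0):
        self.history = deque(maxlen=window)   # recent "normal" observations
        self.n_sigmas = n_sigmas

    def observe(self, value: float) -> bool:
        """Return True if the sample deviates from recent normal behavior.
        Abnormal samples are not added to the history."""
        if len(self.history) >= 10:           # need some history first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            if abs(value - mean) > self.n_sigmas * stdev:
                return True                   # abnormal: trigger adaptation policy
        self.history.append(value)
        return False

detector = DeviationDetector()
for hits_per_sec in [100, 102, 98, 101, 99, 100, 103, 97, 101, 100, 42]:
    if detector.observe(hits_per_sec):
        print(f"deviation detected: {hits_per_sec} hits/sec")
```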

31 Software techniques (4)
Proactive introspection:
- continuous online self-testing of HW and SW in deployed systems!
- goal is to shake out "Heisenbugs" before they're encountered in normal operation
- needs data redundancy, node isolation, and fault injection
Techniques (a scrubbing sketch follows below):
- fault injection: triggering hardware and software error-handling paths to verify their integrity/existence
- stress testing: push HW/SW to their limits
- scrubbing: periodic restoration of potentially "decaying" hardware or software state
  - self-scrubbing data structures (like MVS)
  - ECC scrubbing for disks and memory
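A minimal sketch of the scrubbing technique under stated assumptions: the checksum scheme, the in-memory "store", and the replica used for repair are illustrative stand-ins, not the actual ISTORE mechanisms.

```python
# Periodically walk stored blocks, verify a checksum, and restore any block
# that has silently "decayed". Real ECC scrubbing works on disk sectors or
# memory words, not Python dictionaries.
import hashlib

store = {}      # block_id -> (data, checksum); stands in for disk/memory
replica = {}    # redundant copy used to repair decayed blocks

def write_block(block_id: str, data: bytes) -> None:
    checksum = hashlib.sha256(data).hexdigest()
    store[block_id] = (data, checksum)
    replica[block_id] = (data, checksum)

def scrub_once() -> None:
    for block_id, (data, checksum) in store.items():
        if hashlib.sha256(data).hexdigest() != checksum:
            print(f"scrub: block {block_id} decayed, restoring from replica")
            store[block_id] = replica[block_id]

write_block("b0", b"important state")
store["b0"] = (b"important stat3", store["b0"][1])   # simulate silent corruption
scrub_once()                                         # detects and repairs b0
```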

32 Applications
ISTORE is not one super-system that demonstrates all these techniques! Initially we provide a library to support the AME goals.
Initial application targets:
- cluster web/ servers
  - self-scrubbing data structures, online self-testing
  - statistical identification of normal behavior
- decision-support database query execution system
  - River-based storage, replica management
- information retrieval for multimedia data
  - self-scrubbing data structures, structuring performance-robust distributed computation

33 Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
   - AME: Availability, Maintainability, Evolutionary growth
   - ISTORE's research principles
   - Proposed techniques for achieving AME
   - Benchmarks for AME
   - Conclusions and future work

34 Availability benchmarks
Questions to answer:
- what factors affect the quality of service delivered by the system, and by how much and for how long?
- how well can systems survive typical failure scenarios?
Availability metrics:
- traditionally, the percentage of time the system is up
  - a time-averaged, binary view of system state (up/down)
- the traditional metric is too inflexible
  - doesn't capture the spectrum of degraded states
  - time-averaging discards important temporal behavior
Solution: measure variation in system quality-of-service metrics over time
- performance, fault-tolerance, completeness, accuracy
(a small example contrasting the two views follows below)
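A small worked example (mine, with made-up numbers) of why the time-averaged up/down metric is too coarse: two systems with very different QoS profiles can score the same, or even invert, depending on which view you take.

```python
# Two systems with different fault behavior, sampled once per minute.
samples = list(range(60))

# System A: crashes outright for 6 minutes (0 hits/sec), then recovers.
qos_a = [100 if not 20 <= t < 26 else 0 for t in samples]
# System B: never "down", but degraded to 40 hits/sec for 30 minutes.
qos_b = [100 if not 15 <= t < 45 else 40 for t in samples]

def binary_availability(qos, up_threshold=1):
    """Traditional metric: fraction of time the system counts as 'up'."""
    return sum(q >= up_threshold for q in qos) / len(qos)

def delivered_service(qos, nominal=100):
    """QoS-aware view: fraction of the ideal work actually delivered."""
    return sum(qos) / (nominal * len(qos))

print(binary_availability(qos_a), binary_availability(qos_b))  # 0.90 vs. 1.00
print(delivered_service(qos_a), delivered_service(qos_b))      # 0.90 vs. 0.70
```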

35 Availability benchmark methodology
Goal: quantify the variation in QoS metrics as events occur that affect system availability.
- leverage existing performance benchmarks
  - to generate fair workloads
  - to measure & trace quality-of-service metrics
- use fault injection to compromise the system
  - hardware faults (disk, memory, network, power)
  - software faults (corrupt input, driver error returns)
  - maintenance events (repairs, SW/HW upgrades)
- examine single-fault and multi-fault workloads
  - the availability analogues of performance micro- and macro-benchmarks

36 Methodology: reporting results
Results are most accessible graphically:
- plot the change in QoS metrics over time
- compare to "normal" behavior?
  - 99% confidence intervals calculated from no-fault runs (a sketch of that calculation follows below)
Can the graphs be distilled into numbers?
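One plausible way to derive such a confidence band is sketched below; the t-based interval and the sample values are my assumptions, since the slide does not specify the statistical procedure used.

```python
# Derive a 99% confidence band for "normal" behavior from repeated
# no-fault runs, then flag observations that fall outside it.
import statistics

# hits/sec at one time point, measured across several no-fault runs (made up)
no_fault_runs = [102.0, 99.5, 101.2, 100.4, 98.9, 100.8, 99.7, 101.5]

mean = statistics.fmean(no_fault_runs)
sem = statistics.stdev(no_fault_runs) / len(no_fault_runs) ** 0.5
t_99 = 3.499           # two-sided 99% t critical value for 7 degrees of freedom

low, high = mean - t_99 * sem, mean + t_99 * sem
print(f"normal band: {low:.1f} .. {high:.1f} hits/sec")

observed_during_fault = 62.0
if not (low <= observed_during_fault <= high):
    print("QoS deviates significantly from no-fault behavior")
```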

37 Example results: software RAID-5
Test systems: Linux/Apache and Windows 2000/IIS
- SpecWeb99 used to measure hits/second as the QoS metric
- fault injection at the disks, based on empirical fault data
  - transient, correctable, uncorrectable, & timeout faults
15 single-fault workloads injected per system; only 4 distinct behaviors observed:
(A) no effect
(B) system hangs
(C) RAID enters degraded mode
(D) RAID enters degraded mode & starts reconstruction
Observations:
- both systems hung (B) on simulated disk hangs
- Linux exhibited (D) on all other errors
- Windows exhibited (A) on transient errors and (C) on uncorrectable, sticky errors

38 Example results: multiple-faults
[Graphs: reconstruction behavior under multiple faults for Windows 2000/IIS and Linux/Apache.]
Windows reconstructs ~3x faster than Linux.
Windows reconstruction noticeably affects application performance, while Linux reconstruction does not.

39 Conclusions (1): Benchmarks
Linux and Windows take opposite approaches to managing benign and transient faults:
- Linux is paranoid and stops using a disk on any error
- Windows ignores most benign/transient faults
- Windows is more robust, except when the disk is truly failing
Linux and Windows have different reconstruction philosophies:
- Linux uses idle bandwidth for reconstruction
- Windows steals application bandwidth for reconstruction
- Windows rebuilds fault-tolerance more quickly
Win2K favors fault-tolerance over performance; Linux favors performance over fault-tolerance.

40 Conclusions (2): ISTORE
Availability, maintainability, and evolutionary growth are key challenges for server systems, more important even than performance.
ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers:
- via clusters of network-attached, computationally-enhanced storage nodes running distributed code
- via hardware and software introspection
- we are currently performing application studies to investigate and compare techniques
Availability benchmarks: a powerful tool?
- they revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000

41 Conclusions (3)
IRAM is attractive for two Post-PC applications because of its low power, small size, and high memory bandwidth:
- Gadgets: embedded/mobile devices
- Scalable infrastructure
ISTORE: a hardware/software architecture for large-scale network services.
Post-PC infrastructure requires:
- new goals: availability, maintainability, evolution
- new principles: introspection, performance robustness
- new techniques: isolation/fault insertion, SW scrubbing
[Still just a vision:] the things discussed here have not yet been implemented.

42 Future work
IRAM: fab and test the chip.
ISTORE:
- implement AME-enhancing techniques in a variety of Internet, enterprise, and information-retrieval applications
- select the best techniques and integrate them into a generic runtime system with an "AME API"
- add maintainability benchmarks
  - can we quantify the administrative work needed to maintain a certain level of availability?
- perhaps look at data security via encryption? consider denial of service? (or a job for IATF?)

43 The UC Berkeley IRAM/ISTORE Projects: Computers for the Post-PC Era
For more information:

44 Backup Slides (mostly in the area of benchmarking)

45 Case study
Software RAID-5 plus web server: Linux/Apache vs. Windows 2000/IIS
Why software RAID?
- well-defined availability guarantees
  - the RAID-5 volume should tolerate a single disk failure
  - reduced performance (degraded mode) after a failure
  - may automatically rebuild redundancy onto a spare disk
- simple system
- easy to inject storage faults
Why a web server?
- an application with measurable QoS metrics that depend on RAID availability and performance

46 Benchmark environment: metrics
QoS metrics measured:
- hits per second
  - roughly tracks response time in our experiments
- degree of fault tolerance in the storage system
Workload generator and data collector: SpecWeb99 web benchmark
- simulates a realistic high-volume user load
- mostly static read-only workload; some dynamic content
- modified to run continuously and to measure average hits per second over each 2-minute interval (see the aggregation sketch below)
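A minimal sketch of that aggregation step, assuming a per-second hit log; this is my own illustration of the bookkeeping, not the modified SpecWeb99 harness.

```python
# Collapse a per-second hit log into average hits/sec per 2-minute interval.
from collections import defaultdict

INTERVAL_S = 120                      # 2-minute reporting interval

def average_hits(hit_log):
    """hit_log: iterable of (timestamp_s, hits_in_that_second) pairs."""
    totals = defaultdict(int)
    for ts, hits in hit_log:
        totals[ts // INTERVAL_S] += hits
    return {bucket: total / INTERVAL_S for bucket, total in sorted(totals.items())}

# e.g. a steady 100 hits/sec that dips to 40 during the second interval
log = [(t, 100 if t < 120 else 40) for t in range(240)]
print(average_hits(log))              # {0: 100.0, 1: 40.0}
```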

47 Benchmark environment: faults
Focus on faults in the storage system (disks).
How do disks fail? According to the Tertiary Disk project, failures include:
- recovered media errors
- uncorrectable write failures
- hardware errors (e.g., diagnostic failures)
- SCSI timeouts
- SCSI parity errors
Note: no head crashes, no fail-stop failures.

48 Disk fault injection technique
To inject reproducible failures, we replaced one disk in the RAID with an emulated disk:
- a PC that appears as a disk on the SCSI bus
- I/O requests are processed in software and reflected to a local disk
- fault injection is performed by altering SCSI command processing in the emulation software (a sketch of the idea follows below)
Types of emulated faults:
- media errors (transient, correctable, uncorrectable)
- hardware errors (firmware, mechanical)
- parity errors
- power failures
- disk hangs/timeouts
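A highly simplified sketch of that interception point, with hypothetical fault names and status codes; the actual ASC VirtualSCSI-based emulator is not shown here.

```python
# Intercept command processing in a toy disk emulator and substitute an
# error according to the active fault scenario. The class, fault names,
# and status codes are illustrative, not the real emulator's interface.
import random

GOOD, MEDIUM_ERROR, HARDWARE_ERROR, TIMEOUT = "GOOD", "MEDIUM_ERROR", "HW_ERROR", "TIMEOUT"

class EmulatedDisk:
    def __init__(self, backing):
        self.backing = backing            # dict standing in for the backing disk
        self.fault = None                 # active fault scenario, or None

    def inject(self, fault: str) -> None:
        self.fault = fault

    def read(self, block: int):
        if self.fault == "transient_media" and random.random() < 0.5:
            return MEDIUM_ERROR, None     # correctable on retry
        if self.fault == "uncorrectable_media":
            return MEDIUM_ERROR, None     # fails every time
        if self.fault == "firmware":
            return HARDWARE_ERROR, None
        if self.fault == "hang":
            return TIMEOUT, None          # never completes from the host's view
        return GOOD, self.backing.get(block, b"\x00" * 512)

disk = EmulatedDisk(backing={0: b"data" * 128})
disk.inject("uncorrectable_media")
print(disk.read(0))                       # ('MEDIUM_ERROR', None)
```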

49 System configuration
RAID-5 volume: 3 GB capacity, 1 GB used per disk
- 3 physical disks, 1 emulated disk, 1 emulated spare disk
Server: AMD K6 CPU with DRAM, Linux or Windows 2000, IDE system disk, Adaptec 2940 on a Fast/Wide SCSI bus (20 MB/sec); RAID data disks are IBM 18 GB 10k RPM SCSI drives.
Disk emulator: AMD K6 running Windows NT 4.0 with the ASC VirtualSCSI library, an Adaptec 2940, an AdvStor ASC-U2W UltraSCSI adapter, a SCSI system disk, and an emulator backing disk (NTFS); it presents the emulated disk and emulated spare disk.
2 web clients connected via 100 Mb switched Ethernet.

50 Results: single-fault experiments
One experiment for each type of fault (15 total):
- only one fault injected per experiment
- no human intervention
- system allowed to continue until it stabilized or crashed
Four distinct system behaviors observed:
(A) no effect: system ignores the fault
(B) RAID system enters degraded mode
(C) RAID system begins reconstruction onto the spare disk
(D) system failure (hang or crash)

51 System behavior: single-fault
[Graphs of QoS over time for each single-fault experiment, grouped by observed behavior: (A) no effect, (B) enter degraded mode, (C) begin reconstruction, (D) system failure.]

52 System behavior: single-fault (2)
- Windows ignores benign faults
- Windows can't automatically rebuild
- Linux reconstructs on all errors
- Both systems fail when the disk hangs

53 Interpretation: single-fault exp’ts
Linux and Windows take opposite approaches to managing benign and transient faults:
- these faults do not necessarily imply a failing disk
  - Tertiary Disk: 368/368 disks had transient SCSI errors; 13/368 disks had transient hardware errors; only 2/368 needed replacing
Linux is paranoid and stops using a disk on any error:
- fragile: the system is more vulnerable to multiple faults
- but no chance of a slowly-failing disk impacting performance
Windows ignores most benign/transient faults:
- robust: less likely to lose data, more disk-efficient
- less likely to catch slowly-failing disks and remove them
Neither policy is ideal! Need a hybrid?

54 Results: multiple-fault experiments
Scenario:
(1) a disk fails
(2) data is reconstructed onto the spare
(3) the spare fails
(4) the administrator replaces both failed disks
(5) data is reconstructed onto the new disks
Notes:
- requires human intervention to initiate reconstruction on Windows 2000
- we simulate a 6-minute sysadmin response time to replace disks
- we simulate 90 seconds of time to replace the hot-swap disks

55 Interpretation: multi-fault exp’ts
Linux and Windows have different reconstruction philosophies:
- Linux uses idle bandwidth for reconstruction
  - little impact on application performance
  - increases the length of time the system is vulnerable to faults
- Windows steals application bandwidth for reconstruction
  - reduces application performance
  - minimizes system vulnerability
  - but must be manually initiated (or scripted)
Windows favors fault-tolerance over performance; Linux favors performance over fault-tolerance: the same design philosophies seen in the single-fault experiments.

56 Maintainability Observations
Scenario: the administrator accidentally removes and replaces a live disk while in degraded mode
- a double failure; no guarantee of data integrity
- theoretically, recovery is possible if writes are queued
Windows recovers, but loses active writes:
- the journalling NTFS volume is not corrupted
- all data not being actively written is intact
Linux will not allow the removed disk to be reintegrated:
- total loss of all data on the RAID volume!

57 Maintainability Observations (2)
Scenario: the administrator adds a new spare
- a common task that can be done with hot-swap drive trays
- Linux requires a reboot for the disk to be recognized
- Windows can dynamically detect the new disk
Windows 2000 RAID is easier to maintain:
- easier GUI configuration
- more flexible in adding disks
- SCSI rescan and NTFS deal with administrator goofs
- less likely to require administration due to transient errors
- BUT reconstruction must be manually initiated when needed

