ISTORE Update
David Patterson, University of California at Berkeley (Patterson@cs.berkeley.edu)
UC Berkeley IRAM Group / UC Berkeley ISTORE Group (istore-group@cs.berkeley.edu)
May 2000

Lampson: Systems Challenges
- Systems that work:
  - Meeting their specs
  - Always available
  - Adapting to changing environment
  - Evolving while they run
  - Made from unreliable components
  - Growing without practical limit
- Credible simulations or analysis
- Writing good specs
- Testing
- Performance
  - Understanding when it doesn't matter

"Computer Systems Research: Past and Future," keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft.

Hennessy: What Should the "New World" Focus Be?
- Availability
  - Both appliance and service
- Maintainability
  - Two functions: enhancing availability by preventing failure, and easing SW and HW upgrades
- Scalability
  - Especially of the service
- Cost
  - Per device and per service transaction
- Performance
  - Remains important, but it's not SPECint

"Back to the Future: Time to Return to Longstanding Problems in Computer Systems?" Keynote address, FCRC, May 1999. John Hennessy, Stanford.

The real scalability problems: AME
- Availability: systems should continue to meet quality-of-service goals despite hardware and software failures
- Maintainability: systems should require only minimal ongoing human administration, regardless of scale or complexity
- Evolutionary Growth: systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded

These are problems at today's scales, and will only get worse as systems grow.

Principles for achieving AME (1)
- No single points of failure; redundancy everywhere
- Performance robustness is more important than peak performance
  - "Performance robustness" implies that real-world performance is comparable to best-case performance
- Performance can be sacrificed for improvements in AME
  - Resources should be dedicated to AME (compare: biological systems spend >50% of resources on maintenance)
  - Performance can be made up by scaling the system

Principles for achieving AME (2)
- Introspection
  - Reactive techniques to detect and adapt to failures, workload variations, and system evolution
  - Proactive techniques to anticipate and avert problems before they happen

ISTORE-1 hardware platform
- 80-node x86-based cluster, 1.4 TB storage
  - Cluster nodes are plug-and-play, intelligent, network-attached storage "bricks": a single field-replaceable unit to simplify maintenance
  - Each node is a full x86 PC with 256 MB DRAM and an 18 GB disk
  - More CPU than NAS; fewer disks per node than a cluster
- Intelligent Disk "Brick": portable-PC CPU (Pentium II/266) + DRAM, redundant NICs (four 100 Mb/s links), diagnostic processor, and the disk in a half-height canister
- ISTORE Chassis: 80 nodes, 8 per tray; 2 levels of switches (twenty 100 Mbit/s, two 1 Gbit/s); environment monitoring; UPS, redundant power supplies, fans, heat and vibration sensors...

ISTORE-1 Status
- 10 nodes manufactured; 45 boards fabbed, 40 to go
- Boots the OS
- Diagnostic-processor interface SW complete
- PCB backplane: not yet designed
- Finish 80-node system: Summer 2000

Hardware techniques
- Fully shared-nothing cluster organization
  - Truly scalable architecture
  - Architecture that tolerates partial failure
  - Automatic hardware redundancy

Hardware techniques (2)
- No central processor unit: distribute processing with storage
  - Serial lines and switches are also growing with Moore's Law, so there is less need today to centralize than in bus-oriented systems
  - Most storage servers are limited by the speed of their CPUs; why does this make sense?
  - Why not amortize the sheet metal, power, and cooling infrastructure for the disk to add a processor, memory, and network?
  - If AME is important, resources must be provided to help AME: local processors are responsible for the health and maintenance of their storage

Hardware techniques (3)
- Heavily instrumented hardware
  - Sensors for temperature, vibration, humidity, power, intrusion
  - Helps detect environmental problems before they can affect system integrity
- Independent diagnostic processor on each node
  - Provides remote control of power, remote console access to the node, and selection of node boot code
  - Collects, stores, and processes environmental data, watching for abnormalities
  - Non-volatile "flight recorder" functionality (sketched below)
  - All diagnostic processors are connected via an independent diagnostic network
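As a rough illustration of the "flight recorder" idea, here is a minimal ring-buffer sketch in Python. All names are hypothetical, and the real diagnostic processor would keep this data in non-volatile storage rather than in memory:

```python
import collections
import time

class FlightRecorder:
    """Fixed-size ring buffer of environmental samples (illustrative sketch).

    The ISTORE diagnostic processor stores this history in non-volatile
    memory; here an in-memory deque stands in for it.
    """
    def __init__(self, capacity=4096):
        # Once full, the oldest samples are overwritten, like a flight
        # recorder's circular tape.
        self.samples = collections.deque(maxlen=capacity)

    def record(self, sensor, value):
        self.samples.append((time.time(), sensor, value))

    def dump_since(self, t0):
        # Retrieve recent history after a crash or anomaly for post-mortem analysis.
        return [s for s in self.samples if s[0] >= t0]

recorder = FlightRecorder()
recorder.record("temperature_C", 41.5)
recorder.record("vibration_g", 0.02)
```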

Hardware techniques (4)
- On-demand network partitioning/isolation
  - Internet applications must remain available despite component failures, so a subset of the system can be isolated for preventative maintenance
  - Allows testing and repair of an online system
  - Managed by the diagnostic processors and network switches via the diagnostic network

Hardware techniques (5)
- Built-in fault injection capabilities
  - Power control to individual node components
  - Injectable glitches into I/O and memory busses
  - Managed by the diagnostic processor
- Used for proactive hardware introspection
  - Automated detection of flaky components
  - Controlled testing of error-recovery mechanisms
- Important for AME benchmarking (see next slide)

"Hardware" techniques (6)
- Benchmarking
  - One reason for the 1000X gain in processor performance was the ability to measure (vs. debate) which design is better
  - e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed?
- Need AME benchmarks
  - "What gets measured gets done"
  - "Benchmarks shape a field"
  - "Quantification brings rigor"

Availability benchmark methodology
- Goal: quantify variation in QoS metrics as events occur that affect system availability
- Leverage existing performance benchmarks to generate fair workloads and to measure and trace quality-of-service metrics
- Use fault injection to compromise the system:
  - Hardware faults (disk, memory, network, power)
  - Software faults (corrupt input, driver error returns)
  - Maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads: the availability analogues of performance micro- and macro-benchmarks (a minimal harness sketch follows)
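A minimal sketch of this methodology in Python. Both `run_request()` and `inject_fault()` are hypothetical stand-ins, not part of any real benchmark suite: the first represents one operation of a performance benchmark, the second a hook into something like the diagnostic processor's fault-injection facility:

```python
import time

def run_request():
    """Hypothetical stand-in for one operation of a performance benchmark."""
    time.sleep(0.01)  # pretend work
    return 1          # requests completed

def inject_fault():
    """Hypothetical hook: e.g., ask the diagnostic processor to fail a disk."""
    print("fault injected")

def availability_run(duration_s=60, fault_at_s=20, window_s=1.0):
    # Trace a QoS metric (throughput per window) over time, injecting a
    # single fault partway through -- an availability "micro-benchmark".
    trace, start, injected = [], time.time(), False
    while time.time() - start < duration_s:
        window_end, done = time.time() + window_s, 0
        while time.time() < window_end:
            done += run_request()
        elapsed = time.time() - start
        if not injected and elapsed >= fault_at_s:
            inject_fault()
            injected = True
        trace.append((elapsed, done / window_s))  # (time, throughput)
    return trace
```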

Benchmark Availability? Methodology for reporting results
- Results are most accessible graphically:
  - Plot the change in QoS metrics over time
  - Compare to "normal" behavior: 99% confidence intervals calculated from no-fault runs (a sketch of this computation follows)
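One plausible way to compute that "normal" band, as an illustration rather than the project's exact statistics: pool per-window throughput samples from several no-fault runs and form a 99% band around the mean (z = 2.576 under a normal approximation):

```python
import statistics

def normal_band(no_fault_windows, z=2.576):
    """99% band for 'normal' QoS, from no-fault runs (sketch).

    no_fault_windows: per-window throughput samples pooled across
    several no-fault benchmark runs.
    """
    mean = statistics.mean(no_fault_windows)
    sd = statistics.stdev(no_fault_windows)
    return mean - z * sd, mean + z * sd

def deviations(trace, band):
    # Flag windows in a faulty run whose QoS falls outside normal behavior.
    lo, hi = band
    return [(t, q) for t, q in trace if not (lo <= q <= hi)]
```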

Example single-fault result
[Figure: QoS over time during SW RAID reconstruction, one panel each for Linux and Solaris]
- Compares Linux and Solaris reconstruction
  - Linux: minimal performance impact, but a longer window of vulnerability to a second fault
  - Solaris: large performance impact, but restores redundancy fast

Reconstruction Policy
- Linux: favors performance over data availability
  - Automatically-initiated reconstruction using idle bandwidth
  - Virtually no performance impact on the application
  - Very long window of vulnerability (>1 hr for a 3 GB RAID)
- Solaris: favors data availability over application performance
  - Automatically-initiated reconstruction at high bandwidth
  - As much as a 34% drop in application performance
  - Short window of vulnerability (10 minutes for 3 GB)
- Windows: favors neither!
  - Manually-initiated reconstruction at moderate bandwidth
  - As much as an 18% application performance drop
  - Somewhat short window of vulnerability (23 min for 3 GB)
(the reconstruction bandwidths these windows imply are worked out below)
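A back-of-the-envelope reading of those windows, treating 3 GB as 3,000 MB (these bandwidths are inferred from the slide's numbers, not reported on it):

$$
\mathrm{BW_{Solaris}} \approx \frac{3000\,\mathrm{MB}}{600\,\mathrm{s}} = 5\,\mathrm{MB/s},\qquad
\mathrm{BW_{Windows}} \approx \frac{3000\,\mathrm{MB}}{1380\,\mathrm{s}} \approx 2.2\,\mathrm{MB/s},\qquad
\mathrm{BW_{Linux}} < \frac{3000\,\mathrm{MB}}{3600\,\mathrm{s}} \approx 0.83\,\mathrm{MB/s}
$$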

Transient Error Handling Policy
- Linux is paranoid with respect to transients
  - Stops using the affected disk (and reconstructs) on any error, transient or not
  - Fragile: the system is more vulnerable to multiple faults
  - Disk-inefficient: wastes two disks per transient
  - But no chance of a slowly-failing disk impacting performance
- Solaris and Windows are more forgiving
  - Both ignore most benign/transient faults
  - Robust: less likely to lose data, more disk-efficient
  - Less likely to catch slowly-failing disks and remove them
- Neither policy is ideal! Need a hybrid that detects streams of transients (one possible sketch follows)
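The slide does not specify the hybrid; one plausible form, sketched here with made-up thresholds, tolerates isolated transients but retires a disk once the transient rate within a sliding window crosses a limit:

```python
import collections
import time

class TransientStreamDetector:
    """Hybrid error policy sketch: ignore isolated transients, but treat a
    burst of them as evidence of a slowly-failing disk."""
    def __init__(self, max_errors=5, window_s=3600.0):
        self.max_errors = max_errors  # transients tolerated per window
        self.window_s = window_s
        self.errors = collections.defaultdict(collections.deque)

    def on_error(self, disk_id, hard):
        if hard:
            return "retire"  # non-transient errors retire the disk immediately
        now = time.time()
        q = self.errors[disk_id]
        q.append(now)
        while q and now - q[0] > self.window_s:  # expire old transients
            q.popleft()
        # A stream of transients within the window suggests real decay.
        return "retire" if len(q) > self.max_errors else "ignore"
```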

Software techniques
- Fully-distributed, shared-nothing code
  - Centralization breaks as systems scale up to O(10,000)
  - Avoids single-point-of-failure front ends
- Redundant data storage
  - Required for high availability; simplifies self-testing
  - Replication at the level of application objects
    - The application can control the consistency policy
    - More opportunity for data-placement optimization

Software techniques (2)
- "River" storage interfaces
  - NOW Sort experience: performance heterogeneity is the norm
    - e.g., disks: outer vs. inner track (1.5X), fragmentation
    - e.g., processors: load (1.5-5X)
  - So: demand-driven delivery of data to apps, via distributed queues and graduated declustering, for apps that can handle unordered data delivery
  - Automatically adapts to variations in the performance of producers and consumers
  - Also helps with evolutionary growth of the cluster
(a toy sketch of demand-driven delivery follows)
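To make the demand-driven idea concrete, a toy single-process sketch: consumers pull records from a shared queue at their own pace, so a fast consumer naturally processes more data than a slow one. This illustrates only the queue half of River; graduated declustering is not shown:

```python
import queue
import threading
import time

work = queue.Queue()
for record in range(1000):
    work.put(record)  # producer: records delivered in arbitrary order

counts = {}

def consumer(name, delay_s):
    # Each consumer pulls when it is ready; a slow consumer simply takes
    # fewer records, so delivery adapts to heterogeneity automatically.
    done = 0
    while True:
        try:
            work.get(timeout=0.5)
        except queue.Empty:
            break
        time.sleep(delay_s)  # simulated per-record processing cost
        done += 1
    counts[name] = done

threads = [threading.Thread(target=consumer, args=("fast", 0.001)),
           threading.Thread(target=consumer, args=("slow", 0.005))]
for t in threads: t.start()
for t in threads: t.join()
print(counts)  # "fast" ends up with roughly 5x the records of "slow"
```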

Software techniques (3)
- Reactive introspection
  - Use statistical techniques to identify normal behavior and detect deviations from it (a sketch follows below)
  - Policy-driven automatic adaptation to abnormal behavior once detected
    - Initially, rely on a human administrator to specify policy
    - Eventually, the system learns to solve problems on its own by experimenting on isolated subsets of the nodes
    - One candidate: reinforcement learning
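One simple statistical technique of this flavor, as an illustrative sketch rather than the project's actual algorithm: maintain a running mean and variance of a metric online (Welford's method) and flag samples more than k standard deviations from the mean:

```python
import math

class OnlineAnomalyDetector:
    """Running mean/variance via Welford's algorithm; flags outliers."""
    def __init__(self, k=3.0):
        self.k, self.n, self.mean, self.m2 = k, 0, 0.0, 0.0

    def observe(self, x):
        # Check the new sample against the model of "normal" learned so far...
        anomalous = False
        if self.n >= 30:  # wait for a minimal baseline
            sd = math.sqrt(self.m2 / (self.n - 1))
            anomalous = abs(x - self.mean) > self.k * sd
        # ...then fold it into the running statistics.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

det = OnlineAnomalyDetector()
for sample in [10.1, 9.9, 10.0] * 20 + [25.0]:
    if det.observe(sample):
        print("deviation from normal behavior:", sample)
```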

Software techniques (4)
- Proactive introspection
  - Continuous online self-testing of HW and SW in deployed systems!
    - The goal is to shake out "Heisenbugs" before they're encountered in normal operation
    - Needs data redundancy, node isolation, fault injection
  - Techniques:
    - Fault injection: triggering hardware and software error-handling paths to verify their integrity/existence
    - Stress testing: pushing HW/SW to their limits
    - Scrubbing: periodic restoration of potentially "decaying" hardware or software state, e.g., self-scrubbing data structures (like MVS) and ECC scrubbing for disks and memory (a sketch follows)
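A minimal sketch of the scrubbing idea, assuming a hypothetical replicated block store: periodically re-read every copy, take the majority checksum as truth, and repair a decayed replica from a healthy one. A real scrubber would also use disk and memory ECC and run continuously in the background:

```python
import hashlib

def checksum(data):
    return hashlib.sha256(bytes(data)).hexdigest()

def scrub_once(replicas):
    """One scrubbing pass over a replicated block (sketch).

    replicas: dict mapping node name -> bytearray of the same logical block.
    """
    sums = {node: checksum(data) for node, data in replicas.items()}
    # Majority vote on the correct contents.
    good = max(set(sums.values()), key=lambda s: list(sums.values()).count(s))
    source = next(n for n, s in sums.items() if s == good)
    for node, s in sums.items():
        if s != good:
            replicas[node][:] = replicas[source]  # restore the decayed copy
            print(f"scrub: repaired {node} from {source}")

blocks = {"node1": bytearray(b"payload"), "node2": bytearray(b"payload"),
          "node3": bytearray(b"pXyload")}  # node3 has silently decayed
scrub_once(blocks)
```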

Initial Applications
- ISTORE is not one super-system that demonstrates all these techniques!
  - Initially, provide middleware and a library to support the AME goals
- Initial application targets:
  - Cluster web/email servers: self-scrubbing data structures, online self-testing, statistical identification of normal behavior
  - Information retrieval for multimedia data: self-scrubbing data structures, structuring performance-robust distributed computation

A glimpse into the future?
- System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
- ISTORE HW in 5-7 years:
  - Building block: a 2006 MicroDrive integrated with IRAM
    - 9 GB disk, 50 MB/sec from disk
    - Connected via a crossbar switch
  - If low power, 10,000 nodes fit into one rack!
- O(10,000) scale is our ultimate design point

Future Targets
- Maintenance in DoD applications
- Security in computer systems
- Computer vision

Maintenance in DoD systems
- Introspective middleware, built-in fault injection, a diagnostic computer, isolatable subsystems... should reduce maintenance of DoD hardware and software systems
- Is maintenance a major concern of DoD?
- Does improved maintenance fit within the goals of Polymorphous Computing Architecture?

Security in DoD Systems?
- The separate diagnostic processor and network open up interesting security possibilities:
  - Monitoring of behavior by a separate computer
  - Isolation of a portion of the cluster from the rest of the network
  - Remote reboot and software installation

Attacking Computer Vision
- Analogy: computer vision recognition in 2000 is like computer speech recognition in 1985
  - Pre-1985, the community was searching for good algorithms: classic AI vs. statistics?
  - By 1985, consensus was reached on statistics
  - The field focused and made progress, using special hardware
  - Systems became fast enough to train on data rather than preload information, which accelerated progress
  - By 1995, speech recognition systems were starting to deploy
  - By 2000, they were widely used and available on PCs

Computer Vision at Berkeley
- Jitendra Malik believes he has a very promising approach, a 2-step process:
  1) Segmentation: divide the image into regions of coherent color, texture, and motion
  2) Recognition: combine regions and search an image database to find a match
- Algorithms for (1) work well, just slowly (300 seconds per image on a PC)
- Algorithms for (2) are being tested this summer using hundreds of PCs, which will determine their accuracy

Human Quality Computer Vision
- Suppose the algorithms work: what would it take to match human vision?
- At 30 images per second: segmentation
  - Convolution and vector-matrix multiply of sparse matrices (10,000 x 10,000, 10% nonzero per row), 32-bit floating point
  - 300 seconds on a PC (assuming 333 MFLOPS) => ~100 GFLOPs per image
  - At 30 Hz => a 3,000-GFLOPS machine to do segmentation (the arithmetic is worked below)
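The per-image figure follows directly from the measured PC time (a worked version of the slide's arithmetic):

$$
300\,\mathrm{s} \times 333\times10^{6}\,\tfrac{\mathrm{FLOPs}}{\mathrm{s}} \approx 10^{11}\,\mathrm{FLOPs} = 100\,\mathrm{GFLOPs\ per\ image};\qquad
100\,\mathrm{GFLOPs} \times 30\,\tfrac{\mathrm{images}}{\mathrm{s}} = 3000\,\mathrm{GFLOPS}
$$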

Human Quality Computer Vision
- At 1 image per second: object recognition
  - A human can remember 10,000 to 100,000 objects per category (e.g., 10K faces, 10K Chinese characters, a high-school vocabulary of 50K words, ...)
  - To recognize a 3D object, need ~10 2D views
  - 100 x 100 x 8 bits (or fewer) per view => 10,000 x 10 x 100 x 100 bytes, or 10^9 bytes
  - Pruning using color and texture, and organizing shapes into an index, reduces shape matches to 1,000
  - Compare 1,000 candidate merged regions with 1,000 candidate object images
  - If this takes 10 hours on a PC (333 MFLOPS) => 12,000 GFLOPS (the arithmetic is worked below)
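Working the slide's numbers through, for the storage estimate and for the rate needed to finish the 10-hour PC workload in one second:

$$
10^{4}\,\mathrm{objects} \times 10\,\mathrm{views} \times (100 \times 100)\,\tfrac{\mathrm{bytes}}{\mathrm{view}} = 10^{9}\,\mathrm{bytes};\qquad
10\,\mathrm{h} \times 3600\,\tfrac{\mathrm{s}}{\mathrm{h}} \times 333\,\mathrm{MFLOPS} \approx 1.2\times10^{13}\,\mathrm{FLOPs} \Rightarrow 12{,}000\,\mathrm{GFLOPS\ at\ 1\ image/s}
$$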

ISTORE Successor does Human Quality Vision?
- 10,000 nodes with system-on-a-chip + Microdrive + network
  - 1 to 10 GFLOPS per node => 10,000 to 100,000 GFLOPS
  - High-bandwidth network
  - 1 to 10 GB of disk storage per node => can replicate the images on every node
- Need dependability and maintainability advances to keep 10,000 nodes useful
- Is human-quality vision useful for DoD apps? Retrainable recognition?

Conclusions (1): ISTORE
- Availability, Maintainability, and Evolutionary growth are the key challenges for server systems: more important even than performance
- ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
  - Via clusters of network-attached, computationally-enhanced storage nodes running distributed code
  - Via hardware and software introspection
  - We are currently performing application studies to investigate and compare techniques
- Availability benchmarks: a powerful tool?
  - Revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000