IRAM and ISTORE Projects

Presentation transcript:

IRAM and ISTORE Projects Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Kimberly Keeton, Christoforos Kozyrakis, David Martin, Rich Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, Kathy Yelick, and David Patterson http://iram.cs.berkeley.edu/[istore] Winter 2000 IRAM/ISTORE Retreat

IRAM Vision: Intelligent PDA Pilot PDA + gameboy, cell phone, radio, timer, camera, TV remote, am/fm radio, garage door opener, ... + Wireless data (WWW) + Speech, vision, video + Voice output for conversations Speech control +Vision to see, scan documents, read bar code, ...

ISTORE Hardware Vision System-on-a-chip enables computer, memory, without significantly increasing size of disk 5-7 year target: MicroDrive: 1.7” x 1.4” x 0.2” 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek 2006: 9 GB, 50 MB/s? (1.6X/yr capacity, 1.4X/yr BW) Integrated IRAM processor: 2x height Connected via crossbar switch growing like Moore’s law 10,000+ nodes in one rack!
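
The 2006 MicroDrive figures on this slide follow from compounding the quoted yearly growth rates over seven years. A minimal sketch of that arithmetic, assuming simple compound growth from the 1999 values (the function and variable names are illustrative, not from the slides):

```python
# Compound the 1999 MicroDrive figures forward to 2006 using the
# growth rates quoted on the slide (1.6X/yr capacity, 1.4X/yr bandwidth).

def project(value_1999, growth_per_year, years):
    """Return the value after `years` of compound growth."""
    return value_1999 * growth_per_year ** years

YEARS = 2006 - 1999  # 7 years

capacity_mb = project(340, 1.6, YEARS)    # starts at 340 MB
bandwidth_mb_s = project(5, 1.4, YEARS)   # starts at 5 MB/s

print(f"2006 capacity:  {capacity_mb / 1000:.1f} GB")
print(f"2006 bandwidth: {bandwidth_mb_s:.0f} MB/s")
```

The results land close to the slide's rounded targets of 9 GB and 50 MB/s.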

VIRAM: System on a Chip Prototype scheduled for tape-out 1H 2000 0.18 um EDL process 16 MB DRAM, 8 banks MIPS Scalar core and caches @ 200 MHz 4 64-bit vector unit pipelines @ 200 MHz 4 100 MB parallel I/O lines 17x17 mm, 2 Watts 25.6 GB/s memory (6.4 GB/s per direction and per Xbar) 1.6 Gflops (64-bit), 6.4 GOPs (16-bit) [Floorplan figure: two memory blocks (64 Mbits / 8 MBytes each), 4 vector pipes/lanes, CPU + caches, Xbar, I/O]

IRAM Architecture Update ISA mostly frozen since 6/99 better fixed-point model and instructions gained some experience using them over past year better exception model better support for short vectors auto-increment memory addressing instructions for in-register reductions & butterfly-permutations memory consistency model spec refined (poster) Suite of simulators actively used and maintained vsim-isa (functional), vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)

IRAM Software Update Vectorizing Compiler for VIRAM retargeting CRAY vectorizing compiler (talk) Initial backend complete: scalar and vector instructions Extensive testing for correct functionality Instruction scheduling and performance tuning begun Applications using compiler underway Speech processing (talk) Small benchmarks; suggestions welcome Hand-coded fixed point applications Video encoder application complete (poster) FFT, floating point done, fixed point started (talk)

IRAM Chip Update IBM to supply embedded DRAM/Logic (98%) DRAM macro added to 0.18 micron logic process DRAM specs under NDA; final agreement in UCB bureaucracy MIPS to supply scalar core (99%) MIPS processor, caches, TLB MIT to supply FPU (100%) single precision (32 bit) only VIRAM-1 Tape-out scheduled for mid-2000 Some updates of micro-architecture based on benchmarks (talk) Layout of multiplier (poster), register file nearly complete Test strategy developed (talk) Demo system high level hardware design complete (talk) Network interface design complete (talk)

VIRAM-1 block diagram

Microarchitecture configuration 2 arithmetic units both execute integer operations one executes FP operations 4 64-bit datapaths (lanes) per unit 2 flag processing units for conditional execution and speculation support 1 load-store unit optimized for strides 1,2,3, and 4 4 addresses/cycle for indexed and strided operations decoupled indexed and strided stores Memory system 8 DRAM banks 256-bit synchronous interface 1 sub-bank per bank 16 Mbytes total capacity Peak performance 3.2 GOPS64, 12.8 GOPS16 (w. madd) 1.6 GOPS64, 6.4 GOPS16 (wo. madd) 0.8 GFLOPS64, 3.2 GFLOPS32 (w. madd) 6.4 Gbyte/s memory bandwidth
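
The integer peak rates and the memory bandwidth on this slide can be reproduced from the unit counts. The sketch below assumes the 200 MHz clock quoted on the "VIRAM: System on a Chip" slide, counts a multiply-add as two operations, and assumes each 64-bit lane splits into four 16-bit sub-words; these are assumptions chosen to match the slide, not a statement of how the authors derived the numbers. The floating-point figures are left out because the slide does not give enough detail to rederive them.

```python
# Rederive the integer peak rates and memory bandwidth from the
# microarchitecture parameters. CLOCK_HZ is taken from an earlier slide.

CLOCK_HZ = 200e6      # assumed: 200 MHz vector unit clock
ARITH_UNITS = 2       # both arithmetic units execute integer operations
LANES = 4             # 64-bit datapaths (lanes) per unit
LANE_BITS = 64

def peak_gops(element_bits, madd):
    """Peak integer rate in GOPS for a given element width."""
    subwords = LANE_BITS // element_bits       # elements packed per lane
    ops_per_element = 2 if madd else 1         # madd counted as 2 ops
    ops_per_cycle = ARITH_UNITS * LANES * subwords * ops_per_element
    return ops_per_cycle * CLOCK_HZ / 1e9

def memory_bandwidth_gbytes(interface_bits=256):
    """Peak bandwidth of the synchronous memory interface in Gbyte/s."""
    return interface_bits / 8 * CLOCK_HZ / 1e9

for bits in (64, 16):
    print(f"GOPS{bits}: {peak_gops(bits, True):.1f} w. madd, "
          f"{peak_gops(bits, False):.1f} wo. madd")
print(f"Memory bandwidth: {memory_bandwidth_gbytes():.1f} Gbyte/s")
```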

Media Kernel Performance

Base-line system comparison All numbers in cycles/pixel MMX and VIS results assume all data in L1 cache

Scaling to 10K Processors IRAM + micro-disk offer huge scaling opportunities Still many hard system problems, SAM AME (talk) Availability 24 x7 databases without human intervention Discrete vs. continuous model of machine being up Maintainability 42% of system failures are due to administrative errors self-monitoring, tuning, and repair Evolution Dynamic scaling with plug-and-play components Scalable performance, gracefully down as well as up Machines become heterogeneous in performance at scale

ISTORE-1: Hardware for AME Intelligent Disk “Brick” Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault injection hardware intelligence used to collect and filter monitoring data diagnostics and fault injection enhance robustness networked to create a scalable shared-nothing cluster Intelligent Disk “Brick” Portable PC Processor: Pentium II+ DRAM Redundant NICs (4 100 Mb/s links) Diagnostic Processor Intelligent Chassis 80 nodes, 8 per tray 2 levels of switches 20 100 Mb/s 2 1 Gb/s Environment Monitoring: UPS, redundant PS, fans, heat and vibration sensors... Disk Half-height canister

ISTORE Brick Block Diagram Mobile Pentium II Module SCSI North Bridge CPU Disk (18 GB) South Bridge Diagnostic Net DUAL UART DRAM 256 MB Super I/O Monitor & Control Diagnostic Processor BIOS Ethernets 4x100 Mb/s PCI Sensors for heat and vibration Control over power to individual nodes Flash RTC RAM

ISTORE Software Approach Two-pronged approach to providing reliability: 1) reactive self-maintenance: dynamic reaction to exceptional system events self-diagnosing, self-monitoring hardware software monitoring and problem detection automatic reaction to detected problems 2) proactive self-maintenance: continuous online self- testing and self-analysis automatic characterization of system components in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they’re used

ISTORE Applications Storage-intensive, reliable services for ISTORE-1 infrastructure for “thin clients,” e.g., PDAs web services, such as mail and storage large-scale databases (talk) information retrieval (search and on-the-fly indexing) Scalable memory-intensive computations for ISTORE in 2006 Performance estimates through IRAM simulation + model not major emphasis Large-scale defense and scientific applications enabled by high memory bw and arithmetic performance

Performance Availability System performance limited by the weakest link NOW Sort experience: performance heterogeneity is the norm disks: inner vs. outer track (50%), fragmentation processors: load (1.5-5x) and heat Virtual Streams: dynamically off-load I/O work from slower disks to faster ones

ISTORE Update High level hardware design by UCB complete (talk) Design of ISTORE boards handed off to Anigma First run complete; SCSI problem to be fixed Testing of UCB design (DP), to start asap 10 nodes by end of 1Q 2000, 80 by 2Q 2000 Design of BIOS handed off to AMI Most parts donated or discounted Adaptec, Andataco, IBM, Intel, Micron, Motorola, Packet Engines Proposal for Quantifying AME (talk) Beginning work on short-term applications: mail server, web server, large database, decision support primitives; these will be used to drive principled system design

Conclusions IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth Mobile consumer electronic devices Scaleable infrastructure IRAM benchmarking result: faster than DSPs ISTORE: hardware/software architecture for large scale network services Scaling systems requires new continuous models of availability performance not limited by the weakest link self* systems to reduce human interaction [Still just a vision:] the things I’ve been talking about have not yet been implemented.

Backup Slides

Introduction and Ground Rules Who is here? Mixed IRAM/ISTORE “experience” Questions are welcome during talks Schedule: lecture from Brewster Kahle during Thursday’s Open Mic Session. Feedback is required (Fri am) Be careful, we have been known to listen to you Mixed experience: please ask Time for skiing and talking tomorrow afternoon

2006 ISTORE ISTORE node Add 20% pad to MicroDrive size for packaging, connectors Then double thickness to add IRAM 2.0” x 1.7” x 0.5” (51 mm x 43 mm x 13 mm) Crossbar switches growing by Moore’s Law 2x/1.5 yrs => 4X transistors/3 yrs Crossbars grow by N^2 => 2X switch/3 yrs 16 x 16 in 1999 => 64 x 64 in 2005 ISTORE rack (19” x 33” x 84”) 1 tray (3” high) => 16 x 32 => 512 ISTORE nodes / tray 20 trays+switches+UPS => 10,240 ISTORE nodes / rack (!)
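
The packaging and scaling arithmetic on this slide can be checked step by step. A minimal sketch, assuming the MicroDrive dimensions from the "ISTORE Hardware Vision" slide and treating "2X switch/3 yrs" as a doubling of crossbar ports every three years (all names are illustrative):

```python
# Node size: pad the MicroDrive by 20% in every dimension, then double
# the thickness to make room for the IRAM chip.
MICRODRIVE_IN = (1.7, 1.4, 0.2)  # length, width, thickness in inches

length, width, thickness = (d * 1.2 for d in MICRODRIVE_IN)
thickness *= 2
node_in = (round(length, 1), round(width, 1), round(thickness, 1))

# Rack capacity: a 16 x 32 grid of nodes per tray, 20 trays per rack.
nodes_per_tray = 16 * 32
nodes_per_rack = 20 * nodes_per_tray

def crossbar_ports(ports_1999, year):
    """Crossbar size assuming the port count doubles every 3 years."""
    doublings = (year - 1999) // 3
    return ports_1999 * 2 ** doublings

print("Node size (in):", node_in)
print("Nodes per tray:", nodes_per_tray)
print("Nodes per rack:", nodes_per_rack)
print("Crossbar ports in 2005:", crossbar_ports(16, 2005))
```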

IRAM/VSUIF Decryption (IDEA) # lanes Virtual processor width IDEA Decryption operates on 16-bit ints Compiled with IRAM/VSUIF Note scalability of both #lanes and data width Some hand-optimizations (unrolling) will be automated by Cray compiler

1D FFT on IRAM FFT study on IRAM bit-reversal time included; cost hidden using indexed store Faster than DSPs on floating point (32-bit) FFTs CRI Pathfinder does 24-bit fixed point, 1K points in 28 usec (2 Watts without SRAM)

3D FFT on ISTORE 2006 Performance of large 3D FFTs depends on 2 factors: speed of 1D FFT on a single node (next slide) network bandwidth for “transposing” data 1.3 Tflop FFT possible w/ 1K IRAM nodes, if network bisection bandwidth scales (!)

ISTORE-1 System Layout Brick shelf Brick shelf Brick shelf Brick shelf

V-IRAM1: 0.18 µm, Fast Logic, 200 MHz 1.6 GFLOPS(64b)/6.4 GOPS(16b)/32MB [Block diagram: 2-way superscalar processor with 16K I cache and 16K D cache; vector instruction queue; vector registers; vector units (+, x, ÷, load/store) configurable as 4 x 64, 8 x 32, or 16 x 16; I/O at 100MB each; memory crossbar switch connecting to memory banks (M) over 4 x 64 paths] 1Gbit technology Put in perspective: 10X of Cray T90 today

Fixed-point multiply-add model Multiply half word & Shift & Round Add & Saturate [Datapath figure with signals x and y (n/2 bits), w and z (n bits), and a; stages: multiply, round, add, saturate] Same basic model, different set of instructions fixed-point: multiply & shift & round, shift right & round, shift left & saturate integer saturated arithmetic: add or sub & saturate added multiply-add instruction for improved performance and energy consumption
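
The multiply, shift, round, add, saturate sequence described on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the VIRAM instruction definition: the rounding mode (round to nearest, half up), the signed two's-complement range, and the names `fixed_madd` and `saturate` are all hypothetical choices, since the slide does not specify them.

```python
def saturate(value, bits):
    """Clamp value to the signed two's-complement range of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def fixed_madd(x, y, w, shift, n=16):
    """Compute sat(w + round((x * y) >> shift)) for an n-bit element.

    x and y are half-word (n/2-bit) operands, so their product fits in
    n bits. Rounding mode is an assumption: nearest, ties rounded up.
    """
    product = x * y
    if shift > 0:
        product = (product + (1 << (shift - 1))) >> shift
    return saturate(w + product, n)

print(fixed_madd(100, 50, 10, 4))       # normal case, no saturation
print(fixed_madd(127, 127, 32767, 0))   # clamps at the positive limit
```

Saturating instead of wrapping keeps an overflowing pixel or audio sample at full scale rather than flipping its sign, which is the usual motivation for this model in media kernels.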

Other ISA modifications Auto-increment loads/stores a vector load/store can post-increment its base address added base (16), stride (8), and increment (8) registers necessary for applications with short vectors or scaled-up implementations Butterfly permutation instructions perform step of a butterfly permutation within a vector register used for FFT and reduction operations Miscellaneous instructions added min and max instructions (integer and FP) FP reciprocal and reciprocal square root
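
To show why a butterfly permutation step is useful for reductions, the sketch below models a vector register as a Python list and pairs element i with element i XOR distance at each step. The exact semantics of the VIRAM instruction are not given on the slide, so this pairing rule and the names `butterfly_step` and `butterfly_sum` are assumptions for illustration only.

```python
def butterfly_step(vec, distance):
    """Return vec with each element swapped with its partner at i XOR distance."""
    return [vec[i ^ distance] for i in range(len(vec))]

def butterfly_sum(vec):
    """In-register sum reduction using log2(n) butterfly steps.

    After the last step every element holds the total.
    """
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("vector length must be a power of two")
    distance = n // 2
    while distance >= 1:
        partner = butterfly_step(vec, distance)
        vec = [a + b for a, b in zip(vec, partner)]  # element-wise vector add
        distance //= 2
    return vec

print(butterfly_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

Each step is one permutation plus one full-width vector add, so the reduction never leaves the vector register file.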

Major architecture updates Integer arithmetic units support multiply-add instructions 1 load-store unit: complexity vs. benefit Optimize for strides 2, 3, and 4 useful for complex arithmetic and image processing functions Decoupled strided and indexed stores memory stalls due to bank conflicts do not stall the arithmetic pipelines allows scheduling of independent arithmetic operations in parallel with stores that experience many stalls implemented with address, not data, buffering currently examining a similar optimization for loads

Micro-kernel results: simulated systems Note : simulations performed with 2 load-store units and without decoupled stores or optimizations for strides 2, 3, and 4

Micro-kernels Vectorization and scheduling performed manually

Scaled system results Near linear speedup for all applications apart from iDCT iDCT bottlenecks: large number of bank conflicts, 4 addresses/cycle for strided accesses

iDCT scaling with sub-banks Sub-banks reduce bank conflicts and increase performance Alternative (but not as effective) ways to reduce conflicts: different memory layout different address interleaving schemes

Compiling for VIRAM Long-term success of DIS technology depends on simple programming model, i.e., a compiler Needs to handle significant class of applications IRAM: multimedia, graphics, speech and image processing ISTORE: databases, signal processing, other DIS benchmarks Needs to utilize hardware features for performance IRAM: vectorization ISTORE: scalability of shared-nothing programming model

IRAM Compilers IRAM/Cray vectorizing compiler [Judd] Production compiler Used on the T90, C90, as well as the T3D and T3E Being ported (by SGI/Cray) to the SV2 architecture Has C, C++, and Fortran front-ends (focus on C) Extensive vectorization capability outer loop vectorization, scatter/gather, short loops, … VIRAM port is under way IRAM/VSUIF vectorizing compiler [Krashinsky] Based on VSUIF from Corinna Lee’s group at Toronto which is based on MachineSUIF from Mike Smith’s group at Harvard which is based on SUIF compiler from Monica Lam’s group at Stanford This is a “research” compiler, not intended for compiling large complex applications It has been working since 5/99.

IRAM/Cray Compiler Status [Compiler structure figure: frontends (C, Fortran, C++), vectorizer, PDGCS, code generators (IRAM, C90)] MIPS backend developed this year Validated using a commercial test suite for code generation Vector backend recently started Testing with simulator under way Leveraging from Cray: automatic vectorization

VIRAM/VSUIF Matrix/Vector Multiply VIRAM/VSUIF does reasonably well on long loops 256x256 single matrix Compare to 1600 Mflop/s (peak without multadd) Note BLAS-2 (little reuse) ~350 on Power3 and EV6 Problems specific to VSUIF: hand strip-mining results in short loops, reductions, no multadd support [Chart series: mvm, vmm]

Reactive Self-Maintenance ISTORE defines a layered system model for monitoring and reaction: [Layer diagram: reaction mechanisms (provided by application); ISTORE API; coordination of reaction, problem detection, and SW monitoring (provided by ISTORE runtime system, governed by policies); self-monitoring hardware] ISTORE API defines interface between runtime system and app. reaction mechanisms Policies define system’s monitoring, detection, and reaction behavior

Proactive Self-Maintenance Continuous online self-testing of HW and SW detects flaky, failing, or buggy components via: fault injection: triggering hardware and software error handling paths to verify their integrity/existence stress testing: pushing HW/SW components past normal operating parameters scrubbing: periodic restoration of potentially “decaying” hardware or software state automates preventive maintenance Dynamic HW/SW component characterization used to adapt to heterogeneous hardware and behavior of application software components

ISTORE-0 Prototype and Plans ISTORE-0: testbed for early experimentation with ISTORE research ideas Hardware: cluster of 6 PCs intended to model ISTORE-1 using COTS components nodes interconnected using ISTORE-1 network fabric custom fault-injection hardware on subset of nodes Initial research plans runtime system software fault injection scalability, availability, maintainability benchmarking applications: block storage server, database, FFT

Runtime System Software Demonstrate simple policy-driven adaptation within context of a single OS and application software monitoring information collected and processed in realtime e.g., health & performance parameters of OS, application problem detection and coordination of reaction controlled by a stock set of configurable policies application-level adaptation mechanisms invoked to implement reaction Use experience to inform ISTORE API design Investigate reinforcement learning as technique to infer appropriate reactions from goals

Record-breaking performance is not the common case NOW-Sort records demonstrate peak performance But perturb just 1 of 8 nodes and... Records set at 4AM!

Virtual Streams: Dynamic load balancing for I/O Replicas of data serve as second sources Maintain a notion of each process’s progress Arbitrate use of disks to ensure equal progress The right behavior, but what mechanism? Process Virtual Streams Software Disk Arbiter

Graduated Declustering: A Virtual Streams implementation Clients send progress, servers schedule in response [Figure: Server0 to Server3 feeding Client0 to Client3, before and after a slowdown; per-link bandwidth labels include B, B/2, B/4, 3B/8, 5B/8, and 7B/8]
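
The bandwidth labels in the figure (B/4, 3B/8, B/2, 5B/8, 7B/8) are consistent with a simple equal-progress allocation. The sketch below assumes a ring layout in which client i reads from server i and server i+1, and that one server slows to half speed; the layout, the tie-breaking choice (the slowest server splits its bandwidth evenly), and the function name are assumptions, since the figure itself did not survive in the transcript. It does not guard against shares going negative under extreme slowdowns.

```python
from fractions import Fraction

def graduated_shares(capacities):
    """Split each server's bandwidth so every client gets an equal total.

    Assumed layout: client i reads from server i ("own") and from
    server (i + 1) % n ("next"). Returns (per-client target, bandwidth
    each server gives its own client, bandwidth each server gives the
    previous client).
    """
    n = len(capacities)
    target = sum(capacities) / n  # equal progress for all clients
    slowest = min(range(n), key=lambda i: capacities[i])

    to_own = [None] * n
    to_own[slowest] = capacities[slowest] / 2  # slow server splits evenly
    i = slowest
    for _ in range(n - 1):
        nxt = (i + 1) % n
        # client i needs: to_own[i] + (capacities[nxt] - to_own[nxt]) == target
        to_own[nxt] = to_own[i] + capacities[nxt] - target
        i = nxt
    to_prev = [capacities[k] - to_own[k] for k in range(n)]
    return target, to_own, to_prev

B = Fraction(1)
target, to_own, to_prev = graduated_shares([B / 2, B, B, B])
print("per-client bandwidth:", target)
for k, (own, prev) in enumerate(zip(to_own, to_prev)):
    print(f"Server{k}: {own} to Client{k}, {prev} to Client{(k - 1) % 4}")
```

With one of four servers at B/2, every client ends up at 7B/8 instead of one client dropping to 3B/4, which is the "equal progress" behavior the Virtual Streams slide asks for.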

Read Performance: Multiple Slow Disks

Storage Priorities: Research v. Users Traditional Research Priorities 1) Performance 1’) Cost 3) Scalability 4) Availability 5) Maintainability ISTORE Priorities 1) Maintainability 2) Availability 3) Scalability 4) Performance 5) Cost } easy to measure } hard to measure

Intelligent Storage Project Goals ISTORE: a hardware/software architecture for building scaleable, self-maintaining storage An introspective system: it monitors itself and acts on its observations Self-maintenance: does not rely on administrators to configure, monitor, or tune system

Self-maintenance Failure management devices must fail fast without interrupting service predict failures and initiate replacement failures should not require immediate human intervention System upgrades and scaling new hardware automatically incorporated without interruption new devices immediately improve performance or repair failures Performance management system must adapt to changes in workload or access patterns

ISTORE-I: 2H99 Intelligent disk Portable PC Hardware: Pentium II, DRAM Low Profile SCSI Disk (9 to 18 GB) 4 100-Mbit/s Ethernet links per node Placed inside Half-height canister Monitor Processor/path to power off components? Intelligent Chassis 64 nodes: 8 enclosures, 8 nodes/enclosure 64 x 4 or 256 Ethernet ports 2 levels of Ethernet switches: 14 small, 2 large Small: 20 100-Mbit/s + 2 1-Gbit; Large: 25 1-Gbit Just for prototype; crossbar chips for real system Enclosure sensing, UPS, redundant PS, fans, ...

Disk Limit Continued advance in capacity (60%/yr) and bandwidth (40%/yr) Slow improvement in seek, rotation (8%/yr) Time to read whole disk: 1990: 4 minutes sequentially, 6 hours randomly (1 sector/seek); 1999: 35 minutes sequentially, 1 week(!) randomly (1 sector/seek) Does the 3.5” form factor make sense in 5-7 years?

Related Work ISTORE adds to several recent research efforts Active Disks, NASD (UCSB, CMU) Network service appliances (NetApp, Snap!, Qube, ...) High availability systems (Compaq/Tandem, ...) Adaptive systems (HP AutoRAID, M/S AutoAdmin, M/S Millennium) Plug-and-play system construction (Jini, PC Plug&Play, ...)

Other (Potential) Benefits of ISTORE Scalability: add processing power, memory, network bandwidth as disks are added Smaller footprint vs. traditional server/disk Less power embedded processors vs. servers spin down idle disks? For decision-support or web-service applications, potentially better performance than traditional servers

Disk Limit: I/O Buses Cannot use 100% of bus Queuing Theory (< 70%) Command overhead (Effective size = size x 1.2) Multiple copies of data, SW layers Bus rate vs. Disk rate SCSI: Ultra2 (40 MHz), Wide (16 bit): 80 MByte/s FC-AL: 1 Gbit/s = 125 MByte/s (single disk in 2002) [Figure: CPU and memory on the memory bus; internal I/O bus; external I/O bus (PCI); controllers (C), including SCSI, each with up to 15 disks]

State of the Art: Seagate Cheetah 36 36.4 GB, 3.5 inch disk 12 platters, 24 surfaces 10,000 RPM 18.3 to 28 MB/s internal media transfer rate (14 to 21 MB/s user data) 9772 cylinders (tracks), (71,132,960 sectors total) Avg. seek: read 5.2 ms, write 6.0 ms (Max. seek: 12/13,1 track: 0.6/0.9 ms) $2100 or 17MB/$ (6¢/MB) (list price) 0.15 ms controller time source: www.seagate.com
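
The 1999 row of the "Disk Limit" slide (35 minutes sequentially, 1 week randomly) can be roughly reproduced from the Cheetah 36 figures above. This is a sketch under assumptions: the slides do not say which disk the table used, the sequential rate is taken as the midpoint of the 14 to 21 MB/s user data range, and each random access is modeled as average read seek plus half a rotation plus controller time.

```python
# Estimate whole-disk read times from the Seagate Cheetah 36 figures.

CAPACITY_BYTES = 36.4e9
SECTORS = 71_132_960
RPM = 10_000
AVG_READ_SEEK_S = 5.2e-3
CONTROLLER_S = 0.15e-3
USER_RATE_BYTES_S = (14e6 + 21e6) / 2   # assumed: midpoint of 14-21 MB/s

def sequential_read_minutes():
    """Time to stream the whole disk at the sustained user data rate."""
    return CAPACITY_BYTES / USER_RATE_BYTES_S / 60

def random_read_days():
    """Time to read every sector with one seek per sector."""
    half_rotation_s = 0.5 * 60 / RPM     # average rotational latency
    per_sector_s = AVG_READ_SEEK_S + half_rotation_s + CONTROLLER_S
    return SECTORS * per_sector_s / 86_400

print(f"sequential: {sequential_read_minutes():.0f} minutes")
print(f"random:     {random_read_days():.1f} days")
```

Both estimates come out near the table's values, which illustrates the slide's point: capacity grows far faster than seek and rotation times improve.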

User Decision Support Demand vs. Processor speed Database demand: 2X / 9-12 months “Greg’s Law” Database-Proc. Performance Gap: “Moore’s Law” CPU speed 2X / 18 months Moore’s Law is a laggard 250%/year for Greg 60%/year for Moore 7%/year for DRAM Decision support is linear in database size
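
The doubling periods on this slide translate into yearly growth factors as follows. A minimal sketch, assuming steady exponential growth; the function name is illustrative. The slide's "250%/year" matches the 9-month end of the 9-12 month range if it is read as roughly a 2.5x yearly factor.

```python
def annual_factor(doubling_months):
    """Yearly growth factor for a quantity that doubles every `doubling_months`."""
    return 2 ** (12 / doubling_months)

greg_fast = annual_factor(9)    # database demand, fast end of the range
greg_slow = annual_factor(12)   # database demand, slow end of the range
moore = annual_factor(18)       # CPU speed

print(f"Greg's Law:  {greg_slow:.2f}x to {greg_fast:.2f}x per year")
print(f"Moore's Law: {moore:.2f}x per year")
print(f"Gap widens by up to {greg_fast / moore:.2f}x per year")
```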