Slide 1 Computers for the Post-PC Era
David Patterson, University of California at Berkeley
UC Berkeley IRAM Group, UC Berkeley ISTORE Group
May 2000

Slide 2 Perspective on Post-PC Era
The Post-PC Era will be driven by 2 technologies:
1) “Gadgets”: Tiny Embedded or Mobile Devices
–ubiquitous: in everything
–e.g., successor to PDA, cell phone, wearable computers
2) Infrastructure to Support such Devices
–e.g., successor to Big Fat Web Servers, Database Servers

Slide 3 VIRAM-1 Block Diagram

Slide 4 VIRAM-1: System on a Chip
Prototype scheduled for tape-out mid-2000
–0.18 um EDL process
–16 MB DRAM, 8 banks
–MIPS scalar core and caches at 200 MHz
–4-lane, 64-bit vector unit at 200 MHz
–100 MB parallel I/O lines
–17x17 mm, 2 Watts
–25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
–1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit)
[Block diagram: CPU + $, I/O, 4 vector pipes/lanes, memory (64 Mbits / 8 MBytes), crossbar]
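As a sanity check on the peak numbers, a quick back-of-the-envelope calculation; the one-multiply-add-per-lane-per-cycle assumption is mine, not stated on the slide:

```python
# Peak-rate arithmetic for VIRAM-1 (assumes one fused multiply-add,
# i.e. 2 ops, per lane per cycle -- my assumption, not the slide's).
lanes, clock_hz = 4, 200e6
gflops_64bit = lanes * 2 * clock_hz / 1e9      # 4 lanes x 2 ops x 200 MHz
gops_16bit = lanes * 4 * 2 * clock_hz / 1e9    # each 64b lane holds 4 16b subwords
print(gflops_64bit, gops_16bit)                # -> 1.6 6.4
```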

Slide 5 Problem: General Element Permutation
Hardware for a full vector permutation instruction (128 16b elements, 256b datapath)
–Datapath: 16 x 16 (x 16b) crossbar; scales as O(N^2)
–Control: 16 16-to-1 multiplexors; scales as O(N logN)
Other problems
–Consecutive result elements not written together; time/energy wasted on wide vector register file port

Slide 6 Simple Vector Permutations
Simple steps of butterfly permutations
–A register provides the butterfly radix
–Separate instructions for moving elements to left/right
Sufficient semantics for
–Fast reductions of vector registers (dot products)
–Fast FFT/DCT kernels
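To make the semantics concrete, here is a minimal Python sketch of the idea; the function names and exact semantics are my illustration, inferred from the slide rather than taken from the VIRAM ISA. log2(N) butterfly-permute/add pairs finish a dot-product reduction:

```python
# Butterfly permutation with a given radix: element i is paired with
# element i XOR radix (radix is a power of two). A sketch of the
# semantics, not the actual VIRAM instruction.
def vbutterfly(v, radix):
    return [v[i ^ radix] for i in range(len(v))]

def dot_product(a, b):
    """Vector-style dot product: one elementwise multiply, then
    log2(n) butterfly-permute + vector-add steps."""
    acc = [x * y for x, y in zip(a, b)]            # elementwise multiply
    radix = len(a) // 2                            # len(a) must be a power of two
    while radix >= 1:
        moved = vbutterfly(acc, radix)             # one permutation instruction
        acc = [x + y for x, y in zip(acc, moved)]  # one vector add
        radix //= 2
    return acc[0]                                  # every element now holds the sum

assert dot_product([1, 2, 3, 4], [5, 6, 7, 8]) == 70
```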

Slide 7 Hardware for Simple Permutations
Hardware for 128 16b elements, 256b datapath
–Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales as O(N)
–Control: 6 control cases; scales as O(N)
Other benefits
–Consecutive result elements written together
–Buses used only for small radices
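A rough cost model, my own illustration of the O(·) claims on slides 5 and 7 rather than actual gate counts, shows why the simple design scales so much better:

```python
# Contrast the full-crossbar permutation hardware (slide 5) with the
# bus/shifter datapath on this slide, for N 16-bit lanes.
import math

def full_crossbar(n):
    return {"crosspoints": n * n,                        # datapath: O(N^2)
            "mux_control_bits": n * int(math.log2(n))}   # control: O(N log N)

def simple_permute_datapath(n):
    return {"drivers_and_shifters": n,                   # datapath: O(N)
            "control_cases": 6}                          # 6 cases (wiring grows O(N))

for n in (16, 64, 256):
    print(n, full_crossbar(n), simple_permute_datapath(n))
```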

Slide 8 FFT: Straightforward
Problem: most time is spent on short vectors in the later stages of the FFT
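A small illustration of the problem (my own, for a radix-2 FFT on 1024 points): the natural vector length of the butterfly stages halves every stage, so the final stages run with very short vectors:

```python
# Butterfly group size per stage of a radix-2 FFT on n points.
n = 1024
print([n >> (s + 1) for s in range(n.bit_length() - 1)])
# -> [512, 256, 128, 64, 32, 16, 8, 4, 2, 1]; the short-vector tail is
#    where the straightforward vectorization spends most of its time.
```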

Slide 9 FFT: Transpose inside Vector Regs

Slide 10 FFT: Straightforward

Slide 11 VIRAM-1 Design Status
MIPS scalar core
–Synthesizable RTL code received from MIPS
–Cache RAMs to be compiled for IBM technology
–FPU RTL code almost complete
Vector unit
–RTL models for sub-blocks developed; currently being integrated and tested
–Control logic to be compiled for IBM technology
–Full-custom layout for multipliers/adders developed; layout for shifters to be developed
Memory system
–Synthesizable model for DRAM controllers done
–To be integrated with IBM DRAM macros
–Full-custom layout for crossbar under development
Testing infrastructure
–Environment developed for automatic test & validation
–Directed tests for single/multiple instruction groups developed
–Random instruction sequence generator developed

Slide 12 FPU Features
Executes MIPS IV ISA single-precision FP instructions
Thirty-two 32-bit floating point registers
Two 32-bit control registers
One fully pipelined, nearly full IEEE-754 compliant execution unit: 3-cycle latency (division takes 10 cycles) (from Albert)
6-stage pipeline (R-X-X-X-CDB-WB)
Support for partial out-of-order execution and precise exceptions
Scalar core dispatches FP instructions to the FPU using an interface that splits instructions into 3 classes:
–Arithmetic instructions (ADD.S, SUB.S, MUL.S, DIV.S, ABS.S, NEG.S, C.cond.S, CVT.S.W, CVT.W.S, TRUNC.W.S, MOV.S, MOVZ.S, MOVN.S)
–From-coprocessor data transfer instructions (SWC1, MFC1, CFC1)
–To-coprocessor data transfer instructions (LWC1, MTC1, CTC1)
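The three-way dispatch split is easy to restate as a lookup; a small sketch whose function and set names are mine, not taken from the FPU RTL:

```python
# Dispatch classes for FP instructions, as listed on the slide.
ARITHMETIC = {"ADD.S", "SUB.S", "MUL.S", "DIV.S", "ABS.S", "NEG.S",
              "C.cond.S", "CVT.S.W", "CVT.W.S", "TRUNC.W.S",
              "MOV.S", "MOVZ.S", "MOVN.S"}
FROM_COPROCESSOR = {"SWC1", "MFC1", "CFC1"}   # data leaves the FPU
TO_COPROCESSOR = {"LWC1", "MTC1", "CTC1"}     # data enters the FPU

def dispatch_class(mnemonic):
    for name, group in [("arithmetic", ARITHMETIC),
                        ("from-coprocessor", FROM_COPROCESSOR),
                        ("to-coprocessor", TO_COPROCESSOR)]:
        if mnemonic in group:
            return name
    raise ValueError(f"not an FPU instruction: {mnemonic}")

assert dispatch_class("LWC1") == "to-coprocessor"
```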

Slide 13 FPU Architecture

Slide 14 Multiplier Partitioning
64-bit multiplier built from 16-bit multiplier subblocks
Subblocks combined with adders to perform larger multiplies
Performs 2 simultaneous 32-bit multiplies by grouping 4 subblocks
Performs 4 simultaneous 16-bit multiplies by using individual subblocks
Unused blocks turned off to conserve power
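A minimal sketch of the partitioning idea described above — my own model, not the VIRAM netlist — composing a 32x32 multiply from four 16x16 subblock multiplies plus shifts and adds:

```python
# Composing a wider multiply from 16-bit multiplier subblocks.
MASK16 = 0xFFFF

def mul16(a, b):                 # one 16-bit multiplier subblock
    return (a & MASK16) * (b & MASK16)

def mul32(a, b):
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    # four partial products, combined with adders at 16-bit offsets
    return (mul16(a_lo, b_lo)
            + (mul16(a_lo, b_hi) << 16)
            + (mul16(a_hi, b_lo) << 16)
            + (mul16(a_hi, b_hi) << 32))

assert mul32(123456789, 987654321) == 123456789 * 987654321
# In 16-bit mode the four subblocks instead run four independent
# multiplies; unused subblocks are gated off to save power.
```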

Slide 15 FPU Current Status
Current functionality
–Able to execute most instructions (all except C.cond.S, CFC1 and CTC1)
–Supports precise exception semantics
–Functionality verification:
»Used a random test generator that generates/kills instructions at random and compares the results from the RTL Verilog simulator against the results from an ISA Perl simulator
What remains to be done
–Instructions that use the control registers (C.cond.S, CFC1 and CTC1)
–Exception generation
–Integrate execution pipeline with the rest of the design
–Synthesize, place and route
–Final assembly and verification of multiplier
Performance
–Sustainable throughput: 1 instruction/cycle (assuming no data hazards)
–Instruction latency: 6 cycles
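A sketch of the verification loop described above: random instruction sequences run through both the RTL simulator and an ISA reference model, with results compared. The interfaces here (run_rtl, run_isa_model, random_fp_instruction) are placeholders, not the group's actual Verilog/Perl tooling:

```python
# Differential random testing: RTL simulation vs. ISA reference model.
import random

def random_fp_instruction(rng):
    op = rng.choice(["ADD.S", "SUB.S", "MUL.S", "DIV.S", "MOV.S"])
    return (op, rng.randrange(32), rng.randrange(32), rng.randrange(32))

def differential_test(run_rtl, run_isa_model, seed=0, count=10000):
    rng = random.Random(seed)
    program = [random_fp_instruction(rng) for _ in range(count)]
    rtl_state = run_rtl(program)        # RTL Verilog simulation
    ref_state = run_isa_model(program)  # ISA-level reference (Perl in the
                                        # original flow)
    mismatches = [i for i, (a, b) in enumerate(zip(rtl_state, ref_state))
                  if a != b]
    return mismatches                   # empty list == test passes
```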

Slide 16 UC-IBM Agreement
Biggest IRAM obstacle: intellectual property agreement between the University of California and IBM
Can the university accept free fab costs ($2.0M to $2.5M) in return for capped non-exclusive patent licensing fees for IBM if UC files for IRAM patents?
Process started with IBM March 1999; IBM won't give full process info until contract signed
UC started negotiating seriously Jan 2000
Agreement June 1, 2000!

Slide 17 Other examples: IBM “Blue Gene”
1 PetaFLOPS in 2005 for $100M?
Application: Protein Folding
Blue Gene chip
–32 multithreaded RISC processors + ??MB embedded DRAM + high-speed network interface on a single 20 x 20 mm chip
–1 GFLOPS / processor
2’ x 2’ board = 64 chips (2K CPUs)
Rack = 8 boards (512 chips, 16K CPUs)
System = 64 racks (512 boards, 32K chips, 1M CPUs)
Total: 1 million processors in just 2000 sq. ft.
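The packaging arithmetic checks out; a quick verification using only the numbers on the slide:

```python
# Checking the slide's packaging arithmetic (all numbers from the slide).
cpus_per_chip, chips_per_board = 32, 64
boards_per_rack, racks = 8, 64

chips = chips_per_board * boards_per_rack * racks    # 32,768 chips
cpus = chips * cpus_per_chip                         # 1,048,576 CPUs
petaflops = cpus * 1e9 / 1e15                        # at 1 GFLOPS/processor
print(chips, cpus, petaflops)                        # -> 32768 1048576 ~1.05
```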

Slide 18 Other examples: Sony Playstation 2
Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
–Superscalar MIPS core + vector coprocessor + graphics/DRAM
–Claim: “Toy Story” realism brought to games

Slide 19 Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
–AME: Availability, Maintainability, Evolutionary growth
–ISTORE’s research principles
–Benchmarks for AME
3) Conclusions and future work

Slide 20 Lampson: Systems Challenges
Systems that work
–Meeting their specs
–Always available
–Adapting to changing environment
–Evolving while they run
–Made from unreliable components
–Growing without practical limit
Credible simulations or analysis
Writing good specs
Testing
Performance
–Understanding when it doesn’t matter
“Computer Systems Research - Past and Future,” keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft

Slide 21 Hennessy: What Should the “New World” Focus Be?
Availability
–Both appliance & service
Maintainability
–Two functions:
»Enhancing availability by preventing failure
»Ease of SW and HW upgrades
Scalability
–Especially of service
Cost
–Per device and per service transaction
Performance
–Remains important, but it’s not SPECint
“Back to the Future: Time to Return to Longstanding Problems in Computer Systems?” Keynote address, FCRC, May 1999. John Hennessy, Stanford

Slide 22 The real scalability problems: AME
Availability
–Systems should continue to meet quality of service goals despite hardware and software failures
Maintainability
–Systems should require only minimal ongoing human administration, regardless of scale or complexity
Evolutionary Growth
–Systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
These are problems at today’s scales, and will only get worse as systems grow

Slide 23 Principles for achieving AME (1)
No single points of failure
Redundancy everywhere
Performance robustness is more important than peak performance
–“Performance robustness” implies that real-world performance is comparable to best-case performance
Performance can be sacrificed for improvements in AME
–Resources should be dedicated to AME
»Compare: biological systems spend > 50% of resources on maintenance
–Can make up performance by scaling the system

Slide 24 Principles for achieving AME (2)
Introspection
–Reactive techniques to detect and adapt to failures, workload variations, and system evolution
–Proactive techniques to anticipate and avert problems before they happen

Slide 25 ISTORE-1 hardware platform
80-node x86-based cluster, 1.4TB storage
–Cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
»A single field-replaceable unit to simplify maintenance
–Each node is a full x86 PC w/ 256MB DRAM, 18GB disk
–More CPU than NAS; fewer disks/node than cluster
ISTORE chassis: 80 nodes, 8 per tray; 2 levels of switches (20 100 Mbit/s, 2 1 Gbit/s); environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
Intelligent Disk “Brick”: portable PC CPU (Pentium II/266 + DRAM), redundant NICs (4 100 Mb/s links), diagnostic processor, disk, half-height canister

Slide 26 ISTORE-1 Status
10 nodes manufactured; boots OS
Diagnostic processor interface SW complete
PCB backplane: not yet designed
Finish 80-node system: Summer 2000

Slide 27 Hardware techniques
Fully shared-nothing cluster organization
–Truly scalable architecture
–Architecture that tolerates partial failure
–Automatic hardware redundancy

Slide 28 Hardware techniques (2)
No Central Processor Unit: distribute processing with storage
–Serial lines, switches also growing with Moore’s Law; less need today to centralize vs. bus-oriented systems
–Most storage servers limited by speed of CPUs; why does this make sense?
–Why not amortize sheet metal, power, cooling infrastructure for disk to add processor, memory, and network?
–If AME is important, must provide resources to be used to help AME: local processors responsible for health and maintenance of their storage

Slide 29 Hardware techniques (3)
Heavily instrumented hardware
–Sensors for temperature, vibration, humidity, power, intrusion
–Helps detect environmental problems before they can affect system integrity
Independent diagnostic processor on each node
–Provides remote control of power, remote console access to the node, selection of node boot code
–Collects, stores, processes environmental data for abnormalities
–Non-volatile “flight recorder” functionality
–All diagnostic processors connected via an independent diagnostic network
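A sketch of the “flight recorder” idea: a bounded log of environmental readings with a simple abnormality check. The persistence and sensor API here are assumptions; the real ISTORE diagnostic processor keeps this in non-volatile storage and reports over the diagnostic network:

```python
# Bounded "flight recorder" log of environmental sensor readings.
from collections import deque
import time

class FlightRecorder:
    def __init__(self, capacity=4096):
        self.log = deque(maxlen=capacity)   # oldest entries age out

    def record(self, sensor, value, limit):
        entry = (time.time(), sensor, value)
        self.log.append(entry)
        if value > limit:                   # simple abnormality check
            self.raise_alarm(entry)

    def raise_alarm(self, entry):
        # in ISTORE this would go out over the diagnostic network
        print("abnormal reading:", entry)

recorder = FlightRecorder()
recorder.record("temperature_C", 41.5, limit=45.0)   # normal
recorder.record("vibration_g", 2.3, limit=1.0)       # triggers alarm
```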

Slide 30 Hardware techniques (4)
On-demand network partitioning/isolation
–Internet applications must remain available despite failures of components, therefore a subset can be isolated for preventative maintenance
–Allows testing, repair of online system
–Managed by diagnostic processor and network switches via diagnostic network

Slide 31 Hardware techniques (5)
Built-in fault injection capabilities
–Power control to individual node components
–Injectable glitches into I/O and memory busses
–Managed by diagnostic processor
–Used for proactive hardware introspection
»Automated detection of flaky components
»Controlled testing of error-recovery mechanisms
–Important for AME benchmarking (see next slide)

Slide 32 “Hardware” techniques (6)
Benchmarking
–One reason for the 1000X improvement in processor performance was the ability to measure (vs. debate) which design is better
»e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed?
–Need AME benchmarks
“What gets measured gets done”
“Benchmarks shape a field”
“Quantification brings rigor”

Slide 33 Availability benchmark methodology
Goal: quantify variation in QoS metrics as events occur that affect system availability
Leverage existing performance benchmarks
–To generate fair workloads
–To measure & trace quality of service metrics
Use fault injection to compromise the system
–Hardware faults (disk, memory, network, power)
–Software faults (corrupt input, driver error returns)
–Maintenance events (repairs, SW/HW upgrades)
Examine single-fault and multi-fault workloads
–The availability analogues of performance micro- and macro-benchmarks
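A skeleton of this methodology — my own structure, following the slide: run a performance workload, inject faults partway through, and trace a QoS metric over time. workload_step and inject_fault stand in for real benchmark and fault-injection hooks:

```python
# Availability benchmark loop: trace QoS over time while injecting faults.
import time

def availability_run(workload_step, inject_fault, faults, duration_s=600):
    """Returns (t, qos) samples; `faults` maps injection time -> fault."""
    samples, start = [], time.time()
    while (now := time.time() - start) < duration_s:
        for t_inject in [t for t in faults if t <= now]:
            inject_fault(faults.pop(t_inject))   # e.g. kill a disk
        qos = workload_step()                    # e.g. hits/sec this interval
        samples.append((now, qos))
    return samples
```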

Slide 34 Benchmark Availability? Methodology for reporting results
Results are most accessible graphically
–Plot change in QoS metrics over time
–Compare to “normal” behavior
»99% confidence intervals calculated from no-fault runs
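Computing the 99% confidence band from no-fault runs, as the slide suggests, so faulty-run QoS curves can be compared against “normal” behavior; a minimal sketch assuming several no-fault runs sampled at the same intervals:

```python
# 99% confidence band over QoS samples from repeated no-fault runs.
import statistics

def confidence_band(no_fault_runs, z=2.576):   # z for 99% (normal approx.)
    """no_fault_runs: list of runs, each a list of QoS samples per interval."""
    band = []
    for interval in zip(*no_fault_runs):       # same interval across runs
        mean = statistics.mean(interval)
        sem = statistics.stdev(interval) / len(interval) ** 0.5
        band.append((mean - z * sem, mean + z * sem))
    return band   # plot each faulty run's QoS curve against this band
```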

Slide 35 Example results: multiple faults
Windows reconstructs ~3x faster than Linux
Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
[Graphs: Windows 2000/IIS vs. Linux/Apache]

Slide 36 Conclusions (1): ISTORE
Availability, Maintainability, and Evolutionary growth are key challenges for server systems
–More important even than performance
ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
–Via clusters of network-attached, computationally-enhanced storage nodes running distributed code
–Via hardware and software introspection
–We are currently performing application studies to investigate and compare techniques
Availability benchmarks a powerful tool?
–Revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000

Slide 37 Conclusions (2)
IRAM is attractive for two Post-PC applications because of its low power, small size, and high memory bandwidth
–Gadgets: Embedded/Mobile devices
–Infrastructure: Intelligent Storage and Networks
Post-PC infrastructure requires
–New Goals: Availability, Maintainability, Evolution
–New Principles: Introspection, Performance Robustness
–New Techniques: Isolation/fault insertion, Software scrubbing
–New Benchmarks: measure, compare AME metrics