
Slide 1 IRAM Original Plan
A processor architecture for embedded/portable systems running media applications
–Based on media processing and embedded DRAM
–Simple, scalable, and efficient
–Good compiler target
Microprocessor prototype with
–256-bit media processor, 16 MBytes DRAM
–150 million transistors, 290 mm²
–3.2 Gops, 2 W at 200 MHz
–Industrial-strength compiler
–Implemented by 6 graduate students

Slide 2 Architecture Details Review
MIPS64™ 5Kc core (200 MHz)
–Single-issue core with 6-stage pipeline
–8 KByte, direct-mapped instruction and data caches
–Single-precision scalar FPU
Vector unit (200 MHz)
–8 KByte register file (32 64b elements per register)
–4 functional units:
»2 arithmetic (1 FP), 2 flag processing
»256b datapaths per functional unit
–Memory unit
»4 address generators for strided/indexed accesses
»2-level TLB structure: 4-ported, 4-entry microTLB and single-ported, 32-entry main TLB
»Pipelined to sustain up to 64 pending memory accesses
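The 3.2 Gops figure on Slide 1 can be sanity-checked from the parameters above. This is a back-of-the-envelope sketch, assuming the peak rate counts 32-bit operations across the two 256b arithmetic datapaths per cycle:

```python
# Hypothetical peak-rate check: 2 arithmetic units, each with a 256-bit
# datapath processing 32-bit elements, clocked at 200 MHz.
datapath_bits = 256
element_bits = 32
arith_units = 2
clock_hz = 200e6

ops_per_cycle = (datapath_bits // element_bits) * arith_units  # 8 * 2 = 16
peak_gops = ops_per_cycle * clock_hz / 1e9
print(peak_gops)  # 3.2
```

Under these assumptions the numbers line up exactly with the quoted 3.2 Gops.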

Slide 3 Modular Vector Unit Design
Single 64b "lane" design replicated 4 times
–Reduces design and testing time
–Provides a simple scaling model (up or down) without major control or datapath redesign
Most instructions require only intra-lane interconnect
–Tolerance to interconnect delay scaling
[Figure: four identical 64b lanes under common 256b control; each lane contains a 64b crossbar interface, two integer datapaths, an FP datapath, vector register elements, and flag register elements and datapaths]

Slide 4 Alternative Floorplans (1)
–"VIRAM-7MB": 4 lanes, 8 MBytes, 190 mm², … Gops at 200 MHz (32-bit ops)
–"VIRAM-2Lanes": 2 lanes, 4 MBytes, 120 mm², … Gops at 200 MHz
–"VIRAM-Lite": 1 lane, 2 MBytes, 60 mm², … Gops at 200 MHz

Slide 5 Power Consumption
Power-saving techniques
–Low power supply for logic (1.2 V)
»Possible because of the low clock rate (200 MHz)
»Wide vector datapaths provide high performance
–Extensive clock gating and datapath disabling
»Utilizing the explicit parallelism information of vector instructions and conditional execution
–Simple, single-issue, in-order pipeline
Typical power consumption: 2.0 W
–MIPS core: 0.5 W
–Vector unit: 1.0 W (min ~0 W)
–DRAM: 0.2 W (min ~0 W)
–Misc.: 0.3 W (min ~0 W)
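A quick arithmetic check of the power budget above, confirming that the per-component typical figures sum to the quoted 2.0 W total:

```python
# Per-component typical power figures from the slide, in watts.
power_w = {"MIPS core": 0.5, "Vector unit": 1.0, "DRAM": 0.2, "Misc.": 0.3}
total = sum(power_w.values())
print(total)  # 2.0
```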

Slide 6 VIRAM Compiler
Based on Cray's PDGCS production environment for vector supercomputers
Extensive vectorization and optimization capabilities, including outer-loop vectorization
No need to use special libraries or variable types for vectorization
[Figure: C, C++, and Fortran95 frontends feed Cray's PDGCS optimizer, with code generators targeting C90/T90/SV1, T3D/T3E, and SV2/VIRAM]

Slide 7 The IRAM Team
Hardware:
–Joe Gebis, Christoforos Kozyrakis, Ioannis Mavroidis, Iakovos Mavroidis, Steve Pope, Sam Williams
Software:
–Alan Janin, David Judd, David Martin, Randi Thomas
Advisors:
–David Patterson, Katherine Yelick
Help from:
–IBM Microelectronics, MIPS Technologies, Cray, Avanti

Slide 8 IRAM update
–Verification of chip
–Scheduled tape-out
–Package
–Clock cycle time/power estimates
–Demo board

Slide 9 Current Debug / Verification Efforts
Current
–m5kc+fpu: program simulation on RTL
–m5kc+vu+xbar+dram: program simulation on RTL
–Arithmetic Unit (AU): corner cases + random values on Verilog netlist
–Vector Register File: only a few cases have been spiced (layout); hundreds of tests were run through TimeMill
To Do
–Entire VIRAM-1 (m5kc+vu+fpu+xbar+dram): program simulation on RTL

Slide 10 Progress
[Table: test-progress matrix showing, for each simulated configuration (m5kc; m5kc+fpu; m5kc+vu+xbar+dram; entire VIRAM-1; synthesized MIPS; synthesized VIRAM-1), which test categories have been run: ISA, XC's, Arithmetic→Kernels, TLB→Kernels, random, compiled; plus the MIPS, vector, and FPU subsets of the VIRAM-1 testsuite]
–MIPS testsuite is about 1700 test-mode combinations, plus <100 FP test-mode combinations that are valid for the VIRAM-1 FPU
–Additionally, the entire VIRAM-1 testsuite has about 2700 tests, ~24M instructions, and 4M lines of asm code
–The vector unit currently passes nearly all of them for big-endian, user mode
–There are about 200 exception tests for both coprocessors
–Kernel tests are long, but there are only about 100 of them
–Arithmetic→Kernels must be run on the combined design
–Additional microarchitecture-specific and vector TAP tests have been run
–Currently running random tests to find bugs

Slide 11 IRAM update: Schedule
–Scheduled tape-out was May 1, 2001
–Based on the schedule, IBM was expecting June or July 2001
–We think we'll make June 2001

Slide 12 IRAM update: Package/Impact
–Kyocera 304-pin Quad Flat Pack; cavity is 20.0 x 20.0 mm
–Must allow space around die (… mm)
–Simplify bonding by putting pads on all 4 sides
–Need to shrink DRAM to make it fit
–Simplify routing by allowing extra height in lane: 14 MB => 3.0 mm, 13 MB => 3.8 mm, 12 MB => 4.8 mm
–=> 13 MB ± 1 MB, depending on how routing goes
–(Also shows the strength of the design style, in that memory and die size can be adjusted at a late stage)

Slide 13 Floorplan
Technology: IBM SA-27E
–0.18 µm CMOS
–6 metal layers (copper)
280 mm² die area
–18.72 x 15 mm
–~200 mm² for memory/logic
–DRAM: ~140 mm²
–Vector lanes: ~50 mm²
Transistor count: >100M
Power supply
–1.2 V for logic, 1.8 V for DRAM

Slide 14 IRAM update: Clock cycle/power
–Clock rate target was 200 MHz at 1.2 V for logic, to keep total power at 2 W
–The MIPS synthesizable core will not run at 200 MHz at 1.2 V
–Options: keep the 2 W (1.2 V) target and accept whatever clock rate results (~170 vs. 200 MHz), or keep the 200 MHz clock target and raise the voltage to whatever it needs (1.8 V?)
–Plan is to stay with 1.2 V, since the register file was designed at 1.2 V

Slide 15 MIPS Demo Board
–Runs Linux; has Ethernet + I/O
–Main board + daughter card = MIPS CPU chip + interfaces
–ISI designs VIRAM daughter card?
–Meeting with ISI soon to discuss

Slide 16 Embedded DRAM in the News
–Sony ISSCC chip (… mm²) with 256 Mbit of on-chip embedded DRAM (8X Emotion Engine)
»0.18-micron design rules
»21.7 x 21.3 mm, containing … million transistors
–2,000-bit internal buses can deliver 48 gigabytes per second of bandwidth
–Demonstrated at SIGGRAPH 2000
–Used in a multiprocessor graphics system?

Slide 17 High Confidence Computing?
–High confidence => a system can be trusted or relied upon?
–You can't rely on a system that's down
–High confidence includes more than availability, but availability is a prerequisite to high confidence?

Slide 18 Goals, Assumptions of Last 15 Years
–Goal #1: Improve performance
–Goal #2: Improve performance
–Goal #3: Improve cost-performance
Assumptions
–Humans are perfect (they don't make mistakes during wiring, upgrade, maintenance, or repair)
–Software will eventually be bug-free (good programmers write bug-free code)
–Hardware MTBF is already very large (~100 years between failures), and will continue to increase

Slide 19 Lessons Learned from Past Projects for High Confidence Computing
Major improvements in hardware reliability
–Disks: from 50,000-hour MTBF in 1990 to 1,200,000 hours in 2000
–PC motherboards: from 100,000 to 1,000,000 hours
Yet everything has an error rate
–Well designed and manufactured HW: >1% fail/year
–Well designed and tested SW: >1 bug / 1000 lines
–Well trained, rested people doing routine tasks: >1% error rate
–Well run colocation site (e.g., Exodus): 1 power failure per year, 1 network outage per year
Components fail slowly
–Disks, memory, and software give indications before they fail (but interfaces don't pass along this information)
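The MTBF numbers above can be translated into rough annual failure rates. A small sketch, using the approximation that the annual rate is hours-per-year divided by MTBF (reasonable while the rate is small):

```python
# Rough conversion from MTBF (hours) to expected failures per unit-year.
HOURS_PER_YEAR = 8760

def annual_failure_rate(mtbf_hours):
    return HOURS_PER_YEAR / mtbf_hours

print(f"{annual_failure_rate(50_000):.1%}")     # 1990 disks: 17.5%
print(f"{annual_failure_rate(1_200_000):.2%}")  # 2000 disks: 0.73%
```

This shows why the "yet everything has an error rate" point holds: even a million-hour MTBF still implies roughly 1% of a large fleet failing each year.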

Slide 20 Lessons Learned from Past Projects for High Confidence Computing
Maintenance of machines (with state) is expensive
–~10X the cost of the HW per year
–Stateless machines can be trivial to maintain (Hotmail)
System administration primarily keeps the system available
–System + clever human = uptime
–Admins also plan for growth, fix performance bugs, do backups
Software upgrades are necessary but dangerous
–SW bugs get fixed and new features added, but at what cost to stability?
–Admins try to skip upgrades, or be the last to adopt one

Slide 21 Lessons Learned from Past Projects for High Confidence Computing
Failures due to people are up, and hard to measure
–VAX crash data from '85 and '93 [Murp95], extrapolated to '01
–HW/OS: 70% of failures in '85, down to 28% in '93; in '01, 10%?
–How do you get an administrator to admit a mistake? (Heisenberg?)

Slide 22 Lessons Learned from Past Projects for High Confidence Computing
Component performance varies
–Disk inner track vs. outer track: 1.8X bandwidth
–Refresh of DRAM
–Daemon processes in nodes of a cluster
–Error correction, retry on some storage accesses
–Maintenance events in switches
–(Interfaces don't pass along this information)
We know how to improve performance (and cost)
–Run the system against a workload, measure, innovate, repeat
–Benchmarks standardize workloads, lead to competition, and evaluate alternatives; they turn debates into numbers

Slide 23 An Approach to High Confidence
"If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time." Shimon Peres, quoted in Rumsfeld's Rules
Rather than aim towards (or expect) perfect hardware, software, and people, assume flaws
Focus on Mean Time To Repair (MTTR) for the whole system, including the people who maintain it
–Unavailability ≈ MTTR / MTBF, so cutting MTTR to 1/10th is just as valuable as a 10X increase in MTBF
–Improving MTTR, and hence availability, should improve the cost of administration/maintenance as well
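The MTTR-vs-MTBF trade-off above can be made concrete. A minimal sketch using the standard definition availability = MTBF / (MTBF + MTTR), with illustrative (not slide-sourced) numbers:

```python
# Availability from MTBF and MTTR (same time units for both).
def availability(mtbf, mttr):
    return mtbf / (mtbf + mttr)

base          = availability(mtbf=1000.0, mttr=1.0)   # starting point
faster_repair = availability(mtbf=1000.0, mttr=0.1)   # 1/10th MTTR
better_hw     = availability(mtbf=10000.0, mttr=1.0)  # 10X MTBF
print(base, faster_repair, better_hw)
```

Under these numbers, `faster_repair` and `better_hw` are equal to four decimal places, which is exactly the slide's claim that a 10X MTTR reduction is as valuable as a 10X MTBF improvement.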

Slide 24 An Approach to High Confidence
Assume we have a clean slate, not constrained by 15 years of cost-performance optimizations
4 parts to Time to Repair:
1) Time to detect the error
2) Time to pinpoint the error ("root cause analysis")
3) Time to choose and try the possible fixes that may solve the error
4) Time to fix the error

Slide 25 An Approach to High Confidence
1) Time to detect errors
–Include interfaces that report faults/errors from components
»May allow the application/system to predict/identify failures
–Periodic insertion of test inputs with known results into the system, vs. waiting for failure reports
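The test-input idea above can be sketched as a health probe. All names here are hypothetical, not from the slides: feed a request with a known answer through a component and flag a fault if the result deviates, instead of waiting for a user-visible failure:

```python
# Hypothetical periodic-probe check: run a known-good input through a
# component and compare against the expected result.
def check_component(compute, probe_input, expected):
    """Return True if the component still answers a known-good probe."""
    try:
        return compute(probe_input) == expected
    except Exception:
        return False  # a crash on the probe also counts as a detected fault

# Example: a component that should double its input.
healthy = check_component(lambda x: x * 2, probe_input=21, expected=42)
broken  = check_component(lambda x: x + 2, probe_input=21, expected=42)
print(healthy, broken)  # True False
```

In a real system the probe would run on a timer, shrinking the time between a fault occurring and being detected.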

Slide 26 An Approach to High Confidence
2) Time to pinpoint the error
–Error checking at the edges of each component
–Design each component so it can be isolated and given test inputs to see if it performs
–Keep a history of failure symptoms/reasons and recent behavior ("root cause analysis")

Slide 27 An Approach to High Confidence
3) Time to try possible solutions
–History of errors/solutions
–Undo of any repair, to allow trial of possible solutions
»Support for snapshots and transactions/logging fundamental in the system
»Since disk capacity and bandwidth are the fastest-growing technologies, use them to improve repair?
»Caching at many levels of the system provides redundancy that may be used for transactions?
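The undo-of-any-repair idea can be sketched as a toy undo log. This is an illustrative API of my own, not the slides' design: every trial fix records its inverse, so a fix that proves wrong can be rolled back:

```python
# Hypothetical undo log: each applied repair records an inverse action,
# and rollback replays the inverses in reverse order.
class UndoLog:
    def __init__(self):
        self._undos = []

    def apply(self, do, undo):
        do()
        self._undos.append(undo)

    def rollback(self):
        while self._undos:
            self._undos.pop()()  # undo most recent action first

config = {"raid_disk": "disk3"}
log = UndoLog()
log.apply(do=lambda: config.update(raid_disk="disk4"),
          undo=lambda: config.update(raid_disk="disk3"))
log.rollback()   # the trial fix proved wrong; restore the prior state
print(config)    # {'raid_disk': 'disk3'}
```

This is the same mechanism transactions and snapshots provide at the system level: a cheap path back to a known-good state while trying candidate fixes.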

Slide 28 An Approach to High Confidence
4) Time to fix the error
–Create repair benchmarks
»Competition leads to improved MTTR
–Include interfaces that allow repair events to be systematically tested
»Predictable fault insertion allows debugging of repair as well as benchmarking MTTR
–Since people make mistakes during repair, provide "undo" for any maintenance event
»Replace the wrong disk in a RAID system on a failure; undo, then replace the bad disk without losing info
»Undo a software upgrade
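A repair benchmark along the lines above can be sketched in a few lines. Everything here is illustrative: inject a known fault at a predictable point, then measure wall-clock time until detection plus repair restores service:

```python
# Hypothetical repair benchmark: inject a fault, then time how long the
# detect/repair loop takes to bring the system back to a healthy state.
import time

def run_repair_benchmark(inject_fault, detect, repair):
    inject_fault()
    start = time.monotonic()
    while not detect():
        repair()
    return time.monotonic() - start  # the measured "time to repair"

state = {"ok": True}
mttr = run_repair_benchmark(
    inject_fault=lambda: state.update(ok=False),
    detect=lambda: state["ok"],
    repair=lambda: state.update(ok=True),
)
print(mttr >= 0.0)  # True
```

Making fault insertion predictable is what turns repair from an ad-hoc event into something that can be benchmarked and improved competitively.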

Slide 29 Other Ideas for High Confidence
Continuous preventative maintenance tasks?
–~10% of resources spent repairing errors before they become failures
–Resources reclaimed when a failure occurs, to mask the performance impact of repair?
Sandboxing to limit the scope of an error?
–Reduces error propagation, since there can be a large delay between fault and failure discovery
Processor-level support for transactions?
–Today, on failure, we try to clean up shared state
–Common failures: memory not freed or freed repeatedly, inconsistent data structures, forgotten latch releases
–Transactions make failure rollback reliable?

Slide 30 Other Ideas for High Confidence
Use interfaces that report and expect performance variability, vs. expecting consistency?
–Especially when trying to repair
–Example: work allocated per server based on recent performance vs. based on expected performance
Queued interfaces and flow control to accommodate performance variability and failures?
–Example: queued communication vs. barrier/bulk-synchronous communication for a distributed program
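The queued-interface point can be illustrated with a bounded queue. A minimal sketch (my example, not from the slides): a producer and a temporarily slow or recovering consumer proceed at their own rates, where a barrier would stall everyone on the slowest participant:

```python
# A bounded queue decouples producer and consumer rates; a full queue
# back-pressures the producer instead of failing outright.
from queue import Queue

q = Queue(maxsize=4)

for item in range(4):
    q.put(item)        # producer enqueues without waiting for the consumer

consumed = []
while not q.empty():   # consumer drains whenever it is ready
    consumed.append(q.get())
print(consumed)  # [0, 1, 2, 3]
```

The flow control (the `maxsize` bound) is what makes the interface tolerate variability: a slow consumer slows the producer gracefully rather than causing unbounded buildup or a synchronous stall.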

Slide 31 Conclusion
New foundation to reduce MTTR
–Cope with the fact that people, SW, and HW fail (Peres)
–Transactions/snapshots to undo failures and bad repairs
–Repair benchmarks to evaluate MTTR innovations
–Interfaces to allow error insertion and input insertion, and to report module errors and module performance
–Module I/O error checking and module isolation
–Log errors and solutions for root cause analysis; pick an approach to trying to solve the problem
Significantly reducing MTTR => increased availability => a foundation for High Confidence Computing