Retrospective on the VIRAM-1 Design Decisions
Christoforos E. Kozyrakis
IRAM Retreat, January 9, 2001


What We Probably Got Right
Low power design approach
Use of a commercial MIPS core
Permutation instructions
Fixed-point arithmetic model
Single load-store unit
Dropping of the network interface
Testing infrastructure

Low Power Design Approach
Two design alternatives for VIRAM-1
  – 200 MHz, 2 W, 4 vector lanes
  – 500 MHz, 10 W (?), 4-8 vector lanes (?)
Low power was the right choice because
  – Low power is important for embedded and multimedia applications
  – It is easier to design a low power processor than a high frequency one
  – High power consumption would severely interfere with DRAM operation
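The two design points above can be compared with a rough back-of-the-envelope calculation. The sketch below uses only the numbers on the slide; the 500 MHz figures are marked "(?)" in the source, so they are guesses rather than measurements. The metric (vector lane-cycles per joule) and the function name are assumptions made for illustration, not something the slides define.

```python
# Rough comparison of the two VIRAM-1 design points listed on the slide.
# The 500 MHz numbers carry "(?)" in the source and are treated as guesses.

def lane_cycles_per_joule(freq_mhz: float, power_w: float, lanes: int) -> float:
    """Vector lane-cycles delivered per joule (a crude peak-throughput proxy)."""
    return freq_mhz * 1e6 * lanes / power_w

low_power = lane_cycles_per_joule(200, 2, 4)
fast_4 = lane_cycles_per_joule(500, 10, 4)
fast_8 = lane_cycles_per_joule(500, 10, 8)

for name, value in [
    ("200 MHz, 2 W, 4 lanes", low_power),
    ("500 MHz, 10 W, 4 lanes", fast_4),
    ("500 MHz, 10 W, 8 lanes", fast_8),
]:
    print(f"{name}: {value / 1e6:.0f} M lane-cycles per joule")
```

On this crude metric the low-power point is no worse than the high-frequency one, even with 8 lanes. The metric ignores latency, memory behavior, and design effort, so it only supports the slide's argument loosely.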

Use of Commercial MIPS Core
Scalar core alternatives
  – Custom design optimized for a vector unit
  – Commercial core with generic coprocessor interface
The MIPS m5Kc core was a great choice because
  – It is a flexible, synthesizable design with a lot of documentation and support
  – It comes with an RTL simulation environment, which we reused for VIRAM-1
  – It allowed us to work on a demo system based on a MIPS daughter-card and demo board

Other Issues We Got Right
Simple instructions for intra-register permutations
  – Allow the vectorization of reductions and FFT
  – Simple implementation compared to a general permutation
Single load-store unit
  – Not sufficient memory bandwidth for two units
  – Address calculation and translation resources are expensive
  – Not obviously useful for most media applications
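To illustrate how a simple intra-register permutation lets a reduction be vectorized, here is a minimal Python sketch. It assumes a power-of-two vector length and models the permutation as "move the upper half of the active elements down to the low positions". The helper name `take_upper_half` is hypothetical and is not taken from the VIRAM-1 instruction set.

```python
def take_upper_half(v, active_len):
    """Hypothetical permutation: return the upper half of the first
    `active_len` elements, aligned to position 0 of a new vector."""
    half = active_len // 2
    return v[half:active_len]

def vector_sum(values):
    """Sum a power-of-two-length vector in log2(n) element-wise add steps."""
    v = list(values)
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    while n > 1:
        upper = take_upper_half(v, n)
        n //= 2
        # One element-wise vector add over the n elements that remain active.
        v = [v[i] + upper[i] for i in range(n)]
    return v[0]

print(vector_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

Each loop iteration stands for one permutation plus one vector add, so an 8-element sum takes 3 vector steps instead of 7 dependent scalar adds.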

Other Issues We Got Right
Dropping of the network interface
  – Not necessary for embedded/multimedia systems
  – Would introduce significant design complexity
Testing infrastructure
  – Highly automated and easy to use for developing tests and verifying the complete VIRAM-1 design

What We Probably Got Wrong
Insufficient benchmarking at early project stages
Support for 64-bit data-types
Lack of sub-banks in DRAM macros
Dropping the decoupled pipeline
Use of a crossbar for memory transfers
Too much support for arithmetic exceptions
Too much support for conditional execution

Insufficient Benchmarking
Limited benchmarking was performed early enough to affect major design decisions
  – Previous experience and intuition were used in several cases
Reasons for limited benchmarking
  – Lack of compiler
  – Lack of flexible performance model
  – Lack of manpower and time
Some of the following issues could probably have been avoided if we had done more benchmarking

Support for 64-bit Data Types
VIRAM-1 supports 64-bit integer operations
Excluding encryption, few multimedia applications require 64-bit operations
Benefits from not supporting 64-bit operations
  – Large area savings from datapaths and pipeline registers
  – Large wiring savings from reduced width of data busses
  – Fewer modes to support and verify

Lack of DRAM Sub-banks
The DRAM macro used has a single bank
  – No overlapping of accesses to different rows is allowed
Significant performance bottleneck for applications with strided or random accesses
  – 4 addresses per cycle for 8 banks with a 5-cycle random access cycle
  – Bank conflicts reduce random bandwidth even further
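The bandwidth numbers above can be worked through with a small model. With 8 banks that are each busy for 5 cycles per random access, at most 8 / 5 = 1.6 accesses can start per cycle, well below the 4 addresses generated per cycle. The Python sketch below is a simplified illustration, not the VIRAM-1 memory controller: it assumes word-interleaved banks (bank = element index times stride, modulo 8) and strictly in-order issue, neither of which is stated in the slides.

```python
def cycles_for_stream(n_accesses, stride, banks=8, busy=5, issue_width=4):
    """Cycles until a strided access stream has completed, issuing in order.

    Assumptions (not from the slides): word-interleaved bank mapping and
    in-order issue that stalls whenever the target bank is still busy.
    """
    free_at = [0] * banks        # first cycle each bank can start a new access
    cycle = 0
    issued_this_cycle = 0
    for i in range(n_accesses):
        bank = (i * stride) % banks
        if issued_this_cycle == issue_width:
            # All address slots of this cycle are used; move to the next cycle.
            cycle += 1
            issued_this_cycle = 0
        if free_at[bank] > cycle:
            # Bank conflict: stall until the bank finishes its previous access.
            cycle = free_at[bank]
            issued_this_cycle = 0
        free_at[bank] = cycle + busy
        issued_this_cycle += 1
    return max(free_at)

peak_rate = 8 / 5   # upper bound on accesses started per cycle
for stride in (1, 2, 8):
    cycles = cycles_for_stream(64, stride)
    print(f"stride {stride}: {cycles} cycles, {64 / cycles:.2f} accesses/cycle")
```

In this model a unit-stride stream approaches the 1.6 accesses/cycle bound, while a stride of 8 sends every access to the same bank and serializes completely, which is the kind of conflict the slide refers to.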

Other Issues We Got Wrong
Dropping the decoupled pipeline
  – The “delayed pipeline” was preferred to a decoupled one due to complexity and power advantages, despite the performance issues
  – Due to the length of the pipeline and the lack of sub-banks, it is not obvious that this was a wise decision
Use of a crossbar for memory transfers
  – The memory crossbar is the weakest design component in terms of scalability and flexibility
  – Alternative approaches (e.g. ring) were probably worth a closer examination before being rejected

Other Issues We Got Wrong
Too much support for arithmetic exceptions
  – VIRAM-1 includes extensive support for software speculation, user-level handlers, precise execution (slower) for arithmetic exceptions
  – Many of these features will never be used by the compiler, multimedia applications, or system software
Too much support for conditional execution
  – VIRAM-1 implements all possible alternatives for vector conditional execution (masked instructions, masked merger, scatter-gather, compress-expand)
  – Some of them are quite complex to implement and not obviously needed for multimedia codes
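Two of the conditional-execution alternatives listed above, masked instructions and compress-expand, can be contrasted with a short Python sketch. It is a functional model only, showing that both compute the same result. All function names are hypothetical and do not correspond to VIRAM-1 instructions.

```python
def masked_op(v, mask, fn):
    """Masked execution: the vector keeps its full length and a result is
    written only where the mask bit is set."""
    return [fn(x) if m else x for x, m in zip(v, mask)]

def compress(v, mask):
    """Pack the elements selected by the mask into a dense, shorter vector."""
    return [x for x, m in zip(v, mask) if m]

def expand(dense, mask, background):
    """Place dense elements back at the masked positions of `background`."""
    it = iter(dense)
    return [next(it) if m else b for m, b in zip(mask, background)]

def compress_expand_op(v, mask, fn):
    """Compress-expand execution: operate only on the packed elements."""
    dense = [fn(x) for x in compress(v, mask)]
    return expand(dense, mask, v)

v = [1, -2, 3, -4]
mask = [x < 0 for x in v]          # select the negative elements
print(masked_op(v, mask, abs))
print(compress_expand_op(v, mask, abs))
```

The two approaches trade off differently: masking operates over the full vector length regardless of how many mask bits are set, while compress-expand works on a shorter vector at the cost of two extra data movements.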

What May Be Too Early To Call
Full-custom design of integer datapaths
  – Optimal area and power consumption but requires significant design time
  – Maybe we could use an ASIC approach based on tiling specialized macro-cells or library components
Use of two multipliers per vector lane
  – Most applications don’t have such a high ratio of multiply or multiply-add operations
  – Consumes a significant amount of area