Technische universiteit eindhoven ‘Nothing is built on stone; all is built on sand, but we must build as if the sand were stone.’ Jorge Luis Borges (Argentine.

Presentation transcript:

Technische Universiteit Eindhoven
Department of Electrical Engineering, Electronic Systems
Modeling of Architectures
Embedded Computer Architecture 5KK73
Henk Corporaal, Bart Mesman, Hamed Fatemi
2011
‘Nothing is built on stone; all is built on sand, but we must build as if the sand were stone.’ Jorge Luis Borges (Argentine writer)

5kk73 Electronic Systems
Outline
We will look at models for area, delay and energy:
Processor structure
Register files
  Register cell
  Model (area, power, delay) details for several register file configurations
Apply this to the Imagine architecture
  Stream register file (SRF)
  Network

Processor
Single processor: Instruction Memory (IM), Controller, Processing Element (PE), Register File (RF), ALU, Data Memory (DM)
SIMD: multiple PEs
VLIW: multiple ALUs
Multi-processor: several processors connected by a bus or network
Figure: block diagram with IM, Controller, RF, ALU, DM, Network and PE

Register File (RF): Area model
Assume p = number of ports.
For a large RF, the row decoder is small compared to the cell area.
The area of 1 bit is w*h (in tracks).
One wordline and one bitline are needed per port.
If p is large, the per-port wordlines and bitlines dominate the cell area.
Figure: schematic of 1 register cell, 1 bit of size w*h
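
The area model above can be sketched in a few lines of Python. This is a minimal illustration, not the formula from the original slide (which the transcript does not preserve). It assumes that each port adds one wordline track to the cell height and one bitline track to the cell width, which agrees with the bit-line length (h+p) used in the power model. The function names and the base cell size of 4 by 4 tracks are hypothetical.

```python
def register_cell_area(p, w, h):
    """Area of one register bit cell in square tracks.

    Assumption: a base cell of w by h tracks grows by one bitline track
    in width and one wordline track in height for every port.
    """
    return (w + p) * (h + p)


def register_file_area(p, R, b, w, h):
    """Area of R registers of b bits each; the row decoder is ignored
    because it is small compared to the cell array for a large RF."""
    return b * R * register_cell_area(p, w, h)


if __name__ == "__main__":
    # Hypothetical 4x4-track base cell, 32 registers of 32 bits.
    for ports in (1, 3, 12, 48):
        print(ports, register_file_area(ports, R=32, b=32, w=4, h=4))
```

With these assumptions the cell area approaches p^2 once p is much larger than w and h, which is why the number of ports, not the storage itself, sets the area of a heavily multi-ported register file.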

Register file (RF): Delay model
Delay (d) consists of wire propagation delay and fan-in/fan-out delay.
Delay ~ wire length ~ number of connected cells.
R = number of registers, each b bits wide, so the total number of bits is bR.
Assume a square bit layout.
Note: for N FUs (ALUs), p ~ 3N and R ~ N, so d ~ N^(3/2) (for large p, wiring dominates).
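
To check the N^(3/2) trend numerically, the sketch below treats delay as proportional to the length of the wires crossing a square array of bR cells. It is an illustration under stated assumptions: the cell pitch of (w+p) by (h+p) tracks, the base cell size, the number of registers per ALU and the function name are all hypothetical, and constant factors are ignored.

```python
import math


def central_rf_delay(n_alus, regs_per_alu=16, bits=32, w=4.0, h=4.0):
    """Relative wire delay of a central register file.

    Assumptions: p = 3N ports, R = r*N registers, square layout with
    sqrt(b*R) cells per side, cell pitch (w+p) by (h+p) tracks.
    """
    ports = 3 * n_alus
    registers = regs_per_alu * n_alus
    cells_per_side = math.sqrt(bits * registers)
    # Longest path: one wordline across the array plus one bitline down it.
    return cells_per_side * ((w + ports) + (h + ports))


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, round(central_rf_delay(n) / central_rf_delay(1), 1))
```

Quadrupling a large N multiplies the result by roughly 4^(3/2) = 8: the side of the array grows as N^(1/2) and the cell pitch grows as N.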

Register file (RF): Power model
Register file power (P) is proportional to the capacitance that must be switched for each access.
In each access every bit line and one word line switch, so the bit-line capacitance dominates.
Each port drives (bR)^(1/2) bit lines.
Each bit line has length (h+p)(bR)^(1/2).
If p is large, power is dominated by wire capacitance.
Note: for N FUs (ALUs), p ~ 3N and R ~ N, so P ~ N^3.
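
The N^3 result follows from multiplying the three quantities on this slide: p ports, (bR)^(1/2) bit lines per port, and a bit-line length of (h+p)(bR)^(1/2). A minimal sketch, assuming that every port is accessed each cycle and that capacitance is proportional to wire length; the function name and the default values are hypothetical.

```python
import math


def rf_switched_capacitance(p, R, b, h=4.0):
    """Relative bit-line capacitance switched per cycle, all p ports active."""
    bit_lines_per_port = math.sqrt(b * R)
    bit_line_length = (h + p) * math.sqrt(b * R)  # in tracks
    return p * bit_lines_per_port * bit_line_length


def central_rf_power(n_alus, regs_per_alu=16, bits=32, h=4.0):
    """Central RF with p = 3N ports and R = r*N registers."""
    return rf_switched_capacitance(3 * n_alus, regs_per_alu * n_alus, bits, h)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, central_rf_power(n) / central_rf_power(1))
```

The product simplifies to p(h+p)bR, so with p ~ 3N and R ~ N each of the three factors p, (h+p) and R contributes one power of N.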

Register File organization
Processor with a one-level register file:
Central (shared) register file
DRF (distributed register file)
Figure: central and distributed organizations, each with ALU 1 ... ALU N

Comparing the area models of the central and distributed RF
Central (shared) RF: 2 read ports and 1 write port per ALU
  R = rN: number of registers of b bits
  r: number of registers per ALU
  N: number of ALUs
DRF: only 2 ports per RF, one read and one write
  This would give A(1 RF) ~ N
  The area of the switch has the same cost complexity
  Square layout and organization of the DRF, including a 2N*N crossbar
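
The difference between the two organizations can be made concrete with a rough numerical comparison. This sketch is an assumption-laden illustration, not the model from the slide: it reuses a cell of (w+p) by (h+p) tracks, takes one small 2-port register file per ALU input (2N in total, inferred from the 2N*N crossbar), and models the crossbar as N buses of b wires crossing 2N buses of b wires. All names and default values are hypothetical.

```python
def central_rf_area(n_alus, regs_per_alu=16, bits=32, w=4.0, h=4.0):
    """Central RF: p = 3N ports, R = r*N registers of b bits."""
    ports = 3 * n_alus
    registers = regs_per_alu * n_alus
    return bits * registers * (w + ports) * (h + ports)


def distributed_rf_area(n_alus, regs_per_rf=8, bits=32, w=4.0, h=4.0):
    """DRF: 2N small 2-port register files plus a 2N*N crossbar."""
    rf_count = 2 * n_alus
    one_rf = bits * regs_per_rf * (w + 2) * (h + 2)   # constant in N
    crossbar = (n_alus * bits) * (2 * n_alus * bits)  # wire tracks squared
    return rf_count * one_rf + crossbar


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, central_rf_area(n), distributed_rf_area(n))
```

Under these assumptions the central RF area grows as N^3, while the DRF is dominated by the crossbar and grows as N^2, that is, about N per register file.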

Delay and power models of the central versus distributed RF
Assume N ALUs.
Central RF: #registers R = rN, #ports p = 3N
  For large N: d ~ N^(3/2) and P ~ N^3 (from the delay and power models)
DRF: constant #registers per ALU, #ports p = 2 (also constant!)
  The DRF has a fixed delay and power (per RF)
  Wire propagation determines delay and power for large N

Register File
Register (memory) storage and communication between ALUs are critical for area, energy and performance in a media processor.
Hierarchical register storage

2-level register files (hierarchical)
Central: RF1 serves the ALUs, while RF2 is used to cover the memory latency.
The overall area trend is the same as with a one-level RF.
Figure: central and DRF organizations, each with ALU 1 ... ALU N, RF1 (level 1) and RF2 (level 2)

Register Files
Processor with stream register files:
Replace each port into the memory staging RF with a stream buffer.
All stream buffers share a single port into the memory staging RF, allowing that single physical port to act as many logical ports.
Figure: central organization with ALU 1 ... ALU N
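
How one physical port can act as many logical ports is easiest to see in a small simulation. The sketch below is a behavioural illustration only: the round-robin arbitration, the block size of four words and the function name are assumptions, not details taken from the Imagine design. Each cycle the single port refills at most one empty stream buffer with a whole block, while every buffer can deliver one word to its ALU.

```python
from collections import deque


def run_stream_buffers(streams, block_words=4, cycles=16):
    """Simulate read-side stream buffers sharing one staging-RF port.

    streams: list of lists, the words stored in the staging RF per stream.
    Returns the words delivered on each logical port, in order.
    """
    cursors = [0] * len(streams)            # next unread word per stream
    buffers = [deque() for _ in streams]    # one stream buffer per logical port
    delivered = [[] for _ in streams]
    turn = 0                                # round-robin pointer (assumed policy)
    for _ in range(cycles):
        # The single physical port serves at most one buffer per cycle,
        # moving a whole block of words in that one access.
        for offset in range(len(streams)):
            s = (turn + offset) % len(streams)
            if not buffers[s] and cursors[s] < len(streams[s]):
                chunk = streams[s][cursors[s]:cursors[s] + block_words]
                buffers[s].extend(chunk)
                cursors[s] += len(chunk)
                turn = (s + 1) % len(streams)
                break
        # Every logical port can hand one word per cycle to its ALU.
        for s, buf in enumerate(buffers):
            if buf:
                delivered[s].append(buf.popleft())
    return delivered


if __name__ == "__main__":
    print(run_stream_buffers([[1, 2, 3, 4], [10, 20, 30, 40]]))
```

Because one wide access delivers several words, a buffer needs the shared port only once every block_words cycles, so the port can be time-multiplexed among the logical ports.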

Register Files
DRF: The payoff of the transformation into a stream architecture is that we can achieve an area proportional to N^2, since RF2 (memory storage) needs only 1 port. We also have to add the area of the stream buffers, which grows as N^2 with a very small constant.
Figure: DRF organization with ALU 1 ... ALU N

Results: area per ALU (normalized to 1 ALU)

Results: local delay

Results: power overhead

Imagine Architecture
Figures: die photo of Imagine; cell placement of Imagine

Imagine Floorplan
22 million transistors, 500 MHz
Area, energy and delay models
Clusters, microcontroller, SRF, network interface

Stream Register File

Network
Area of network grows with (like DRF switch):
More details in the Khailany paper [2003]

Exploration: intra-cluster scaling

Exploration: inter-cluster scaling

End
More details:
Scott Rixner, William J. Dally, Brucek Khailany, Peter Mattson, Ujval J. Kapasi, and John D. Owens. Register Organization for Media Processing. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA), pages 375–386, Toulouse, France, January. IEEE Computer Society.
Brucek Khailany, William Dally, Scott Rixner, Ujval Kapasi, John Owens, and Brian Towles. Exploring the VLSI Scalability of Stream Processors. In Proceedings of the Ninth Symposium on High Performance Computer Architecture (HPCA), pages 153–164, Anaheim, California, USA, February 2003. IEEE Computer Society.