Tile Size Selection for Low-Power Tile-based Architectures Michael Brown.


Observation
● Some applications parallelize across multiple cores better than others

Goal
● Given one or more applications and an embedded performance target
● Find the computational granularity that minimizes power consumption

Refresher on Embedded DSPs
● No need to exceed the performance target
● Area is not a concern (in the paper)
● Want the lowest possible power

Steps
● Generate a set of tile architectures
● Determine power efficiency of each
● Parallelize and profile algorithm
● Compare costs

Computational Granularity
● Defined as the maximum number of arithmetic operations per cycle (where sources are local)
● Example configurations, written as (# of tiles):(# of operations/cycle/tile): 1:32, 4:8, 32:1
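The three configurations above all multiply out to 32 operations per cycle. A minimal sketch of how such a design space could be enumerated follows. It assumes a total width of 32 and power-of-two tile counts, neither of which the slides state explicitly; `tile_configs` is a hypothetical helper name.

```python
def tile_configs(total_ops_per_cycle=32):
    """Return (tiles, ops_per_cycle_per_tile) pairs with a constant product.

    Assumption: tile counts are powers of two and the total width is 32,
    inferred from the 1:32, 4:8 and 32:1 examples on the slide.
    """
    configs = []
    tiles = 1
    while tiles <= total_ops_per_cycle:
        configs.append((tiles, total_ops_per_cycle // tiles))
        tiles *= 2
    return configs


if __name__ == "__main__":
    for tiles, width in tile_configs():
        print(f"{tiles}:{width}")  # prints 1:32, 2:16, 4:8, 8:4, 16:2, 32:1 on separate lines
```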

Generating Architectures
● CMP with 1-32 cores (tiles) that maintains a constant computational width
● Used Synchroscalar to create tiles based on the Blackfin DSP (Analog Devices)

Determine Power Efficiency
● Large tiles have high switching capacitance per operation
● Small tiles have poor data locality, requiring extra cycles

Parallelize Algorithms
● Choose static media-based applications
● Static behavior allows a data flow graph to be built that maximizes parallelism
● The data flow graph also allows profiling of communication between parallel elements (recursive bisection algorithm)
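As an illustration of recursive bisection on a data flow graph, the sketch below repeatedly splits the node set into two balanced halves with the smallest communication weight crossing the split. It uses an exhaustive search that only suits tiny graphs. The slides do not say which bisection heuristic was used, so the search method, the function names, and the example graph are all assumptions.

```python
from itertools import combinations


def cut_size(edges, part_a):
    """Total weight of edges with exactly one endpoint in part_a."""
    return sum(w for (u, v), w in edges.items() if (u in part_a) != (v in part_a))


def bisect(nodes, edges):
    """Exhaustively find a balanced two-way split with minimum cut weight."""
    nodes = sorted(nodes)
    node_set = set(nodes)
    # Only edges fully inside this sub-graph matter for this split.
    local = {e: w for e, w in edges.items() if e[0] in node_set and e[1] in node_set}
    best = None
    for combo in combinations(nodes, len(nodes) // 2):
        part_a = set(combo)
        cost = cut_size(local, part_a)
        if best is None or cost < best[0]:
            best = (cost, part_a)
    part_a = best[1]
    return sorted(part_a), sorted(node_set - part_a)


def recursive_bisection(nodes, edges, num_tiles):
    """Split nodes into groups, one per tile (num_tiles assumed a power of two)."""
    if num_tiles == 1 or len(nodes) <= 1:
        return [sorted(nodes)]
    left, right = bisect(nodes, edges)
    return (recursive_bisection(left, edges, num_tiles // 2)
            + recursive_bisection(right, edges, num_tiles // 2))


def inter_tile_traffic(groups, edges):
    """Sum the weights of edges whose endpoints land on different tiles."""
    tile_of = {n: i for i, group in enumerate(groups) for n in group}
    return sum(w for (u, v), w in edges.items() if tile_of[u] != tile_of[v])


if __name__ == "__main__":
    # Hypothetical data flow graph: edge weights are values exchanged per iteration.
    edges = {("a", "b"): 5, ("c", "d"): 5, ("b", "c"): 1}
    for tiles in (1, 2, 4):
        groups = recursive_bisection("abcd", edges, tiles)
        print(tiles, groups, inter_tile_traffic(groups, edges))
```

In this example, two tiles keep the heavy `a-b` and `c-d` edges local, while four tiles force every edge across a tile boundary.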

Compare Costs
● Cost of hardware
  – SRAM, control logic, computational units
  – Register file
  – Interconnect
● Cost of software
  – Inter-tile communication

Compare Costs - Hardware
● SRAM, control logic, computational units
  – Area and power grow linearly with tile size
● Register file
  – Capacity and number of ports grow linearly
  – Area and power grow quadratically
● Interconnect wiring
  – Area and power grow quadratically, like the register file
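The scaling trends above can be combined into a rough cost model. The sketch below assumes a fixed total width of 32 operations per cycle and uses made-up coefficients (`k_linear`, `k_regfile`, `k_wire`); none of these numbers come from the slides. It also leaves out inter-tile interconnect, which would add cost as the tile count grows.

```python
def hardware_cost(tiles, total_width=32, k_linear=1.0, k_regfile=0.1, k_wire=0.1):
    """Relative hardware cost (area or power) of a whole chip.

    Assumptions (illustrative only):
      - each tile has width = total_width / tiles operations per cycle
      - SRAM, control logic and computational units scale linearly with width
      - register file and intra-tile wiring scale quadratically with width
    """
    width = total_width / tiles
    per_tile = k_linear * width + (k_regfile + k_wire) * width ** 2
    return tiles * per_tile


if __name__ == "__main__":
    for tiles in (1, 4, 32):
        print(f"{tiles:2d} tiles: relative cost {hardware_cost(tiles):.1f}")
```

Under these assumptions the linear term sums to the same total for every configuration, so any difference between configurations comes from the quadratic register file and wiring terms, which shrink as tiles get smaller.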

Compare Costs - Software
● Case 1: minimal communication
  – Power savings from smaller tiles carry through as system power savings
● Case 2: heavy communication
  – Power savings from smaller tiles are lost to the extra cycles required to maintain cache coherency between tiles
  – A higher frequency is needed to finish in the same wall-clock time
  – The higher frequency in turn requires voltage scaling
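The two cases can be illustrated with the standard dynamic power relation P = C * V^2 * f. The sketch below assumes that supply voltage must rise in proportion to frequency, which is a rough first-order simplification and not a claim from the slides. `cap_ratio` and `comm_overhead` are hypothetical inputs.

```python
def relative_power(cap_ratio, comm_overhead):
    """Power of a small-tile design relative to a large-tile baseline.

    cap_ratio:     switched capacitance per operation relative to the baseline
                   (below 1.0 means the smaller tiles are more efficient)
    comm_overhead: fraction of extra cycles spent on inter-tile communication

    Assumption: voltage scales linearly with frequency, so P ~ C * V^2 * f
    grows with the cube of the frequency increase.
    """
    freq_ratio = 1.0 + comm_overhead  # run faster to hit the same wall-clock time
    volt_ratio = freq_ratio           # assumed linear voltage/frequency scaling
    return cap_ratio * volt_ratio ** 2 * freq_ratio


if __name__ == "__main__":
    # Case 1: minimal communication keeps the capacitance savings.
    print(relative_power(cap_ratio=0.5, comm_overhead=0.0))
    # Case 2: heavy communication can erase the savings entirely.
    print(relative_power(cap_ratio=0.5, comm_overhead=0.3))
```

With these illustrative numbers, halving the switched capacitance halves the power when there is no overhead, but a 30% cycle overhead pushes the result above the baseline.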

Compare Costs

Comparing Additional Parameters - Communication
(Figure: application communication for differing interconnects and tile sizes)

Comparing Additional Parameters - Power
(Figure: application power usage for differing interconnects and tile sizes)

Results