A 167-processor 65 nm Computational Platform with Per-Processor Dynamic Supply Voltage and Dynamic Clock Frequency Scaling Dean Truong, Wayne Cheng, Tinoosh.

Presentation transcript:

A 167-processor 65 nm Computational Platform with Per-Processor Dynamic Supply Voltage and Dynamic Clock Frequency Scaling
Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine Watnik, Paul Mejia, Anh Tran, Jeremy Webb, Eric Work, Zhibin Xiao, and Bevan Baas
VLSI Computation Lab, University of California, Davis

Outline
Background and the First Generation AsAP
The Second Generation AsAP
  – Processors and Shared Memories
  – On-chip Communication
  – DVFS
Analysis and Summary

Project Motivation
Fully programmable and reconfigurable architecture
High energy efficiency and performance
Exploit task-level parallelism in:
  – Digital Signal Processing
  – Multimedia
[Figure: a Baseband Receiver]

Asynchronous Array of Simple Processors (AsAP)
Key Ideas:
  – Programmable, small, and simple fine-grained cores
  – Small local memories sufficient for DSP kernels
  – Globally Asynchronous and Locally Synchronous (GALS) clocking
      Independent clock frequencies on every core
      Local oscillator halts when processor is idle
[Figure: processor block diagram showing Osc, IMem, Datapath, DMem]

Asynchronous Array of Simple Processors (AsAP)
Key Ideas, cont'd:
  – 2D mesh, circuit-switched network architecture
      Nearest-neighbor communication only
      Low area overhead
      Easily scalable array
  – Increased tolerance to process variations
36-processor fully functional chip: 0.18 µm, V, 0.66 mm² per processor, a/g tx consumes MHz
[ISSCC 06, HotChips 06, IEEE Micro 07, TVLSI 07, JSSC 08,…]

Outline
Background and the First Generation AsAP
The Second Generation AsAP
  – Processors and Shared Memories
  – On-chip Communication
  – DVFS
Analysis and Summary

New Challenges Addressed
Reduction in the power dissipation of
  – Lightly-loaded processors (lowering Vdd)
  – Unused processors (leakage)
Achieving very high efficiencies and speed on common demanding tasks such as FFTs, video motion estimation, and Viterbi decoding
Larger, area-efficient on-chip memories
Efficient, low-overhead communication between distant processors

167-processor Computational Platform
Key features
  – 164 Enhanced programmable processors
  – 3 Dedicated-purpose processors
  – 3 Shared memories
  – Long-distance circuit-switched communication network
  – Dynamic Voltage and Frequency Scaling (DVFS)

Homogeneous Processors
164 in-order, single-issue, 6-stage processors
  – 16-bit datapath with RISC-like instructions
  – 128x16-bit data memory (DMEM)
  – 128x35-bit instruction memory (IMEM)
  – Two 64x16-bit FIFOs for inter-processor communication
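As a quick check on how small the per-processor storage is, the memory dimensions listed on this slide can be totalled. This is only arithmetic on the stated sizes; the dictionary keys and function names are illustrative, not names from the design.

```python
# Per-processor storage, taken from the slide's dimensions (words x bits per word).
MEMORIES = {
    "DMEM": (128, 16),
    "IMEM": (128, 35),
    "FIFO 0": (64, 16),
    "FIFO 1": (64, 16),
}


def bits(words, width):
    """Capacity of one memory in bits."""
    return words * width


def total_bits(memories=MEMORIES):
    """Total storage across all memories in bits."""
    return sum(bits(words, width) for words, width in memories.values())


if __name__ == "__main__":
    for name, (words, width) in MEMORIES.items():
        print(f"{name}: {bits(words, width)} bits")
    print(f"total: {total_bits()} bits = {total_bits() // 8} bytes")
```

Under these figures each processor holds 8576 bits (1072 bytes) of local storage, more than half of it instruction memory.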

Homogeneous Processors
Over 60 basic instructions
  – Add, Sub, Logic, Multiply, MAC, Branch, …
New instructions and features
  – Min/Max, Byte-Sub/Add, Absolute value, Fixed-to-Float conversion assist
  – Jump/Return (function support)
  – Zero Overhead Looping (block repeat)
  – Conditional Execution (predication)
  – Block Floating Point
Floating point CORDIC square root requires 2.9x fewer cycles compared to first generation AsAP
Preliminary results from one chip: 1.2 GHz, 59 mW, 1.3 V, 100% active MAC/ALU operations

Fast Fourier Transform (FFT)
Continuous flow architecture with a single radix-4,2 butterfly
Runtime configurable from 16-pt to 4096-pt transforms, FFT and IFFT
760, pt complex 989 MHz, 1.3 V
1.01 mm²
Preliminary measurements functional at 866 MHz, V
[Figure: layout, 1.23 mm × 0.82 mm, with blocks labeled MEM, O, F]

Viterbi Decoder
8 Add-Compare-Select (ACS) units
Highly configurable
  – Up to 32 different rates, including 1/2 and 3/4
  – Decode codes up to constraint length
MHz, 1.3 V for rate = 1/
mm²
Preliminary measurements functional at 894 MHz, V
[Figure: layout, 0.41 mm dimension shown, with blocks labeled MEM, F, O]

Motion Estimation for Video Encoding
Supports a number of fixed and programmable search patterns
Supports all H.264 specified block sizes within a 48x48 search range
14 billion 880 MHz, 1.3 V; supports 1080p 30fps
0.67 mm²
Preliminary measurements functional at 938 MHz, V
[Figure: layout, 0.82 mm dimension shown, with blocks labeled O, MEM, F]

Shared Memories
Ports for up to four processors (two connected in this chip) to directly connect to the block, which provides
  – Port priority
  – Port request arbitration
  – Programmable address generation supporting multiple addressing modes
Uses a 16 KByte single-ported SRAM
1.28 GHz operation, 1.3 V
  – One read or write per cycle
  – 20.5 Gbps peak throughput
0.34 mm²
Preliminary measurements functional at 1.3 GHz, V
[Figure: layout, 0.41 mm × 0.82 mm, with blocks labeled SRAM, F, O]
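The peak throughput figure follows from one access per cycle at 1.28 GHz, if each access moves one 16-bit word. The 16-bit word width is an assumption here, taken from the 16-bit datapath and link data bus described on other slides; the function name is illustrative.

```python
def peak_throughput_gbps(clock_ghz, word_bits=16, accesses_per_cycle=1):
    """Peak throughput in Gbps for a memory moving word_bits per access.

    word_bits=16 is an assumption based on the 16-bit datapath; the slide
    itself only states one read or write per cycle.
    """
    return clock_ghz * word_bits * accesses_per_cycle


if __name__ == "__main__":
    # 1.28 GHz x 16 bits x 1 access per cycle = 20.48 Gbps, i.e. about 20.5 Gbps
    print(peak_throughput_gbps(1.28))
```

The result, 20.48 Gbps, rounds to the 20.5 Gbps quoted on the slide.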

Inter-Processor Communication
Circuit-switched source-synchronous communication
  – Each link has a clk, 16-bit data bus, valid, and request
  – Core can
      Write to any combination of the 8 outputs under software control
      Read from any 2 of the 8 inputs using statically configured FIFOs
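The write and read rules above can be pictured with a small behavioral model. This is a minimal sketch, not the chip's logic: the `Core` class, its method names, the output bit mask, and the choice to stall a write when any selected destination FIFO is full are all assumptions made for illustration. Only the port counts, the 16-bit word, and the 64-word FIFO depth come from the slides.

```python
from collections import deque

NUM_PORTS = 8       # 8 inputs and 8 outputs per core (from the slide)
FIFO_DEPTH = 64     # 64-word FIFOs (from the processor slide)
WORD_MASK = 0xFFFF  # 16-bit data bus


class Core:
    """Hypothetical behavioral model of one core's communication ports."""

    def __init__(self, in_select=(0, 1)):
        # Static configuration: which 2 of the 8 inputs feed the two FIFOs.
        if len(set(in_select)) != 2 or not all(0 <= p < NUM_PORTS for p in in_select):
            raise ValueError("in_select must name two distinct input ports")
        self.in_select = tuple(in_select)
        self.fifos = (deque(), deque())
        self.links = [None] * NUM_PORTS  # links[out_port] = (dest_core, dest_port)

    def connect(self, out_port, dest, dest_port):
        """Wire one output port to an input port of another core."""
        self.links[out_port] = (dest, dest_port)

    def _fifo_for(self, port):
        """FIFO attached to an input port, or None if the port is not selected."""
        if port in self.in_select:
            return self.fifos[self.in_select.index(port)]
        return None

    def write(self, value, out_mask):
        """Send one word to every output whose bit is set in out_mask.

        Returns False (modelling a stall) if any selected destination FIFO is full.
        Data sent to an input the destination has not selected is ignored here.
        """
        targets = []
        for port in range(NUM_PORTS):
            if (out_mask >> port) & 1 and self.links[port] is not None:
                dest, dest_port = self.links[port]
                fifo = dest._fifo_for(dest_port)
                if fifo is None:
                    continue
                if len(fifo) >= FIFO_DEPTH:
                    return False
                targets.append(fifo)
        for fifo in targets:
            fifo.append(value & WORD_MASK)
        return True

    def read(self, which):
        """Pop one word from input FIFO 0 or 1; None if it is empty."""
        fifo = self.fifos[which]
        return fifo.popleft() if fifo else None
```

For example, a core connected to two neighbors can multicast one word with `write(value, 0b11)`, and each neighbor then reads it from whichever of its two FIFOs is mapped to that input.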

Long-Distance Communication
Allows communication across tiles without disturbing cores
  – Long-distance links may be pipelined or not, depending on source clock frequency, distance, and latency
[Figure: long-distance link from Source to Destination]
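One way to picture the pipelining decision is as a timing budget: if the wire delay over the whole path exceeds one source clock period, registers are inserted so that each segment fits in a period. The sketch below assumes a fixed wire delay per tile crossed; the function name and the 0.3 ns per-tile delay used in the example are hypothetical, not measured values from the chip.

```python
import math


def pipeline_registers(distance_tiles, source_clock_ghz, delay_per_tile_ns):
    """Registers needed so every wire segment fits in one source clock period.

    delay_per_tile_ns is a hypothetical per-tile wire delay, not chip data.
    """
    period_ns = 1.0 / source_clock_ghz
    total_delay_ns = distance_tiles * delay_per_tile_ns
    segments = max(1, math.ceil(total_delay_ns / period_ns))
    return segments - 1


if __name__ == "__main__":
    # Short link at a fast clock: fits in one period, no pipelining.
    print(pipeline_registers(2, 1.2, 0.3))
    # Longer link at the same clock: needs intermediate registers.
    print(pipeline_registers(6, 1.2, 0.3))
    # The same long link at a slow source clock: no pipelining again.
    print(pipeline_registers(6, 0.3, 0.3))
```

Each inserted register adds a cycle of latency, which is why a slow source clock or a short path may be better served by an unpipelined link.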

Per-Processor DVFS
Each processor tile contains a core that operates at:
  – A fully independent clock frequency
      Any frequency below maximum
      Halts, restarts, and changes arbitrarily
  – Dynamically changeable supply voltage
      VddHigh or VddLow
      Disconnected for leakage reduction
VddAlwaysOn powers DVFS and inter-processor communication
[Figure: Tile]
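The benefit of switching a lightly loaded core to VddLow can be estimated with the standard first-order CMOS model, in which dynamic power scales as C·V²·f and energy per operation scales as V². The sketch below is that textbook model only; it ignores leakage and is not the authors' measurement method. The 0.9 V value for VddLow is an assumption taken from the voltage labels on a later waveform slide.

```python
def relative_energy_per_op(v_low, v_high):
    """Dynamic energy per operation at v_low relative to v_high (scales as V^2)."""
    return (v_low / v_high) ** 2


def relative_dynamic_power(v_low, v_high, f_low, f_high):
    """Dynamic power at (v_low, f_low) relative to (v_high, f_high) (scales as V^2 * f)."""
    return relative_energy_per_op(v_low, v_high) * (f_low / f_high)


if __name__ == "__main__":
    # Assumed rails: VddHigh = 1.3 V (from the slides), VddLow = 0.9 V (assumption).
    ratio = relative_energy_per_op(0.9, 1.3)
    print(f"energy per operation at VddLow: {ratio:.3f} of VddHigh")
```

Under these assumed rails, each operation at VddLow costs a little under half the dynamic energy of the same operation at VddHigh.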

DVFS Controller
Voltage and frequency are set by:
  – Static configuration
  – Software
  – Hardware (controller)
      FIFO “fullness”
      Processor “stalling frequency”
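A hardware controller driven by FIFO fullness could work along the following lines: a filling input FIFO means the core is falling behind and should run on the higher rail, while a nearly empty FIFO means it can drop to the lower rail. This is a hedged sketch of that idea, not the controller on the chip; the function name, the threshold values, and the use of hysteresis are all assumptions.

```python
def next_rail(current, fifo_occupancy, depth=64, high_mark=0.75, low_mark=0.25):
    """Pick the supply rail for the next interval from input FIFO fullness.

    high_mark and low_mark are hypothetical thresholds. Between them the
    current rail is kept, so the core does not toggle on small fluctuations.
    """
    fullness = fifo_occupancy / depth
    if fullness >= high_mark:
        return "VddHigh"  # work is piling up: speed up
    if fullness <= low_mark:
        return "VddLow"   # little pending work: save energy
    return current


if __name__ == "__main__":
    rail = "VddLow"
    for occupancy in (4, 20, 50, 40, 30, 10):
        rail = next_rail(rail, occupancy)
        print(occupancy, rail)
```

The gap between the two thresholds matters because each supply switch normally halts the core, so switching too eagerly would cost performance.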

Tile Layout and Power Grids
48 power gates surround core
Metal 6 and 7 are devoted to power distribution, global and local
  – 5 Vdds: 79% utilization
  – 2 Gnds: 21% utilization
[Figure: Vdd/Gnd metal 6 and 7 usage]
[Figure: tile layout with dimensions 340 μm, 360 μm, 410 μm; labels DMem, Core, Power Gates, IMem, FIFO 1, FIFO 0, Osc, Tile Power Gates and Decaps]

Supply Voltage Switching
The switching speed and profile “shapes” supply currents while switching, to trade off switching time versus power grid noise
Processor cores normally halt during a switch
[Figure: VddHigh noise when VddCore switches from VddLow to VddHigh, for fastest, medium, and slowest switching profiles; traces labeled "VddHigh switch done" and "clock"; voltage axis 0 V to 1.3 V with 1.22 V marked; time axis 50 ns to 54 ns]

Supply Voltage Switching
Slow switching results in negligible power grid noise
Early VddCore disconnect from VddLow results in momentary core voltage drop (circled below)
[Figure: VddCore waveform; axis labels 2 ns, 0.9 V, 1.3 V]

Outline
Background and the First Generation AsAP
The Second Generation AsAP
  – Processors and Shared Memories
  – On-chip Communication
  – DVFS
Analysis and Summary

Tile and Core Area Breakdowns
Communication area approximately 7%
DVFS area approximately 8%
Routing complexity results in 27% for gaps and fillers
[Figure: Tile Area and Core Area breakdown charts; SRAMs 34%]

Die Micrograph and Key Data
65 nm STMicroelectronics low-leakage CMOS
Single Tile
  Transistors: 325,000
  Area: 0.17 mm²
  Max. frequency: V
  Power (100% active): GHz, 1.3 V; MHz, V
  Power (JPEG): GHz, 1.3 V
55 million transistors, 39.4 mm²
[Figure: die micrograph; tile width 410 μm; blocks labeled FFT, Vit, Mot. Est., Mem]

Complete a Baseband Receiver
22 processors
  – 32 processors using only nearest-neighbor connections (46% increase)
54 Mbps throughput MHz, 1.3 V
6x faster than TI C62x, 2x faster than SODA, 4x faster than LART (all scaled to 65 nm technology, 1.3 V and 610 MHz)
[Figure: two mappings, labeled "a without long-distance" and "a with long-distance"]

Summary
65 nm low-power STMicroelectronics process
Maintains the basic GALS architecture of AsAP
164 homogeneous processors
  – 1.2 GHz, 59 mW, 100% active, 1.3 V
Three 16 KB shared memories
Three dedicated-purpose processors
Long-distance circuit-switched communication increases mapping efficiency without overhead
DVFS nets a 48% reduction in energy for JPEG with only 8% performance loss
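The JPEG result can be restated as an energy-delay trade-off. The sketch below is arithmetic on the two percentages from the summary only. It assumes that "8% performance loss" means throughput falls to 92% of the original, so runtime grows by a factor of 1/0.92; if the slide instead means 8% longer runtime, the numbers shift slightly. The function and key names are illustrative.

```python
def dvfs_tradeoff(energy_reduction=0.48, performance_loss=0.08):
    """Relative energy, runtime, energy-delay product, and average power.

    All values are relative to the run without DVFS (1.0 = unchanged).
    Assumes performance_loss is a reduction in throughput.
    """
    energy = 1.0 - energy_reduction
    runtime = 1.0 / (1.0 - performance_loss)
    return {
        "energy": energy,
        "runtime": runtime,
        "energy_delay": energy * runtime,
        "avg_power": energy / runtime,
    }


if __name__ == "__main__":
    for name, value in dvfs_tradeoff().items():
        print(f"{name}: {value:.3f}")
```

Under this reading, the energy-delay product drops to roughly 57% of the original and average power to roughly 48%.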

Acknowledgements
STMicroelectronics
Intel
UC Micro
NSF Grant and CAREER award
SRC GRC
Intellasys
SEM
J.-P. Schoellkopf
K. Torki and S. Dumont
R. Krishnamurthy and M. Anders