A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing

Presentation transcript:

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine Watnik, Paul Mejia, Anh Tran, Jeremy Webb, Eric Work, Zhibin Xiao and Bevan Baas VLSI Computation Lab University of California, Davis

Outline
Goals and Key Ideas
The Second Generation AsAP
  –Processors and Shared Memories
  –On-chip Communication
  –Dynamic Voltage & Clock Frequency
Analysis and Summary

Project Goals
Fully programmable and reconfigurable architecture
High energy efficiency and performance
Exploit task-level parallelism in:
  –Digital Signal Processing
  –Multimedia
Example: a Wi-Fi baseband receiver

Asynchronous Array of Simple Processors (AsAP)
Key Ideas:
  –Programmable, small, and simple fine-grained cores
  –Small local memories sufficient for DSP kernels
  –Globally Asynchronous and Locally Synchronous (GALS) clocking
    Independent clock frequencies on every core
    Local oscillator halts when processor is idle
[Figure labels: f1, f2, f3; Osc, IMem, DMem, Core]

Asynchronous Array of Simple Processors (AsAP)
Key Ideas, cont.:
  –2D mesh, circuit-switched network architecture
    High throughput of one word per clock cycle
    Low area overhead
    Easily scalable array
  –Increased tolerance to process variations
36-processor fully-functional chip, 0.18 µm, V, 0.66 mm² per processor
[HotChips 06, ISSCC 06, TVLSI 07, JSSC 08,…]


New Challenges Addressed
1. Reduction in the power dissipation of
  –Lightly-loaded processors (lowering Vdd)
  –Unused processors (leakage)
2. Achieving very high efficiencies and speed on common demanding tasks such as FFTs, video motion estimation, and Viterbi decoding
3. Larger, area-efficient on-chip memories
4. Efficient, low-overhead communication between distant processors

167-processor Computational Platform
Key features
  –164 enhanced programmable processors
  –3 dedicated-purpose processors
  –3 shared memories
  –Long-distance circuit-switched communication network
  –Dynamic Voltage and Frequency Scaling (DVFS)

Homogeneous Processors
In-order, single-issue, 6-stage processors
  –16-bit datapath with MAC and 40-bit accumulator
  –128x16-bit data memory
  –128x35-bit instruction memory
  –Two 64x16-bit FIFOs for inter-processor communication
  –Over 60 basic instructions and features geared for DSP and multimedia workloads
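The pairing of a 16-bit datapath with a 40-bit accumulator can be illustrated with a small model. A product of two 16-bit values needs up to 32 bits, so a 40-bit accumulator leaves roughly 8 guard bits for summing many products. The sketch below assumes signed two's-complement operands and wrap-around on overflow; the slide states neither, and the function names are hypothetical.

```python
WORD_BITS = 16  # datapath width from the slide
ACC_BITS = 40   # accumulator width from the slide


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b):
    """One multiply-accumulate step, wrapped to ACC_BITS (assumed behavior)."""
    product = to_signed(a, WORD_BITS) * to_signed(b, WORD_BITS)
    return to_signed(acc + product, ACC_BITS)


def dot(xs, ys):
    """Dot product computed as a chain of MAC steps."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


# The largest product magnitude is (-32768) * (-32768) = 2**30, and the
# largest positive 40-bit value is 2**39 - 1, so 511 worst-case products
# accumulate without wrapping under these assumptions; the 512th wraps.
worst = [-32768] * 511
print(dot(worst, worst) == 511 * 2**30)
```

For typical DSP kernels such as FIR filters, operands are far from the worst case, so the practical headroom is larger than this bound suggests.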

Fast Fourier Transform (FFT)
Uses
  –OFDM modulation
  –Spectral analysis, synthesis
Runtime configurable from 16-pt to 4096-pt transforms, FFT and IFFT
1.01 mm²
Preliminary measurements: functional at 866 MHz, V
  –681 M complex Sample/s with 1024-pt complex FFTs
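The quoted FFT figures can be turned into per-transform numbers with simple arithmetic. The sketch below assumes the 681 M complex samples/s rate is sustained and was measured at the quoted 866 MHz clock; the function names are illustrative only.

```python
def cycles_per_sample(clock_hz, samples_per_s):
    """Average clock cycles spent per complex sample."""
    return clock_hz / samples_per_s


def transform_time_us(points, samples_per_s):
    """Time to stream one transform of `points` samples, in microseconds."""
    return points / samples_per_s * 1e6


print(round(cycles_per_sample(866e6, 681e6), 2))  # about 1.27 cycles per sample
print(round(transform_time_us(1024, 681e6), 2))   # about 1.5 microseconds per 1024-pt FFT
```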

Viterbi Decoder
Uses
  –Fundamental communications function (wired, wireless, etc.)
  –Storage apps; e.g., hard drives
Decodes configurable codes up to constraint length 10 with up to 32 different rates
0.17 mm²
Preliminary measurements: functional at 894 MHz, V
  –82 Mbps at rate=1/2

Motion Estimation for Video Encoding
Uses
  –H.264, MPEG-2, etc. encoders
Supports a number of fixed and programmable search patterns, including all H.264-specified block sizes within a 48x48 search range
0.67 mm²
Preliminary measurements: functional at 938 MHz, V
  –15 billion SADs/sec
  –Supports 1080p 30fps
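The SAD (sum of absolute differences) counted in the throughput figure is the block-matching cost used in motion estimation. The sketch below shows SAD and an exhaustive search over a reference frame in plain Python. It is only an illustration of the metric: the accelerator's actual search patterns and data layout are not described on the slide, and all function names here are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def crop(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def full_search(cur_block, ref_frame, size):
    """Try every block position in ref_frame; return (best_sad, top, left)."""
    best = None
    for top in range(len(ref_frame) - size + 1):
        for left in range(len(ref_frame[0]) - size + 1):
            cost = sad(cur_block, crop(ref_frame, top, left, size))
            if best is None or cost < best[0]:
                best = (cost, top, left)
    return best


ref = [
    [0, 0, 0, 0],
    [0, 9, 8, 0],
    [0, 7, 6, 0],
    [0, 0, 0, 0],
]
cur = [[9, 8], [7, 6]]
print(full_search(cur, ref, 2))  # (0, 1, 1): exact match at row 1, column 1
```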

Shared Memories
Ports for up to four processors (two connected in this chip) to directly connect to the memory block
  –Port priority
  –Port request arbitration
  –Programmable address generation supporting multiple addressing modes
  –Uses a 16 KByte single-ported SRAM
  –One read or write per cycle
0.34 mm²
Preliminary measurements: functional at 1.3 GHz, V
  –20.8 Gbps peak throughput
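The 20.8 Gbps peak figure is consistent with one 16-bit access per cycle at 1.3 GHz. The check below assumes a 16-bit memory word (matching the processors' 16-bit datapath; the slide does not state the word width) and 1 KByte = 1024 bytes.

```python
def peak_throughput_gbps(word_bits, clock_hz):
    """Peak throughput of a memory doing one access per cycle, in Gbit/s."""
    return word_bits * clock_hz / 1e9


def word_count(capacity_bytes, word_bits):
    """Number of addressable words in the SRAM."""
    return capacity_bytes * 8 // word_bits


print(peak_throughput_gbps(16, 1.3e9))  # 20.8
print(word_count(16 * 1024, 16))        # 8192 words, i.e. 13 address bits
```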

Inter-Processor Communication
Circuit-switched source-synchronous communication
  –8 software-controlled outputs and 2 configurable circuit-switched inputs (out of 8 total possible inputs)
  –Long-distance communication can occur across tiles without disturbing local cores
  –Long-distance links can be pipelined
[Figure labels: freq1, VddLow; freq2, Off; freq3, VddHigh]

Per-Processor Dynamic Voltage & Clock Frequency
Each processor tile contains a core that operates at:
  –A fully-independent clock frequency
    Any frequency below maximum
    Halts, restarts, and changes arbitrarily
  –Dynamically-changeable supply voltage
    VddHigh or VddLow
    Disconnected for leakage reduction
Each power gate comprises 48 individually-controllable parallel transistors
VddAlwaysOn powers DVFS and inter-processor communication

Dynamic Voltage & Frequency Controller
Voltage and frequency are set by:
  –Static configuration
  –Software
  –Hardware (controller)
    FIFO “fullness”
    Processor “stalling frequency”
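One way a hardware controller could use FIFO "fullness" is a two-threshold rule with hysteresis: switch to the high supply when the input FIFO backs up, drop to the low supply when it drains. The sketch below is a hypothetical policy, not the chip's documented algorithm; the thresholds, the function names, and the choice to look only at FIFO occupancy are all assumptions. Only the 64-word FIFO depth and the VddHigh/VddLow rail names come from the slides.

```python
FIFO_DEPTH = 64  # words per FIFO, from the processor slide


def next_supply(current, occupancy, low=16, high=48):
    """Pick the next supply rail from input-FIFO occupancy (assumed policy)."""
    if not 0 <= occupancy <= FIFO_DEPTH:
        raise ValueError("occupancy out of range")
    if occupancy >= high:
        return "VddHigh"  # work is backing up: speed up
    if occupancy <= low:
        return "VddLow"   # FIFO nearly drained: save energy
    return current        # inside the hysteresis band: hold the current rail


def run(trace, start="VddLow"):
    """Apply the policy to a sequence of occupancy samples."""
    state, states = start, []
    for occupancy in trace:
        state = next_supply(state, occupancy)
        states.append(state)
    return states


print(run([10, 30, 50, 30, 12]))
# ['VddLow', 'VddLow', 'VddHigh', 'VddHigh', 'VddLow']
```

The hysteresis band keeps the rail from toggling on every small change in occupancy, which matters because each supply switch takes time and disturbs the core voltage.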

Measured Supply Voltage Switching
Slow switching results in negligible power grid noise
Early VddCore disconnect from VddLow with oscillator running results in a momentary VddCore voltage droop (circled below)
[Scope trace labels: 2 ns, 0.9 V, 1.3 V]

Measured Supply Voltage Switching
Oscillator halting while VddCore disconnects from VddLow and connects to VddHigh results in a negligible voltage droop due to leakage
[Scope trace labels: 2 ns, 0.9 V, 1.3 V]


Die Micrograph and Key Data
Single Tile
  Transistors: 325,000
  Area: 0.17 mm²
  CMOS Tech.: 65 nm ST Microelectronics low-leakage
  Max. frequency: V
  Power (100% active): GHz, 1.3 V; MHz, V
  App. power (802.11a rx): MHz, 1.3 V
55 million transistors, 39.4 mm²
[Die micrograph labels: FFT, Vit, Mot. Est., Mem; dimensions mm, 410 μm]

New Parallel Processing Paradigm
Intel bit CPU, 1971
  –Utilized 2300 transistors
The presented chip would have 2300 processors in 19.8 mm x 19.8 mm
New parallel processing paradigm
  –Enabled by numerous efficient processors
  –Focus on simplified programming and access to large data sets
  –Much less focus on load balancing or “wasting” processors for things like memories or routing data
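The 19.8 mm x 19.8 mm figure follows from the per-tile area on the key-data slide. The check below assumes a square die filled only with 0.17 mm² tiles, ignoring pads, shared memories, and accelerators; the function name is illustrative.

```python
import math


def square_die_side_mm(n_tiles, tile_area_mm2):
    """Side length of a square die holding n_tiles tiles and nothing else."""
    return math.sqrt(n_tiles * tile_area_mm2)


side = square_die_side_mm(2300, 0.17)  # 2300 * 0.17 = 391 mm^2 total
print(round(side, 1))                  # 19.8
```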

H.264 CAVLC Encoder
Context-adaptive variable length coding (CAVLC) used in H.264 baseline encoder
15 processors with one shared memory
30fps 720p 1.07GHz
~ times the throughput of TI C62x and ADSP BF561 (scaled to 65 nm, 1.3 V)

Complete 802.11a Baseband Receiver
22 processors plus Viterbi and FFT accelerators
Includes: frame detection and synchronization, carrier-frequency offset estimation and compensation, channel equalization

Complete 802.11a Baseband Receiver
54 Mbps throughput, MHz, 1.3 V
23x faster than TI C62x, 5x faster than StrongARM, 2x faster than SODA (all scaled to V)
[Figure labels: VIT, FFT, X]

Complete 802.11a Baseband Receiver
Re-mapped graph avoids bad processors
  –Yield enhancement
  –Self-healing
[Figure labels: VIT, FFT, X]

Summary
All processors and shared memories contain fully independent clock oscillators
164 homogeneous processors
  –1.2 GHz, 59 mW, 100% active, 1.3 V
  –608 μW, 100% active, 66 MHz, V
Three 16 KB shared memories
Three dedicated-purpose processors
Long-distance circuit-switched communication increases mapping efficiency with low overhead
DVFS nets a 48% reduction in energy for JPEG application with an 8% performance loss
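The two power figures for a single processor can be compared as energy per clock cycle (power divided by frequency). The sketch below uses only the numbers on the summary slide; the function name is illustrative, and treating one cycle as one unit of work is a simplifying assumption.

```python
def energy_per_cycle_pj(power_w, freq_hz):
    """Energy per clock cycle in picojoules."""
    return power_w / freq_hz * 1e12


fast = energy_per_cycle_pj(59e-3, 1.2e9)  # 59 mW at 1.2 GHz
slow = energy_per_cycle_pj(608e-6, 66e6)  # 608 uW at 66 MHz

print(round(fast, 1))         # 49.2 pJ per cycle
print(round(slow, 1))         # 9.2 pJ per cycle
print(round(fast / slow, 1))  # 5.3x less energy per cycle at the low operating point
```

This gap is what per-processor DVFS exploits: a lightly loaded processor that can tolerate a lower clock rate spends several times less energy per cycle at the reduced supply voltage.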

Acknowledgements
ST Microelectronics
NSF Grant and CAREER award
Intel
SRC GRC Grant 1598 and CSR Grant 1659
Intellasys
UC Micro
SEM
J.-P. Schoellkopf, K. Torki, S. Dumont, Y.-P. Cheng, R. Krishnamurthy and M. Anders