1 Advanced VLSI Course Class Presentation An Asynchronous Array of Simple Processors for DSP Applications Rasoul Yousefi Electrical and Computer Engineering.

Slides:

Advertisements

Similar presentations

The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat,

Advertisements

1 A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky *Dept. of Electrical Engineering Dept. of Computer.

Altera FLEX 10K technology in Real Time Application.

THE RAW MICROPROCESSOR: A COMPUTATIONAL FABRIC FOR SOFTWARE CIRCUITS AND GENERAL- PURPOSE PROGRAMS Taylor, M.B.; Kim, J.; Miller, J.; Wentzlaff, D.; Ghodrat,

A reconfigurable system featuring dynamically extensible embedded microprocessor, FPGA, and customizable I/O Borgatti, M. Lertora, F. Foret, B. Cali, L.

Houshmand Shirani-mehr 1,2, Tinoosh Mohsenin 3, Bevan Baas 1 1 VCL Computation Lab, ECE Department, UC Davis 2 Intel Corporation, Folsom, CA 3 University.

Presenter: Jeremy W. Webb Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Processor Architectures At A Glance: M.I.T.

Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.

Synchronous Digital Design Methodology and Guidelines

1 Many Core Platform For DSP Applications Seminar in VLSI Architecture Ophir Erez.

Spring 08, Jan 15 ELEC 7770: Advanced VLSI Design (Agrawal) 1 ELEC 7770 Advanced VLSI Design Spring 2007 Introduction Vishwani D. Agrawal James J. Danaher.

Low Power Design for Wireless Sensor Networks Aki Happonen.

Spring 07, Jan 16 ELEC 7770: Advanced VLSI Design (Agrawal) 1 ELEC 7770 Advanced VLSI Design Spring 2007 Introduction Vishwani D. Agrawal James J. Danaher.

Micro-Architecture Techniques for Sensor Network Processors Amir Javidi EECS 598 Feb 25, 2010.

Low-voltage techniques Mohammad Sharifkhani. Reading Text Book I, Chapter 4 Text Book II, Section 11.7.

1 Design and Implementation of Turbo Decoders for Software Defined Radio Yuan Lin 1, Scott Mahlke 1, Trevor Mudge 1, Chaitali.

Author: D. Brooks, V.Tiwari and M. Martonosi Reviewer: Junxia Ma

By Praveen Venkataramani Vishwani D. Agrawal TEST PROGRAMMING FOR POWER CONSTRAINED DEVICES 5/9/201322ND IEEE NORTH ATLANTIC TEST WORKSHOP 1.

Octavo: An FPGA-Centric Processor Architecture Charles Eric LaForest J. Gregory Steffan ECE, University of Toronto FPGA 2012, February 24.

Single-Chip Multi-Processors (CMP) PRADEEP DANDAMUDI 1 ELEC , Fall 08.

156 / MAPLD 2005 Rollins 1 Reducing Energy in FPGA Multipliers Through Glitch Reduction Nathan Rollins and Michael J. Wirthlin Department of Electrical.

1 Copyright © 2011, Elsevier Inc. All rights Reserved. Appendix E Authors: John Hennessy & David Patterson.

Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.

Institute of Applied Microelectronics and Computer Engineering College of Computer Science and Electrical Engineering, University of Rostock Slide 1 Spezielle.

Basics and Architectures

Highest Performance Programmable DSP Solution September 17, 2015.

RICE UNIVERSITY Implementing the Viterbi algorithm on programmable processors Sridhar Rajagopal Elec 696

Power Reduction for FPGA using Multiple Vdd/Vth

CAD for Physical Design of VLSI Circuits

MOUSETRAP Ultra-High-Speed Transition-Signaling Asynchronous Pipelines Montek Singh & Steven M. Nowick Department of Computer Science Columbia University,

Logic Synthesis for Low Power(CHAPTER 6) 6.1 Introduction 6.2 Power Estimation Techniques 6.3 Power Minimization Techniques 6.4 Summary.

1 EE 587 SoC Design & Test Partha Pande School of EECS Washington State University

1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah

ASIP Architecture for Future Wireless Systems: Flexibility and Customization Joseph Cavallaro and Predrag Radosavljevic Rice University Center for Multimedia.

Tinoosh Mohsenin and Bevan M. Baas VLSI Computation Lab, ECE Department University of California, Davis Split-Row: A Reduced Complexity, High Throughput.

J. Christiansen, CERN - EP/MIC

A 167-processor 65 nm Computational Platform with Per-Processor Dynamic Supply Voltage and Dynamic Clock Frequency Scaling Dean Truong, Wayne Cheng, Tinoosh.

A 240ps 64b Carry-Lookahead Adder in 90nm CMOS Faezeh Montazeri Advanced VLSI Course Presentation University of Tehran December.

Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.

Radix-2 2 Based Low Power Reconfigurable FFT Processor Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Gin-Der Wu and Yi-Ming Liu Department.

Lecture 10: Logic Emulation October 8, 2013 ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation.

Lecture 13: Logic Emulation October 25, 2004 ECE 697F Reconfigurable Computing Lecture 13 Logic Emulation.

Performance and Power Analysis of Globally Asynchronous Locally Synchronous Multiprocessor Systems Zhiyi Yu, Bevan M. Baas VLSI Computation Lab, ECE department,

COARSE GRAINED RECONFIGURABLE ARCHITECTURES 04/18/2014 Aditi Sharma Dhiraj Chaudhary Pruthvi Gowda Rachana Raj Sunku DAY

Development of Programmable Architecture for Base-Band Processing S. Leung, A. Postula, Univ. of Queensland, Australia A. Hemani, Royal Institute of Tech.,

Tools - LogiBLOX - Chapter 5 slide 1 FPGA Tools Course The LogiBLOX GUI and the Core Generator LogiBLOX L BX.

1 Leakage Power Analysis of a 90nm FPGA Authors: Tim Tuan (Xilinx), Bocheng Lai (UCLA) Presenter: Sang-Kyo Han (ECE, University of Maryland) Published.

Implementing algorithms for advanced communication systems -- My bag of tricks Sridhar Rajagopal Electrical and Computer Engineering This work is supported.

Computer Organization CDA 3103 Dr. Hassan Foroosh Dept. of Computer Science UCF © Copyright Hassan Foroosh 2002.

DSP Architectures Additional Slides Professor S. Srinivasan Electrical Engineering Department I.I.T.-Madras, Chennai –

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu,

Multi-Split-Row Threshold Decoding Implementations for LDPC Codes

FPGA-Based System Design: Chapter 1 Copyright  2004 Prentice Hall PTR Moore’s Law n Gordon Moore: co-founder of Intel. n Predicted that number of transistors.

Baseband Implementation of an OFDM System for 60GHz Radios: From Concept to Silicon Jing Zhang University of Toronto.

An Asynchronous Array of Simple Processors for DSP Applications Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari, Michael Lai, Jeremy Webb, Eric.

SCORES: A Scalable and Parametric Streams-Based Communication Architecture for Modular Reconfigurable Systems Abelardo Jara-Berrocal, Ann Gordon-Ross NSF.

Seok-jae, Lee VLSI Signal Processing Lab. Korea University

A 1.2V 26mW Configurable Multiuser Mobile MIMO-OFDM/-OFDMA Baseband Processor Motivations –Most are single user, SISO, downlink OFDM solutions –Training.

Implementing Tile-based Chip Multiprocessors with GALS Clocking Styles Zhiyi Yu, Bevan Baas VLSI Computation Lab, ECE Department University of California,

A Low-Area Interconnect Architecture for Chip Multiprocessors Zhiyi Yu and Bevan Baas VLSI Computation Lab ECE Department, UC Davis.

VU-Advanced Computer Architecture Lecture 1-Introduction 1 Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 1.

Low-power Digital Signal Processing for Mobile Phone chipsets

Lynn Choi School of Electrical Engineering

Architecture & Organization 1

An Improved Split-Row Threshold Decoding Algorithm for LDPC Codes

Architecture & Organization 1

A 100 µW, 16-Channel, Spike-Sorting ASIC with On-the-Fly Clustering

High Throughput LDPC Decoders Using a Multiple Split-Row Method

The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.

A High Performance SoC: PkunityTM

Presentation transcript:

1 Advanced VLSI Course Class Presentation An Asynchronous Array of Simple Processors for DSP Applications Rasoul Yousefi Electrical and Computer Engineering Department University of Tehran

2 Copyright Slides used –Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari, Michael Lai, Jeremy Webb, Eric Work, Tinoosh Mohsenin, Mandeep Singh, Bevan Baas “An Asynchronous Array of Simple rocessors for DSP Applications”, VLSI Computation Lab, ECE Department,University of California, Davis, USA,2006 IEEE International Solid-State Circuits Conference

3 Outline The idea Project objectives Key features of the AsAP processor Design of the AsAP processor Results Trends

4 The idea Using simple processor cores with small instruction and data memory to dramatically reduce area and power while increasing performance [1]

5 Outline The idea Project objectives Key features of the AsAP processor Design of the AsAP processor Results Trends

6 Project Objectives Programming flexibility Increasing performance Reducing power Reducing area Suitable for future fabrication technologies Performance & Energy efficiency Programming flexibility ASIC FPGA Prog. DSP AsAP Performance, Area, Power improvement !!! [1]

7 Outline The idea Project objectives Key features of the AsAP processor Design of the AsAP processor Results Trends

8 Key Features of the AsAP Processor Chip multiprocessor High performance Small memory & simple processor High energy efficiency Globally asynchronous locally synchronous (GALS) Technology scalability Nearest neighbor communication

9 High Performance Through Chip Multiprocessor Increasing the clock frequency is challenging Parallelism is more promising –Instruction level parallelism (VLIW, Superscalar) –Data level parallelism (SIMD) –Task level parallelism Higher clock frequency Increased design complexity & lower energy efficiency Deeper pipeline

10 Task Level Parallelism Well suited for DSP applications scram coding inter- leave mod. map loadfft inter- leave IFFT win- dow up- samp filter scale clip up- samp filter in out training pilots a/g wireless LAN (54 Mbps, 5 GHz) baseband transmit path Task1 Task2 Task3 A B C Memory Task1Task2 A B C Proc. Proc.1 Proc.2 Task3 Proc.3 Improves performance and potentially reduces memory size … Widely available in many DSP applications … … [1]

11 Memory Size in Modern Processors Memory occupies much of the area in modern processors — this reduces area available for the core and consumes large amounts of power Memories frequently contain the critical path Area and energy dissipation can be dramatically decreased while speed can be increased with smaller memories Area breakdown other mem core TI_C64x Itanium SPARC BlGe/L [ISSCC 02, 05] [from1]

12 Can we use small memory ?

13 Small Memory Requirements for DSP Tasks The memory required for common DSP tasks is quite small Several hundred words of memory are sufficient for many DSP tasks TaskIMem (words) DMem (words) N-pt FIR 6 2N 8-pt DCT x8 2-D DCT Conv. coding (k = 7) Huffman encode N-pt convolution 29 2N 64-pt complex FFT Bubble sort 20 1 N merge sort 50 N Square root Exponential [1]

14 A system with many processors can use individual processors or groups of processors to compute individual tasks and intermediate data can be transmitted between processors with a small memory requirement So it is possible to use small memory

15 GALS Clocking Style The challenge of globally synchronous systems –Design difficulty when using high clock frequencies, long clock wires caused by large chip sizes, and large circuit parameter variations –High clock power consumption and lack of flexibility to independently control clock frequencies How about totally asynchronous design style –Lack of EDA tool support –Design complexity and circuit overhead The GALS compromise –Synchronous blocks each operating in their own independent clock domain

16 The GALS architecture basics Power breakdown in a high-performance CPU[2] Datapath Memory Control I/O Clock [Adopted from 2]

17 The GALS architecture basics Eliminating global clock  Eliminating a major source of power consumption[2] Frequency of each SB can be tailored to the local needs  Reducing average frequency and overall power consumption[2] SB1 SB2 SB3 Data Handshake protocol [Adopted from 2]

18 On Chip Communication Methods to avoid global wires –Network on chip (NOC) –Local communication –Nearest neighbor communication nearest local [1]

19 Outline The idea Project objectives Key features of the AsAP processor Design of the AsAP processor Results Trends

20 AsAP Block Diagram GALS array of identical processors –Each processor is a reduced complexity programmable DSP with small memories –Each processor can receive data from any two neighbors and send data to any of its four neighbors [1]

21 IFetch PC Inst. Mem Decode Mem. Read Src Select EXE 1EXE 2EXE 3 Result Select Write Back FIFO 0 RD FIFO 1 RD Data Mem Read DC Mem Read De- code Addr. Gens. ALU Data Mem Write DC Mem Write Proc Output Single Processor Architecture bit instructions 9-stage pipeline Bypass Multiply Accumulator × ACCACC + [1]

22 reset halt clk Programmable Clock Oscillator Standard cell based Configurable frequency –Delay tunable stages using 7 parallel tri-state inverters –1 to 128 clock divider Results –1.66 MHz – 702 MHz –Max gap: 0.08 MHz (1.66 – 500 MHz) ………… … … … … stage_sel... Clock divider [1]

23 Write side Read side east west south Write logic FIFO 0 Read logic east north west south CPUCPU FIFO 1 Processor east north west south Inter-processor Communication Each processor contains two dual-clock FIFOs Rd/Wr in separate clock domains north Binary Gray & Sync. addr SRAM [1]

24 Advantages of the AsAP Clocking System Simplified clock tree design –The maximum span is < 1 mm in 0.18 µm Scalable – easy to add processors Improved energy efficiency –Clock halts in 9 cycles (processor dissipates leakage power only) and restores in < 1 cycle according to work availability –53% and 65% power saving –Independent clock and voltage scaling (individual processor voltage scaling not implemented in this version)

25 Physical Design 0.18 µm TSMC Standard cell Design flow –Completely synthesized, except oscillator –Macro memory blocks used for four main memories –Completely auto placed and routed To simplify physical design, clock gating not implemented in this version

26 Transistors: 1 Proc 230,000 Chip 8.5 million Max speed: V Area: 1 Proc 0.66 mm² Chip 32.1 mm² Power (1 1.8V, 475 MHz): Typical application 32 mW Typical 100% active 84 mW Worst case 144 mW Power (1 0.9V, 116 MHz): Typical application 2.4 mW Single Processor OSC FIFOs DMem IMem 810 µm 5.65 mm 810 µm 5.68 mm Chip Micrograph of the 6 x 6 Array [1]

27 Outline The idea Project objectives Key features of the AsAP processor Design of the AsAP processor Results Trends

28 Area Evaluation Most of AsAP’s area is for the core (66%) Each processor requires a very small area; more than 20x smaller than others Area breakdown comm mem core Processor area (mm ² ) 20x All scaled to 0.13 µm [ISSCC 05, 00; ISCA04; CMPON96] TI RAW Fujitsu BlGe/ AsAP C64x 4-VLIW L TI CELL/ Fujitsu RAW ARM AsAP C64x SPE 4-VLIW [1]

29 Power and Performance Power / Clock frequency / Scale * (mW/MHz) Peak performance density (MOPS/mm²) All scaled to 0.13 µm * Assume 2 ops/cycle for CELL/SPE and 3.3 ops/cycle for TI C64x Note: word widths and workload not factored [ISSCC 05, 00; ISCA04; CMPON96] Higher performance ÷ area, and Lower estimated energy [1]

30 Pad Scram Conv. Code Punc Inter- leave 1 Inter- leave 2 Mod. Map Pilot Insert Train IFFT BR IFFT Mem IFFT BF IFFT Output GI/ Wind. GI/ Wind. IFFT Mem IFFT BF FIR IFFT BF IFFT Mem Output Sync IFFT Data bits To D/A converter a/802.11g Wireless Transmitter Implementation 22 processors Fully functional on chip MHz 30% of 54 Mb/s Code unscheduled and lightly optimized ~10x performance and 35x – 75x lower energy dissipation than 8-way VLIW TI C62x [SIPS 04; ICC 02] [1]

31 Summary Scalable programmable processing array –Many processors on a single chip –Reduced complexity processors with small memories –GALS clocking style –Nearest neighbor communication Results –0.18 µm, V 32 mW application power 84 mW 100% active power 2.4 mW application 116 MHz and 0.9 V –High performance density: 475 MOPS in 0.66 mm² –Well suited for future fabrication technologies

32 Trends Clock Gating Dynamic voltage scaling but adds level- conversion for asynchronous communication interface[3] CAD tool –CAD tools for asynchronous design is mature, but not commercially strong yet.[3]

33 References 1. Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari, Michael Lai, Jeremy Webb, Eric Work, Tinoosh Mohsenin, Mandeep Singh, Bevan Baas, “ An Asynchronous Array of Simple rocessors for DSP Applications ”, VLSI Computation Lab, ECE Department, University of California, Davis, USA,2006 IEEE International Solid-State Circuits Conference 2. A.Hemani, T.Meincke, S.Kumar, A.Postula, T.olsson, P.Nilsson, J.Oberg, P.Ellervee,D.Lundqvist, “ Lowering power consumption in clock by using Globally Asynchronous Locally Synchronous design style” 3. Anoop Iyer,Diana Marculescu, “Power and Performance Evaluation of Globally Asynchronous Locally Synchronous Processors”, Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA.02)

34 Question? Thanks for your attention