Survey of Digital Signal Processors
Michael Warner
ECD: VLSI Communication Systems


Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems

Moore's Law Drives Processor Development
[Figure: transistors per die versus year, 1960 to 2010, for Intel microprocessors (486, Pentium, Pentium II, Pentium III, Pentium 4, Itanium). Source: Intel internal.]
Regularly doubling the number of transistors at the same price point drives significant product opportunities, especially if you have little regard for power.
But what if energy-delay had to be reduced by an order of magnitude every generation?

Gene's Law Drives DSP Development
[Figure: Gene's Law, DSP power (mW/MIPS) versus year.]
Gene's Law will have its challenges to hold the line!

What's Driving Gene's Law?
- Digital Audio: MP3, Real Audio
- Streaming Video: MPEG 4, H.263
- Connectivity: Internet, Bluetooth
- Modem Standards: UMTS, GSM

DSP Design Constraints: Device Capabilities

Technology (µm)  Transistors  MIPS   RAM (bytes)  Power (mW/MIPS)  Price/MIPS
3                50K          …      …            …                $…
…                …K           40     2K           12.5             $…
…                …M           5,000  3M           0.1              $…
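The Power (mW/MIPS) column becomes a total power figure once it is multiplied by throughput. The sketch below works that arithmetic in integer microwatts for the two entries whose MIPS and mW/MIPS values both survive (40 MIPS at 12.5 mW/MIPS, and 5,000 MIPS at 0.1 mW/MIPS). Pairing those values row by row is an assumption about the table layout, and the function name `total_power_uw` is hypothetical.

```c
/* Hypothetical helper: total power in microwatts.
 * mips        - throughput in MIPS
 * uw_per_mips - efficiency in microwatts per MIPS (1 mW/MIPS = 1000 uW/MIPS)
 * Integer units keep the arithmetic exact. */
long total_power_uw(long mips, long uw_per_mips)
{
    return mips * uw_per_mips;
}
```

Under that pairing, `total_power_uw(40, 12500)` and `total_power_uw(5000, 100)` both evaluate to 500000 uW, or 500 mW: throughput rises by more than two orders of magnitude while the power budget stays flat.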

What Makes a DSP a DSP?

Hard Real-Time
- Single-Cycle MAC
- Multiple Execution Units
- Custom Data Path
- High Bandwidth (Flat) Memory Sub-Systems
- Dual Access Memory
- Efficient Zero-Overhead Looping
- Short Pipeline
- High Bandwidth I/O
- Specialized Instruction Sets
- Low Latency Interrupts
- Sophisticated DMA
- No Speculation
- RTOS

Soft Real-Time (Application Processor)
- Single-Cycle MAC
- Multiple Execution Units
- Custom Data Path
- L1D$, L1I$, L2$ with MMU
- Speculative Fetching and Branching
- Virtual Memory
- Protected Memory
- Virtual Machines
- Semaphores
- Context Save and Restore
- Threading: SMT, IMT
- Efficient Zero-Overhead Looping
- Short Pipeline
- High Bandwidth I/O
- Specialized Instruction Sets
- Low Latency Interrupts
- Sophisticated DMA
- O/S

Single Cycle MAC
- MACs Typically Determine DSP Performance and Pipeline Length (EX)
- Most DSPs Have 2-8 MAC Units
- MACs Typically Operate in Both a Scalar and Vector Mode
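To make the MAC operation concrete, here is a minimal C sketch of the multiply-accumulate at the heart of an FIR filter tap loop. It is plain portable C, not code for any particular DSP; the names `mac16` and `fir_dot` are hypothetical, and the 16-bit operand and 64-bit accumulator widths are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One multiply-accumulate step: acc + a * b.
 * The 16x16 product fits in 32 bits; the wide accumulator leaves
 * headroom so long sums do not overflow. */
int64_t mac16(int64_t acc, int16_t a, int16_t b)
{
    return acc + (int32_t)a * (int32_t)b;
}

/* Dot product of n coefficients with n samples: exactly n MAC steps.
 * Hardware with a single-cycle MAC retires one such step per cycle
 * per MAC unit. */
int64_t fir_dot(const int16_t *coeff, const int16_t *x, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc = mac16(acc, coeff[i], x[i]);
    return acc;
}
```

For coefficients {1, 2, 3} and samples {4, 5, 6}, `fir_dot` returns 1*4 + 2*5 + 3*6 = 32.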

Multiple Instruction Units
- VLIW Architectures Driving ILP
- Typical Instruction Units
  - M-Unit: MAC
  - S-Unit: Shift
  - L-Unit: ALU
  - D-Unit: Load/Store
- Industry Has Converged on an ILP of ~8
[Figure: data path diagram with two register files (A0 - A15 and B0 - B15), each feeding L, S, M, and D units, with cross paths 1X and 2X and load-data paths DDATA_I1 and DDATA_I2.]
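One consequence of having a fixed set of functional units is that instructions issued together must each claim a different unit. The sketch below checks that constraint for a group of parallel instructions. It assumes eight units (L, S, M, D on each of two data paths, as in the diagram); the names `enum unit` and `packet_is_valid` are hypothetical, and real hardware imposes further constraints (cross paths, register ports) that are not modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Eight functional units: L, S, M, D on each of two data paths. */
enum unit {
    UNIT_L1, UNIT_S1, UNIT_M1, UNIT_D1,
    UNIT_L2, UNIT_S2, UNIT_M2, UNIT_D2,
    UNIT_COUNT
};

/* Returns true if no functional unit is claimed twice by the n
 * instructions that are meant to issue in the same cycle. */
bool packet_is_valid(const enum unit *slots, size_t n)
{
    unsigned used = 0;            /* bit i set means unit i is taken */

    if (n > UNIT_COUNT)
        return false;             /* more instructions than units */

    for (size_t i = 0; i < n; ++i) {
        unsigned bit = 1u << slots[i];
        if (used & bit)
            return false;         /* unit already claimed */
        used |= bit;
    }
    return true;
}
```

Two MACs on M1 and M2 plus a load on D1 can issue together; two MACs that both need M1 cannot.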

High Bandwidth Memory Sub-Systems
- Multiple Load-Store Units Required to Feed Data Path
- Tightly Coupled Memory is Typically Dual Ported
- Harvard Architecture is Heavily Banked
[Figure: block diagram of external and internal memory feeding the central arithmetic logic unit (ALU, shifter, MAC) through multiplexers.]

Specialized Instruction Sets
- Base RISC ISA Plus CISC ISA Driven by End Application
  - MAC
  - SAD
  - LMS
  - FIRS
  - Viterbi
- Support For Both Scalar and Vector Instructions
- Support For 8, 16 and 32-Bit Instructions
- Instructions are Highly Orthogonal
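As an illustration of what one of these application-driven instructions replaces, here is a scalar C sketch of SAD (sum of absolute differences), the inner loop of block matching in video encoding. The function name `sad_u8` and the 8-bit pixel type are assumptions for illustration; a DSP with a SAD instruction would fold the subtract, absolute value, and accumulate for several pixels into one operation.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two blocks of n 8-bit pixels. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        int d = (int)a[i] - (int)b[i];       /* signed difference */
        sum += (uint32_t)(d < 0 ? -d : d);   /* accumulate |d| */
    }
    return sum;
}
```

For blocks {10, 20, 30, 40} and {12, 18, 30, 45}, the result is 2 + 2 + 0 + 5 = 9.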

Scalar (55x) vs VLIW (64x)
- Scalar DSPs Tend to be More CISC Like
  - Hurts Compiler Performance
  - Improves Energy-Delay
  - Improves Code Density
  - Limits Top End Performance
- VLIW DSPs Tend to be More RISC Like
  - RISC + GP Regs + Orthogonality Makes For a Good C Compiler
  - Assembler Code Is Challenging
  - RISC ISA Allows for Higher Frequencies
  - Load-Store Hurts Energy-Delay

TMS320C54x

TMS320C54x Protected Pipeline
[Figure: pipeline timing diagram over cycles, showing instructions 1 through 6 advancing through the P, F, D, A, R, and X stages until the pipeline is fully loaded.]
- Prefetch: Calculate address of instruction
- Fetch: Collect instruction
- Decode: Interpret instruction
- Access: Collect address of operand
- Read: Collect operand
- Execute: Perform operation
Note: Protected Pipeline Limits Micro-Architectural Flexibility and Performance
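The timing diagram can be reproduced with a small model of a six-stage pipeline. This sketch assumes one instruction enters per cycle and nothing stalls, which is the idealized case the diagram depicts; the names `stage_at` and `total_cycles` are hypothetical, and both instruction and cycle indices are 0-based.

```c
#define C54X_STAGES 6

/* Stage letters in order: Prefetch, Fetch, Decode, Access, Read, Execute. */
static const char STAGE_NAMES[C54X_STAGES] = {'P', 'F', 'D', 'A', 'R', 'X'};

/* Stage occupied by instruction `instr` during cycle `cycle`,
 * or '-' if that instruction is not in the pipeline in that cycle. */
char stage_at(unsigned instr, unsigned cycle)
{
    if (cycle < instr || cycle - instr >= C54X_STAGES)
        return '-';
    return STAGE_NAMES[cycle - instr];
}

/* Cycles needed to complete n instructions with no stalls:
 * the first takes C54X_STAGES cycles, each later one adds a cycle. */
unsigned total_cycles(unsigned n)
{
    return n == 0 ? 0 : C54X_STAGES + n - 1;
}
```

In cycle 5 the pipeline is fully loaded: instruction 0 is in X while instruction 5 is in P, and six back-to-back instructions finish in 11 cycles.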

TMS320C6xx
[Figure: 'C6xx CPU core block diagram. Program Fetch, Instruction Dispatch, and Instruction Decode feed Data Path 1 (A Register File; units L1, S1, M1, D1) and Data Path 2 (B Register File; units L2, S2, M2, D2), alongside Control Registers, Control Logic, Emulation, Test, and Interrupts. Legend: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit.]

TMS320C6xx Exposed Pipeline
- Fetch
  - PG: Program Address Generate
  - PS: Program Address Send
  - PW: Program Access Ready Wait
  - PR: Program Fetch Packet Receive
- Decode
  - DP: Instruction Dispatch
  - DC: Instruction Decode
- Execute
  - E1 - E5: Execute 1 through Execute 5
[Figure: pipeline timing diagram showing Execute Packets 1 through 7, each passing through PG, PS, PW, PR, DP, DC, and E1 - E5.]
Note: Exposed Pipeline Adds Risk to Programming Model
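The risk in an exposed pipeline is that hardware does not stall a reader whose operand is still in flight, so the programmer or compiler has to count delay slots. The sketch below is a toy model of that behavior, not a description of the 'C6xx itself: a register write becomes visible only after a given number of delay slots, and reads in the meantime silently return the old value. The struct and function names are hypothetical, and only one write may be in flight at a time.

```c
#include <stdint.h>

/* A register whose pending write lands after a fixed number of cycles. */
struct delayed_reg {
    int32_t value;       /* value visible to readers right now */
    int32_t pending;     /* value still in flight */
    int cycles_left;     /* cycles until pending lands; 0 = nothing in flight */
};

/* Issue a write whose result is hidden for `delay_slots` following cycles. */
void issue_write(struct delayed_reg *r, int32_t v, int delay_slots)
{
    r->pending = v;
    r->cycles_left = delay_slots + 1;
}

/* Advance one cycle; commit the pending write once its latency elapses. */
void tick(struct delayed_reg *r)
{
    if (r->cycles_left > 0 && --r->cycles_left == 0)
        r->value = r->pending;
}

/* A read returns whatever is visible now; nothing stalls the reader. */
int32_t read_reg(const struct delayed_reg *r)
{
    return r->value;
}
```

With a write of 42 issued over an old value of 7 and four delay slots (a number picked for illustration), reads in the next four cycles still return 7 and the read in the fifth cycle returns 42. A protected pipeline would stall the early reads instead.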

Micro-Architectural Challenges
- Accessing (Flat) On-Chip Memory At Speed Within 2-3 Cycles
- Feeding Multiple Functional Units From a Single Register File
- Running 600 MHz+ with a 7-9 Stage Pipeline
- Linking Multiple Functional Units with Result Forwarding
- Implementing CISC Data-Path to Meet Area and Performance Goals
- Achieving ARM-Like Code Density

DSP Systems
- Wireless Infrastructure: TMS320C… (… MHz), Viterbi and Turbo hardware accelerators
- Wired Infrastructure: TMS320C5561, 6 DSP 300 MHz, 24 Mb (3 MB) integrated memory, 180M transistors
- Wireless Client: OMAP5910, DSP+GPP, low power consumption, voice, data, video
- Digital Still Camera: TMS320DM310, DSP+GPP, imaging accelerators
- Performance Audio: TMS320DA610, 225 MHz, floating point

VoIP Platform
- TNETV3010 Features
  - 6 C55x at 300 MHz
  - Shared Instruction Memory
  - Broadcast DMA
  - 24M Bits of On-Chip SRAM

OMAP Platform
- OMAP2420 Features
  - ARM at 330 MHz, VFP (Vector Floating Point), 32K/32K I/D cache
  - … 220 MHz
  - 2D/3D graphics accelerator
  - IVA supports still images to >4 Mpixels, 30 fps VGA video decode
  - Output to TV for gaming and video playback
  - Encryption hardware for DRM and security
[Figure: OMAP2420 block diagram: ARM11 + VFP, 2D/3D Graphics Accelerator, Imaging & Video Accelerator (IVA), TMS320C55x DSP, Security, Internal SRAM, Memory Controller, Camera I/F, LCD I/F, Video Out, and Peripherals, connected by the L3 and L4 interconnects.]

IBM Cell Architecture
Design Features:
- Multi-Core Architecture
- Based on the Power Architecture
- Code compatibility
- Coherent and cooperative off-load processing
- Enhanced SIMD architecture
- Power efficiency improved
- "Absolute timers" allow "hard" real-time data processing
- Good estimation of execution time is possible
- Big-endian memory
- Support Apple, but not Intel
- Isolation mechanism for secure code execution

FlexIO

DSP Architecture
SPE (Synergistic Processing Element):
- Dual issue, 128-bit 4-way SIMD
- Vector Processing
- 4 Integer Units + 4 FP Units
- 8-, 16-, 32-bit Integer + 32-, 64-bit FP
- 128 x 128-bit Registers
- 256KB Local-Store Memory (specially designed)
- Caches are not used
- Data & Instruction in LS
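"128-bit 4-way SIMD" means a single instruction operates on four 32-bit lanes at once. The sketch below models that in portable C with a multiply-add across four float lanes. The type `vec4f` and function `vec4f_madd` are hypothetical names for illustration and are not the SPE programming interface; the loop stands in for what the hardware does in one operation.

```c
/* A 128-bit vector viewed as four 32-bit floating-point lanes. */
typedef struct {
    float lane[4];
} vec4f;

/* Lane-wise multiply-add: r[i] = a[i] * b[i] + c[i] for all four lanes.
 * A 4-way SIMD unit performs the four lane operations together. */
vec4f vec4f_madd(vec4f a, vec4f b, vec4f c)
{
    vec4f r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}
```

With a = {1, 2, 3, 4}, b = {2, 2, 2, 2}, and c = {1, 1, 1, 1}, the result is {3, 5, 7, 9}.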