Survey of Digital Signal Processors
Michael Warner
ECD: VLSI Communication Systems
Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems
Moore’s Law Drives Processor Development
[Figure: transistors per die versus year, 1960 to 2010. Intel microprocessor data points (486, Pentium, Pentium II, Pentium III, Pentium 4, Itanium) plotted against the Moore’s Law trend. Source: Intel internal]
- Doubling the number of transistors every generation at the same price point drives significant product opportunities, especially if you have little regard for power.
- But what if energy-delay had to be reduced by an order of magnitude every generation?
Gene’s Law Drives DSP Development
[Figure: Gene’s Law, DSP power (mW/MIPS) versus year]
- Gene’s Law will have its challenges to hold the line!
What’s Driving Gene’s Law?
- Digital Audio: MP3, Real Audio
- Streaming Video: MPEG-4, H.263
- Connectivity: Internet, Bluetooth
- Modem Standards: UMTS, GSM
DSP Design Constraints
DEVICE CAPABILITIES
Technology (uM)    3      …      …
Transistors        50K    …K     …M
MIPS               …      40     5,000
RAM (bytes)        …      2K     3M
Power (mW/MIPS)    …      12.5   0.1
Price/MIPS         $…     $…     $…
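The two power figures that survive in the table (12.5 and 0.1 mW/MIPS) can be turned into a rough feel for the trend Gene’s Law describes. The sketch below counts how many halvings separate them. The time span between the two devices is not given in the source, so `span_years=10` is purely a hypothetical input, and the function names are illustrative.

```python
import math

def halvings(p_start, p_end):
    """How many times power per MIPS halves going from p_start to p_end."""
    return math.log2(p_start / p_end)

def years_per_halving(p_start, p_end, span_years):
    """Average halving period over an assumed time span."""
    return span_years / halvings(p_start, p_end)

# 12.5 and 0.1 mW/MIPS are the two power figures that survive in the table.
n = halvings(12.5, 0.1)
# ASSUMPTION: a 10-year gap between the two devices (not stated in the source).
period = years_per_halving(12.5, 0.1, span_years=10)
print(f"{n:.2f} halvings, about one every {period:.2f} years")
```

A 125x reduction is roughly seven halvings, so the implied halving period depends entirely on the span you assume.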
What Makes a DSP a DSP?
Hard Real-Time
- Single-Cycle MAC
- Multiple Execution Units
- Custom Data Path
- High Bandwidth (Flat) Memory Sub-Systems
- Dual Access Memory
- Efficient Zero-Overhead Looping
- Short Pipeline
- High Bandwidth I/O
- Specialized Instruction Sets
- Low Latency Interrupts
- Sophisticated DMA
- No Speculation
- RTOS
Soft Real-Time (Application Processor)
- Single-Cycle MAC
- Multiple Execution Units
- Custom Data Path
- L1D$, L1I$, L2$ with MMU
- Speculative Fetching and Branching
- Virtual Memory
- Protected Memory
- Virtual Machines
- Semaphores
- Context Save and Restore
- Threading: SMT, IMT
- Efficient Zero-Overhead Looping
- Short Pipeline
- High Bandwidth I/O
- Specialized Instruction Sets
- Low Latency Interrupts
- Sophisticated DMA
- O/S
Single Cycle MAC
- MACs Typically Determine DSP Performance and Pipeline Length (EX)
- Most DSPs Have 2-8 MAC Units
- MACs Typically Operate in Both a Scalar and Vector Mode
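To make the MAC concrete, here is a minimal sketch of one FIR output sample computed as a chain of multiply-accumulates, first with a single accumulator and then with two accumulators standing in for two MAC units working side by side. The function names and the two-unit split are illustrative assumptions, not a model of any particular device.

```python
def fir_scalar(samples, coeffs):
    """One output sample: one multiply-accumulate per tap."""
    acc = 0
    for x, h in zip(samples, coeffs):
        acc += x * h                      # one MAC
    return acc

def fir_dual_mac(samples, coeffs):
    """Same sum, split across two accumulators (two hypothetical MAC units)."""
    n = len(coeffs)
    acc0 = acc1 = 0
    for i in range(0, n - 1, 2):
        acc0 += samples[i] * coeffs[i]            # MAC unit 0
        acc1 += samples[i + 1] * coeffs[i + 1]    # MAC unit 1
    if n % 2:                                     # odd tap count: one leftover MAC
        acc0 += samples[n - 1] * coeffs[n - 1]
    return acc0 + acc1

x = [1, 2, 3, 4, 5]
h = [5, 4, 3, 2, 1]
print(fir_scalar(x, h), fir_dual_mac(x, h))
```

With two units the loop runs about half as many iterations, which is the sense in which the MAC count sets performance.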
Multiple Instruction Units
- VLIW Architectures Driving ILP
- Typical Instruction Units:
  - M-Unit: MAC
  - S-Unit: Shift
  - L-Unit: ALU
  - D-Unit: Load/Store
- Industry Has Converged on an ILP of ~8
[Figure: dual data path block diagram. Units L1, S1, M1, D1 on registers A0 - A15; units L2, S2, M2, D2 on registers B0 - B15; cross paths 1X and 2X; load data paths DDATA_I1 and DDATA_I2]
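One way to picture an ILP of about 8 is a packer that groups operations into execute packets, limited by how many units of each type exist. The sketch below assumes two data paths with one L, S, M and D unit each, assumes all operations are independent, and packs greedily in program order. Real VLIW scheduling also tracks data dependencies and cross-path limits, which this deliberately ignores; the operation names are made up.

```python
UNITS_PER_TYPE = 2   # assumption: two data paths, each with one L, S, M and D unit

def pack(ops):
    """Greedily pack (name, unit_type) ops into execute packets, in program order."""
    packets, current, used = [], [], {}
    for name, unit in ops:
        if used.get(unit, 0) == UNITS_PER_TYPE:   # that unit type is full
            packets.append(current)               # close the packet, start a new one
            current, used = [], {}
        current.append(name)
        used[unit] = used.get(unit, 0) + 1
    if current:
        packets.append(current)
    return packets

ops = [("ld1", "D"), ("ld2", "D"), ("mpy1", "M"), ("mpy2", "M"),
       ("add1", "L"), ("ld3", "D"), ("shl1", "S")]
for i, packet in enumerate(pack(ops), start=1):
    print(f"packet {i}: {packet}")
```

The third load cannot share a packet with the first two because only two D units exist, so it opens a second packet.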
High Bandwidth Memory Sub-Systems
- Multiple Load-Store Units Required to Feed Data Path
- Tightly Coupled Memory is Typically Dual Ported
- Harvard Architecture is Heavily Banked
[Figure: central arithmetic logic unit block diagram. External memory and internal memory connect through muxes to the ALU, shifter, MAC, A, B, ARs, PC and CNTL over buses P, C, D and E]
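Banking is one way to let several load-store units hit memory in the same cycle. The sketch below assumes word-interleaved, single-ported banks (bank = address mod number of banks) and flags a conflict when two same-cycle accesses land in the same bank. The interleaving scheme and bank count are assumptions for illustration; the source does not specify them.

```python
def bank_of(addr, num_banks):
    """Word-interleaved mapping: consecutive addresses fall in consecutive banks."""
    return addr % num_banks

def has_conflict(addrs, num_banks):
    """True if any two same-cycle accesses map to the same single-ported bank."""
    banks = [bank_of(a, num_banks) for a in addrs]
    return len(set(banks)) < len(banks)

print(has_conflict([0, 1], num_banks=4))   # neighbouring words: different banks
print(has_conflict([0, 4], num_banks=4))   # stride equal to bank count: same bank
```

A dual-ported memory avoids this particular stall for two accesses, at the cost of a larger cell.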
Specialized Instruction Sets
- Base RISC ISA Plus CISC ISA Driven by End Application: MAC, SAD, LMS, FIRS, Viterbi
- Support For Both Scalar and Vector Instructions
- Support For 8, 16 and 32-Bit Instructions
- Instructions are Highly Orthogonal
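As an illustration of why an application-driven instruction such as a symmetric FIR operation pays off, the sketch below compares a direct FIR sum with a version that exploits coefficient symmetry: samples that share a coefficient are added first, so each coefficient is multiplied once. This is a plain-Python model of the arithmetic only, assuming an even number of taps; it does not reproduce any instruction's exact operands or encoding.

```python
def fir_direct(x, h):
    """Direct form: one multiply per tap."""
    return sum(xi * hi for xi, hi in zip(x, h))

def fir_symmetric(x, h_half):
    """Symmetric form: h[k] == h[N-1-k], so pre-add the paired samples
    and multiply once per coefficient pair (assumes len(x) == 2 * len(h_half))."""
    n = len(x)
    return sum((x[k] + x[n - 1 - k]) * h_half[k] for k in range(len(h_half)))

h_half = [1, 2, 3]
h = h_half + h_half[::-1]          # full symmetric coefficient set
x = [4, 5, 6, 7, 8, 9]
print(fir_direct(x, h), fir_symmetric(x, h_half))
```

Both forms give the same result, but the symmetric one uses half as many multiplies.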
Scalar (55x) vs VLIW (64x)
Scalar DSPs Tend to be More CISC Like
- Hurts Compiler Performance
- Improves Energy-Delay
- Improves Code Density
- Limits Top End Performance
VLIW DSPs Tend to be More RISC Like
- RISC + GP Regs + Orthogonality Makes For a Good C Compiler
- Assembler Code Is Challenging
- RISC ISA Allows for Higher Frequencies
- Load-Store Hurts Energy-Delay
TMS320C54x
TMS320C54x Protected Pipeline
- Prefetch (P): Calculate address of instruction
- Fetch (F): Collect instruction
- Decode (D): Interpret instruction
- Access (A): Collect address of operand
- Read (R): Collect operand
- Execute (X): Perform operation
CYCLES   1   2   3   4   5   6   7   8   9   10  11
         P1  F1  D1  A1  R1  X1
             P2  F2  D2  A2  R2  X2
                 P3  F3  D3  A3  R3  X3
                     P4  F4  D4  A4  R4  X4
                         P5  F5  D5  A5  R5  X5
                             P6  F6  D6  A6  R6  X6
Fully loaded pipeline: cycle 6
Note: Protected Pipeline Limits Micro-Architectural Flexibility and Performance
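The staggered pattern of the six stages can be generated mechanically. The sketch below assumes an ideal pipeline with no stalls, where instruction i occupies stage s in cycle i + s, and reports the cycles in which all six stages are busy. The function names are illustrative.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]   # Prefetch, Fetch, Decode, Access, Read, Execute

def pipeline_table(num_instructions):
    """Rows are instructions, columns are cycles (no stalls assumed)."""
    cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage}{i + 1}"
        rows.append(row)
    return rows

def fully_loaded_cycles(rows):
    """1-based cycle numbers in which every pipeline stage is occupied."""
    return [c + 1 for c in range(len(rows[0]))
            if sum(1 for r in rows if r[c]) == len(STAGES)]

table = pipeline_table(6)
for row in table:
    print(" ".join(f"{cell:3}" for cell in row))
print("fully loaded in cycles:", fully_loaded_cycles(table))
```

With six instructions the pipeline is full only in cycle 6; a longer instruction stream keeps it full for every cycle after that until the stream drains.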
TMS320C6xx
[Figure: ’C6xx CPU core block diagram]
- Program Fetch, Instruction Dispatch, Instruction Decode
- Data Path 1: A Register File; units L1, S1, M1, D1
- Data Path 2: B Register File; units L2, S2, M2, D2
- Control Registers, Control Logic, Interrupts, Emulation, Test
- Legend: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit
TMS320C6xx Exposed Pipeline
Fetch
- PG: Program Address Generate
- PS: Program Address Send
- PW: Program Access Ready Wait
- PR: Program Fetch Packet Receive
Decode
- DP: Instruction Dispatch
- DC: Instruction Decode
Execute
- E1 - E5: Execute 1 through Execute 5
[Figure: Execute Packets 1 through 7, each passing through PG PS PW PR DP DC E1 E2 E3 E4 E5, staggered across cycles]
Note: Exposed Pipeline Adds Risk to Programming Model
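The risk in an exposed pipeline is that hardware does not interlock: an instruction issued before an earlier result has landed simply reads the old register value. The toy simulator below illustrates that with assumed latencies (a multiply with one delay slot, an add with none). The instruction format, register names and latency table are hypothetical and chosen only for this sketch.

```python
DELAY_SLOTS = {"add": 0, "mpy": 1}   # assumed latencies for this sketch

def run(program, regs):
    """Issue one instruction per cycle with no interlocks.
    program entries are (op, dst, src_a, src_b), or None for a NOP."""
    regs = dict(regs)
    pending = []                      # (cycle the result becomes visible, dst, value)
    cycle = 0
    while cycle < len(program) or pending:
        for when, dst, value in pending:
            if when == cycle:         # delay slots elapsed: result lands
                regs[dst] = value
        pending = [p for p in pending if p[0] > cycle]
        if cycle < len(program) and program[cycle] is not None:
            op, dst, a, b = program[cycle]
            value = regs[a] * regs[b] if op == "mpy" else regs[a] + regs[b]
            pending.append((cycle + 1 + DELAY_SLOTS[op], dst, value))
        cycle += 1
    return regs

start = {"r0": 3, "r1": 4, "r2": 100, "r3": 0}
hazard = [("mpy", "r2", "r0", "r1"), ("add", "r3", "r2", "r2")]
safe = [("mpy", "r2", "r0", "r1"), None, ("add", "r3", "r2", "r2")]
print(run(hazard, start)["r3"], run(safe, start)["r3"])
```

In the first program the add runs in the multiply's delay slot and sums the stale `r2`; inserting a NOP (or useful independent work) gives the intended result. A protected pipeline would stall instead, which is the trade the two notes on these slides describe.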
Micro-Architectural Challenges
- Accessing (Flat) On Chip Memory At Speed Within 2-3 Cycles
- Feeding Multiple Functional Units From a Single Register File
- Running 600 MHz+ with a 7-9 Stage Pipeline
- Linking Multiple Functional Units with Result Forwarding
- Implementing CISC Data-path to Meet Area and Performance Goals
- Achieving ARM Like Code Density
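The first and third challenges combine into a simple timing budget: at 600 MHz a cycle is about 1.67 ns, so a 2-3 cycle memory access has to complete in roughly 3.3 to 5 ns. A small sketch of that arithmetic, with illustrative function names:

```python
def cycle_time_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz

def access_budget_ns(freq_mhz, cycles):
    """Time available for a memory access that must finish within `cycles` cycles."""
    return cycles * cycle_time_ns(freq_mhz)

for cycles in (2, 3):
    print(f"{cycles} cycles at 600 MHz: {access_budget_ns(600, cycles):.2f} ns")
```

Raising the clock shrinks the budget proportionally, which is why flat on-chip memory becomes harder to reach at speed.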
DSP Systems
- Wireless Infrastructure: TMS320C… (… MHz), Viterbi and Turbo hardware accelerators
- Wired Infrastructure: TMS320C5561, 6 DSPs at 300 MHz, 24 Mb (3 MB) integrated memory, 180M transistors
- Wireless Client: OMAP5910, DSP+GPP, low power consumption; voice, data, video
- Digital Still Camera: TMS320DM310, DSP+GPP, imaging accelerators
- Performance Audio: TMS320DA610, 225 MHz, floating point
VoIP Platform
TNETV3010 Features
- 6 C55x DSPs at 300 MHz
- Shared Instruction Memory
- Broadcast DMA
- 24 Mbits of On Chip SRAM
OMAP Platform
OMAP2420 Features
- ARM, 330 MHz, VFP (Vector Floating Point), 32K/32K I/D cache
- … 220 MHz
- 2D/3D graphics accelerator
- IVA supports still images to >4 Mpixels, 30 fps VGA video decode
- Output to TV for gaming and video playback
- Encryption hardware for DRM and security
[Figure: OMAP2420 block diagram. ARM11 + VFP, TMS320C55x DSP, 2D/3D Graphics Accelerator, Imaging & Video Accelerator (IVA), Internal SRAM, Camera I/F, LCD I/F, Video Out, Memory Controller, Peripherals, Security, L3 Interconnect, L4 Interconnect]
IBM Cell Architecture
Design Features:
- Multi-Core Architecture
- Based on the Power Architecture
- Code compatibility
- Coherent and cooperative off-load processing
- Enhanced SIMD architecture
- Power efficiency improved
- "Absolute timers" allow "hard" real-time data processing
- Good estimation of execution time is possible
- Big-endian memory
- Supports Apple, but not Intel
- Isolation mechanism for secure code execution
FlexIO
DSP Architecture
SPE (Synergistic Processing Element):
- Dual issue, 128-bit 4-way SIMD
- Vector Processing: 4 Integer Units + 4 FP Units
- 8-, 16-, 32-bit Integer + 32-, 64-bit FP
- 128 x 128-bit Registers
- 256KB Local-Store Memory (specially designed)
- Caches are not used; Data & Instruction in LS
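"128-bit 4-way SIMD" means one register holds four 32-bit lanes and one instruction operates on all four at once. The sketch below models that in plain Python: a lane-wise add with 32-bit wraparound, plus a helper that packs four lanes into a single 128-bit value. Placing lane 0 in the most significant word is a convention chosen for this sketch, and the function names are illustrative; nothing here is actual SPE code.

```python
MASK32 = 0xFFFFFFFF

def simd_add4(a, b):
    """Lane-wise add of two 4 x 32-bit vectors; each lane wraps at 32 bits."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]

def pack128(lanes):
    """Pack four 32-bit lanes into one 128-bit integer (lane 0 most significant)."""
    value = 0
    for lane in lanes:
        value = (value << 32) | (lane & MASK32)
    return value

a = [1, 2, 3, 0xFFFFFFFF]
b = [10, 20, 30, 1]
result = simd_add4(a, b)
print(result, hex(pack128(result)))
```

The overflow in the last lane wraps to zero without disturbing its neighbours, which is the defining property of lane-wise SIMD arithmetic.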