Survey of Digital Signal Processors Michael Warner ECD: VLSI Communication Systems.

Agenda
• Industry Trends
• DSP Architecture
• DSP Micro-Architecture
• DSP Systems

Moore’s Law Drives Processor Development
[Figure: transistors per die (10^0 to 10^10, log scale) versus year (’60 to ’10), showing Moore's 1965 data and the microprocessors 4004, 8008, 8080, 8086, 80286, 386™, 486™, Pentium®, Pentium® II, Pentium® III, Pentium® 4, Itanium® and Itanium® 2. Source: Intel internal]
• Doubling the number of transistors every 18-24 months at the same price point drives significant product opportunities
• …especially if you have little regard for power
• But what if energy-delay had to be reduced by an order of magnitude every generation?

Gene’s Law Drives DSP Development
[Figure: Gene’s Law, DSP power in mW/MIPS (1,000 down to 0.00001, log scale) versus year (1982 to 2008)]
Gene’s Law will have its challenges to hold the line!

What’s Driving Gene’s Law?
• Digital Audio
  - MP3
  - Real Audio
• Streaming Video
  - MPEG 4
  - H.263
• Connectivity
  - Internet
  - Bluetooth
• Modem Standards
  - UMTS
  - GSM
[Figure: screen text: TXN 160 + 4, UPX 12 3/4, BuyNow? Yes No]

DSP Design Constraints: Device Capabilities

Year   Technology (µm)   Transistors   MIPS    RAM (bytes)   Power (mW/MIPS)   Price/MIPS
1982   3                 50K           5       256           250               $30.00
1992   0.8               500K          40      2K            12.5              $0.38
2002   0.1               180M          5,000   3M            0.1               $0.02
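To make the scaling in the device-capabilities figures concrete, the sketch below computes the decade-over-decade improvement factors from the slide's own numbers. The dictionary layout and the helper name `improvement` are illustrative choices, not part of the original slides.

```python
# Device capabilities copied from the slide (power in mW/MIPS, price in $/MIPS).
DEVICES = {
    1982: {"mips": 5, "mw_per_mips": 250.0, "price_per_mips": 30.00},
    1992: {"mips": 40, "mw_per_mips": 12.5, "price_per_mips": 0.38},
    2002: {"mips": 5000, "mw_per_mips": 0.1, "price_per_mips": 0.02},
}

def improvement(metric, start, end):
    """Return the ratio (larger value / smaller value) of `metric` between two years."""
    a, b = DEVICES[start][metric], DEVICES[end][metric]
    return max(a, b) / min(a, b)

for start, end in [(1982, 1992), (1992, 2002)]:
    print(
        f"{start} -> {end}:",
        f"MIPS x{improvement('mips', start, end):.0f},",
        f"power/MIPS x{improvement('mw_per_mips', start, end):.0f},",
        f"price/MIPS x{improvement('price_per_mips', start, end):.0f}",
    )
```

On these figures, power per MIPS improves by a factor of 20 in the first decade and 125 in the second, which is the kind of trend Gene's Law describes.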

What Makes a DSP a DSP?

Hard Real-Time
• Single-Cycle MAC
• Multiple Execution Units
• Custom Data Path
• High Bandwidth (Flat) Memory Sub-Systems
• Dual Access Memory
• Efficient Zero-Overhead Looping
• Short Pipeline
• High Bandwidth I/O
• Specialized Instruction Sets
• Low Latency Interrupts
• Sophisticated DMA
• No Speculation
• RTOS

Soft Real-Time (Application Processor)
• Single-Cycle MAC
• Multiple Execution Units
• Custom Data Path
• L1D$, L1I$, L2$ with MMU
• Speculative Fetching and Branching
• Virtual Memory
• Protected Memory
• Virtual Machines
• Semaphores
• Context Save and Restore
• Threading: SMT, IMT
• Efficient Zero-Overhead Looping
• Short Pipeline
• High Bandwidth I/O
• Specialized Instruction Sets
• Low Latency Interrupts
• Sophisticated DMA
• O/S

Single Cycle MAC
• MACs Typically Determine DSP Performance and Pipeline Length (EX)
• Most DSPs Have 2-8 MAC Units
• MACs Typically Operate in Both a Scalar and Vector Mode
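As an illustration of the multiply-accumulate operation this slide refers to, here is a minimal sketch of a FIR-style dot product computed first with one MAC per step (scalar mode) and then with two accumulators standing in for two MAC units working in parallel. The function names and the choice of two units are assumptions for illustration; the slides do not describe any particular device's MAC arrangement.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: acc + a * b."""
    return acc + a * b

def fir_scalar(coeffs, samples):
    """Scalar mode: one MAC per step, one tap at a time."""
    acc = 0
    for c, x in zip(coeffs, samples):
        acc = mac(acc, c, x)
    return acc

def fir_dual_mac(coeffs, samples):
    """Two hypothetical MAC units: unit 0 takes even taps, unit 1 takes odd taps."""
    acc0 = acc1 = 0
    for i in range(0, len(coeffs) - 1, 2):
        acc0 = mac(acc0, coeffs[i], samples[i])          # MAC unit 0
        acc1 = mac(acc1, coeffs[i + 1], samples[i + 1])  # MAC unit 1
    if len(coeffs) % 2:  # leftover tap when the number of taps is odd
        acc0 = mac(acc0, coeffs[-1], samples[-1])
    return acc0 + acc1   # combine the partial sums once at the end

print(fir_scalar([1, 2, 3, 4], [5, 6, 7, 8]))
print(fir_dual_mac([1, 2, 3, 4], [5, 6, 7, 8]))
```

Both versions produce the same sum; the dual-accumulator form simply halves the number of loop iterations, which is the benefit extra MAC units are meant to deliver.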

Multiple Instruction Units
• VLIW Architectures Driving ILP
• Typical Instruction Units
  - M-Unit: MAC
  - S-Unit: Shift
  - L-Unit: ALU
  - D-Unit: Load/Store
• Industry Has Converged on an ILP of ~8
[Figure: two data paths; units L1, S1, M1, D1 with Registers A0 - A15 and units L2, S2, M2, D2 with Registers B0 - B15, cross paths 1X and 2X, and load data inputs DDATA_I1 and DDATA_I2]
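A rough sketch of what an ILP of about 8 means in practice: with two copies each of the L, S, M and D units, up to eight operations can issue together as long as each needs a different unit. The greedy in-order packer below is a toy model only. It assumes all operations are independent (no data dependencies), and the function name `pack_packets` and the operation names are hypothetical.

```python
# Eight functional units: two each of L (ALU), S (shift), M (MAC), D (load/store).
UNITS = ["L1", "S1", "M1", "D1", "L2", "S2", "M2", "D2"]

def pack_packets(ops):
    """Group (name, unit_kind) ops, in order, into packets with at most one op per unit.

    Toy model: ops are assumed independent, so only unit availability matters.
    """
    packets = []
    current = {}
    for name, kind in ops:
        free = [u for u in UNITS if u[0] == kind and u not in current]
        if not free:                 # no unit of this kind left: start a new packet
            packets.append(current)
            current = {}
            free = [u for u in UNITS if u[0] == kind]
        current[free[0]] = name
    if current:
        packets.append(current)
    return packets

ops = [("mpy_a", "M"), ("mpy_b", "M"), ("add_a", "L"), ("mpy_c", "M")]
for i, packet in enumerate(pack_packets(ops), start=1):
    print(f"packet {i}: {packet}")
```

A real VLIW compiler also has to respect data dependencies, register-file cross paths and instruction latencies, which is a large part of why hand-written assembly for these machines is hard.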

High Bandwidth Memory Sub-Systems
• Multiple Load-Store Units Required to Feed Data Path
• Tightly Coupled Memory is Typically Dual Ported
• Harvard Architecture is Heavily Banked
[Figure: block diagram with external memory, internal memory and MUXes feeding a Central Arithmetic Logic Unit; labels: P, ALU, SHIFTER, B, MAC, A, PC, CNTL, E, C, D, ARs]
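To show why banking matters when several load-store units access memory in the same cycle, here is a minimal sketch of bank-conflict counting. It assumes low-order word interleaving, single-ported banks, four banks and 16-bit words; all of these parameters and both function names are illustrative assumptions rather than details of any device in the slides.

```python
def bank_of(address, num_banks=4, word_bytes=2):
    """Map a byte address to a bank using low-order word interleaving."""
    return (address // word_bytes) % num_banks

def cycles_for_accesses(addresses, num_banks=4, word_bytes=2):
    """Cycles needed if all accesses issue together and each bank serves one per cycle."""
    per_bank = {}
    for address in addresses:
        bank = bank_of(address, num_banks, word_bytes)
        per_bank[bank] = per_bank.get(bank, 0) + 1
    # The busiest bank determines how long the whole group takes.
    return max(per_bank.values(), default=0)

print(cycles_for_accesses([0, 2]))  # adjacent words fall in different banks
print(cycles_for_accesses([0, 8]))  # words 0 and 4 fall in the same bank
```

Under these assumptions, two accesses to adjacent words complete together, while two accesses that map to the same bank are serialized.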

Specialized Instruction Sets
• Base RISC ISA Plus CISC ISA Driven by End Application
  - MAC
  - SAD
  - LMS
  - FIRS
  - Viterbi
• Support For Both Scalar and Vector Instructions
• Support For 8, 16 and 32-Bit Instructions
• Instructions are Highly Orthogonal
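As an example of the kind of kernel that motivates an application-specific instruction, the sketch below implements SAD (sum of absolute differences) in plain code and uses it for a one-dimensional best-match search of the sort found in video motion estimation. The function names and the sample data are hypothetical; a DSP with a SAD instruction would compute the inner sum in hardware.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-length blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def best_match(reference, target):
    """Return (offset, cost) of the window in `reference` closest to `target`."""
    n = len(target)
    best = None
    for offset in range(len(reference) - n + 1):
        cost = sad(reference[offset:offset + n], target)
        if best is None or cost < best[1]:
            best = (offset, cost)
    return best

reference = [10, 12, 50, 52, 54, 11]
target = [50, 52, 55]
print(best_match(reference, target))
```

The search evaluates one SAD per candidate offset, so the cost of the inner operation dominates, which is why it is a natural candidate for a dedicated instruction.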

Scalar (55x) vs VLIW (64x)
• Scalar DSPs Tend to be More CISC-Like
  - Hurts Compiler Performance
  - Improves Energy-Delay
  - Improves Code Density
  - Limits Top End Performance
• VLIW DSPs Tend to be More RISC-Like
  - RISC + GP Regs + Orthogonality Makes For a Good C Compiler
  - Assembler Code Is Challenging
  - RISC ISA Allows for Higher Frequencies
  - Load-Store Hurts Energy-Delay

TMS320C54x

TMS320C54x Protected Pipeline
[Figure: fully loaded pipeline over successive cycles; instructions 1 to 6 each pass through stages P, F, D, A, R, X, each instruction starting one cycle after the previous one]
• Prefetch (P): Calculate address of instruction
• Fetch (F): Collect instruction
• Decode (D): Interpret instruction
• Access (A): Collect address of operand
• Read (R): Collect operand
• Execute (X): Perform operation
Note: Protected Pipeline Limits Micro-Architectural Flexibility and Performance
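The six-stage pipeline described above can be sketched as a simple occupancy table. This is an idealized model that assumes one instruction enters per cycle and nothing stalls; the function name `pipeline_table` is an illustrative choice.

```python
# Stage letters: Prefetch, Fetch, Decode, Access, Read, Execute.
STAGES = ["P", "F", "D", "A", "R", "X"]

def pipeline_table(num_instructions):
    """Return one dict per cycle mapping stage letter -> instruction number (or None)."""
    total_cycles = num_instructions + len(STAGES) - 1
    table = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            instr = cycle - depth + 1  # instruction occupying this stage, 1-based
            row[stage] = instr if 1 <= instr <= num_instructions else None
        table.append(row)
    return table

for cycle, row in enumerate(pipeline_table(6), start=1):
    cells = [f"{stage}{n}" if n else "--" for stage, n in row.items()]
    print(f"cycle {cycle:2d}: " + " ".join(cells))
```

With six instructions, the sixth cycle is the fully loaded one: every stage holds a different instruction.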

TMS320C6xx
[Figure: ’C6xx CPU core block diagram]
• Program Fetch, Instruction Dispatch, Instruction Decode
• Data Path 1: units L1, S1, M1, D1 with A Register File
• Data Path 2: units L2, S2, M2, D2 with B Register File
• Control Registers, Control Logic, Emulation, Test, Interrupts
• Unit labels shown: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit

TMS320C6xx Exposed Pipeline
• Fetch
  - PG: Program Address Generate
  - PS: Program Address Send
  - PW: Program Access Ready Wait
  - PR: Program Fetch Packet Receive
• Decode
  - DP: Instruction Dispatch
  - DC: Instruction Decode
• Execute
  - E1 - E5: Execute 1 through Execute 5
[Figure: Execute Packets 1 to 7, each passing through stages PG, PS, PW, PR, DP, DC, E1, E2, E3, E4, E5]
Note: Exposed Pipeline Adds Risk to Programming Model
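The risk an exposed pipeline adds to the programming model can be sketched with a toy simulator: hardware does not interlock, so an instruction that reads a register before an earlier result has landed silently sees the old value. The latencies, register names and helper `run_exposed` below are hypothetical and are not actual C6xx timings.

```python
def run_exposed(program, regs):
    """Issue one instruction per cycle; a result becomes visible `latency` cycles after issue.

    program: list of (dest, func, sources, latency). No interlocks are modeled.
    """
    regs = dict(regs)
    pending = []  # entries of (cycle_when_visible, dest, value)
    for cycle, (dest, func, sources, latency) in enumerate(program):
        # Commit every result whose latency has elapsed before this instruction reads.
        for item in [p for p in pending if p[0] <= cycle]:
            regs[item[1]] = item[2]
            pending.remove(item)
        value = func(*(regs[s] for s in sources))
        pending.append((cycle + latency, dest, value))
    for _, dest, value in sorted(pending, key=lambda p: p[0]):
        regs[dest] = value  # drain results still in flight at the end
    return regs

# Hypothetical instructions: a multiply with one delay slot and a one-cycle add.
MUL = ("c", lambda a, b: a * b, ("a", "b"), 2)
ADD = ("d", lambda c: c + 1, ("c",), 1)
NOP = ("_", lambda: 0, (), 1)

start = {"a": 3, "b": 4, "c": 0, "d": 0}
print(run_exposed([MUL, ADD], start)["d"])       # ADD reads c too early
print(run_exposed([MUL, NOP, ADD], start)["d"])  # NOP fills the delay slot
```

In the first program the add picks up the stale value of `c`; inserting a NOP (or any independent instruction) in the delay slot gives the intended result. On a protected pipeline the hardware would stall instead, at the cost of the flexibility noted earlier.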

Micro-Architectural Challenges
• Accessing (Flat) On-Chip Memory At Speed Within 2-3 Cycles
• Feeding Multiple Functional Units From a Single Register File
• Running 600 MHz+ with a 7-9 Stage Pipeline
• Linking Multiple Functional Units with Result Forwarding
• Implementing CISC Data-Path to Meet Area and Performance Goals
• Achieving ARM-Like Code Density

DSP Systems
• Wireless Infrastructure: TMS320C6416
  - 600 MHz
  - Viterbi and Turbo hardware accelerators
• Wired Infrastructure: TMS320C5561
  - 6 DSP CPU @ 300 MHz
  - 24 Mb (3 MB) integrated memory
  - 180M transistors
• Wireless Client: OMAP5910
  - DSP+GPP
  - Low power consumption
  - Voice, data, video
• Digital Still Camera: TMS320DM310
  - DSP+GPP
  - Imaging accelerators
• Performance Audio: TMS320DA610
  - 225 MHz
  - Floating point

VoIP Platform
• TNETV3010 Features
  - 6 C55x DSP @ 300 MHz
  - Shared Instruction Memory
  - Broadcast DMA
  - 24 Mbits of On-Chip SRAM

OMAP Platform
• OMAP2420 Features
  - ARM 1136 @ 330 MHz, VFP (Vector Floating Point), 32K/32K I/D cache
  - DSP @ 220 MHz
  - 2D/3D graphics accelerator
  - IVA supports still images to >4 Mpixels, 30 fps VGA video decode
  - Output to TV for gaming and video playback
  - Encryption hardware for DRM and security
[Figure: OMAP2420 block diagram: ARM11 + VFP, TMS320C55x DSP, 2D/3D Graphics Accelerator, Imaging & Video Accelerator (IVA), Camera I/F, LCD I/F, Video Out, Memory Controller, Internal SRAM, Security, Peripherals, L3 Interconnect, L4 Interconnect]

IBM Cell Architecture
Design Features:
• Multi-Core Architecture
• Based on the Power Architecture
• Code compatibility
• Coherent and cooperative off-load processing
• Enhanced SIMD architecture
• Power efficiency improved
• "Absolute timers" allow "hard" real-time data processing
• Good estimation of execution time is possible
• Big-endian memory
• Supports Apple, but not Intel
• Isolation mechanism for secure code execution

FlexIO

DSP Architecture
SPE (Synergistic Processing Element):
• Dual issue, 128-bit 4-way SIMD
• Vector Processing
• 4 Integer Units + 4 FP Units
• 8-, 16-, 32-bit Integer + 32-, 64-bit FP
• 128 x 128-bit Registers
• 256KB Local-Store Memory (specially designed)
• Caches are not used
• Data & Instruction in LS
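To illustrate what "128-bit 4-way SIMD" means, the sketch below treats one 128-bit register as four independent 32-bit integer lanes and adds two such registers lane by lane. The lane ordering (lane 0 in the low bits), the wraparound behavior and the function names are assumptions made for this example, not a description of the SPE instruction set.

```python
LANES = 4
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1

def pack(lanes):
    """Pack four unsigned 32-bit values into one 128-bit integer (lane 0 in the low bits)."""
    assert len(lanes) == LANES
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & LANE_MASK) << (i * LANE_BITS)
    return word

def unpack(word):
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def simd_add(x, y):
    """Lane-wise add with wraparound; a carry never crosses into the next lane."""
    return pack([(a + b) & LANE_MASK for a, b in zip(unpack(x), unpack(y))])

x = pack([1, 2, 3, 0xFFFFFFFF])
y = pack([10, 20, 30, 1])
print(unpack(simd_add(x, y)))
```

The last lane overflows and wraps to zero without disturbing its neighbors, which is the property that distinguishes a SIMD add from a plain 128-bit add.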
