Motivation
  Mobile embedded systems are present in:
    –Cell phones
    –PDAs
    –MP3 players
    –GPS units
Mobile Computing Design Considerations
  Low power
  Real-time data processing
  Small size
  Low cost
  Quick time to market
Metric Introduction
  Processor specialization
  Instruction set specialization
  Interconnect specialization
  Memory specialization
  Functional & data path units
  Power specialization
Metric: Processor Specialization
  The processor is the central controlling point of the embedded system
  Examples:
    –VLIW, to issue multiple instructions in parallel
    –RISC architecture
Metric: Instruction Set Specialization
  Introduction of new instructions to extract optimal performance from the processor
  Examples:
    –Multiply-accumulate
    –Vector operations
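As an illustration of why multiply-accumulate is a common specialized instruction, the sketch below models one MAC step and uses it for a dot product, the inner loop of many DSP filters. The function names `mac` and `dot_product` are hypothetical and are not taken from any of the architectures discussed.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: returns acc + a * b."""
    return acc + a * b


def dot_product(xs, ys):
    """Dot product written as a chain of MAC steps (one per element pair)."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


print(dot_product([1, 2, 3], [4, 5, 6]))
```

On a processor with a hardware MAC instruction, each loop iteration maps to a single instruction rather than a separate multiply and add.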
Metric: Interconnect
  Provides the means for different modules to communicate
  Optimizations can lead to reduced complexity, cost, and power consumption
Metric: Memory Specialization
  Achieved by optimizing the number and size of memory banks and the number and size of access ports
  Optimizations can improve performance, power consumption, and chip area
Metric: Functional & Data Path Units
  Functional units are often specialized hardware units implementing a frequently used software algorithm
  Examples:
    –DSP co-processors, interrupt priority co-processors, memory access modules, and timer modules
Metric: Power Specialization
  Major concern in mobile systems
  Kept under control by:
    –Using a low supply voltage
    –Using a slow clock speed
    –Custom circuit solutions
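The first two bullets follow from the usual first-order model of CMOS dynamic power, P ≈ α·C·V²·f (activity factor, switched capacitance, supply voltage, clock frequency). The sketch below compares two operating points under that model only. It ignores leakage and assumes α and C are unchanged, and the function name is hypothetical.

```python
def relative_dynamic_power(voltage, freq_hz, ref_voltage, ref_freq_hz):
    """Ratio of dynamic power at (voltage, freq_hz) to a reference point.

    Uses P ~ alpha * C * V^2 * f with alpha and C assumed equal at both
    points, so they cancel out of the ratio.
    """
    return (voltage / ref_voltage) ** 2 * (freq_hz / ref_freq_hz)


# Halving the supply voltage at the same clock cuts dynamic power to a quarter.
print(relative_dynamic_power(0.9, 100e6, 1.8, 100e6))
# Halving the clock at the same voltage only cuts it in half.
print(relative_dynamic_power(1.8, 50e6, 1.8, 100e6))
```

The quadratic dependence on voltage is why low supply voltage appears first in the list.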
Architectures to be discussed
  M*CORE
  D30V/MPEG
  SuperENC
  1.3-GOPS Parallel DSP
  IA-32 w/ Enhanced Data Streaming
M*CORE
  Low power embedded applications
  Wireless mobile devices
  Cellular phones
M*CORE Processor Specialization
  Simple RISC architecture
  4-stage pipeline
  16-bit instruction word length
  Compiler designed in parallel with architecture
  Barrel shifter built into ALU
M*CORE Instruction Set Specialization
  Multimedia instructions
    –Multiple data transfers from memory to register and register to memory
    –Fast register saves
  FF1 – Find First 1
    –Finding highest priority interrupt in hardware
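A find-first-one operation can be modeled in software as follows. This is a sketch under stated assumptions: the search is taken to start at the most significant bit of a 32-bit word and return the count of leading zeros (32 when no bit is set), and bit 31 is assumed to be the highest-priority interrupt. The helper `highest_priority_pending` is hypothetical.

```python
WORD_BITS = 32  # assumed register width


def ff1(word):
    """Number of zero bits above the most significant set bit.

    Returns WORD_BITS when the word is zero (no bit found).
    """
    word &= (1 << WORD_BITS) - 1  # keep only the low 32 bits
    return WORD_BITS - word.bit_length()


def highest_priority_pending(pending_mask):
    """Bit index of the highest-priority pending interrupt, or None.

    Assumes a higher bit index means higher priority.
    """
    leading_zeros = ff1(pending_mask)
    if leading_zeros == WORD_BITS:
        return None
    return WORD_BITS - 1 - leading_zeros


print(highest_priority_pending(0b10100))
```

Without such an instruction, the same result needs a loop or a chain of compare-and-shift steps in the interrupt handler.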
M*CORE Interconnect Specialization
  16-bit data bus to match the 16-bit word length
    –Reduces memory bandwidth, complexity, chip area, and power consumption
  MDI – MCU-to-DSP Interface
    –Dual access memory messaging unit
  General I/O bus for peripherals
M*CORE Memory Specialization
  Alternate register bank
    –Fast register saves for context switches
M*CORE Functional & Data Path Units
  32-channel programmable interrupt controller
  Protocol timer
  DSP core
M*CORE Power Specialization
  1.8 Volts
  Uses 0.5 Watts
  Power-aware pipeline
  Programmable power states
    –Stop
    –Wait
    –Doze
    –Normal
M*CORE Summary
  Low power and programmable power states make it ideal for mobile devices
  Interface to built-in DSP core makes it ideal for cell phone applications
650 MHz IA-32
  Microprocessor designed to accelerate data-streaming applications
    –Three-dimensional graphics
    –Video encode/decode
650 MHz IA-32 Processor Specialization
  IA-32 architecture
  70 new instructions
  SIMD floating point data type
  Improvements in circuit implementation
650 MHz IA-32 Instruction Set Specialization
  70 new instructions
    –SIMD FP operations
    –Control for new 8-entry register file
    –Multimedia extension
  12 new integer instructions
650 MHz IA-32 Interconnect Specialization
  Front Side Bus of 66, 100, or 133 MHz
  Back Side Bus
    –Half the clock frequency for mobile and desktop applications
    –Full clock frequency for server/workstation applications
650 MHz IA-32 Memory Specialization
  3 new non-temporal store instructions with write combining buffers
    –Burst write protocol
    –Write data throughput of Gbytes/sec on a 133 MHz bus
  4 new data pre-fetch instructions
    –Overlap, reduces cache miss penalties
650 MHz IA-32 Functional Specialization
  8-entry register file
    –Reduces register starvation for SIMD unit
    –128 bits wide: four independent single-precision elements packed in parallel
  Dedicated table-based lookup unit for reciprocal operations
    –Completes reciprocal operations in one clock cycle
    –Error of 1.5 * 2^-12
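The idea behind a table-based reciprocal unit can be sketched in software: split the operand into mantissa and exponent, look up an approximate reciprocal of the mantissa, and rescale. The table size (`TABLE_BITS = 11`) and the midpoint-based table contents below are assumptions chosen for illustration. They are not the processor's actual table and are not claimed to reproduce its stated error bound.

```python
import math

TABLE_BITS = 11              # hypothetical table index width
TABLE_SIZE = 1 << TABLE_BITS

# Entry i holds 1/m for the midpoint m of the i-th slice of [0.5, 1).
RECIP_TABLE = [
    1.0 / (0.5 + (i + 0.5) / (2 * TABLE_SIZE)) for i in range(TABLE_SIZE)
]


def approx_reciprocal(x):
    """Approximate 1/x for finite non-zero x with one table lookup."""
    if x == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    # abs(x) == mantissa * 2**exponent, with mantissa in [0.5, 1)
    mantissa, exponent = math.frexp(abs(x))
    index = int((mantissa - 0.5) * 2 * TABLE_SIZE)
    result = math.ldexp(RECIP_TABLE[index], -exponent)
    return math.copysign(result, x)


for value in (0.3, 3.0, -2.0):
    approx = approx_reciprocal(value)
    print(value, approx, abs(approx * value - 1.0))
```

Because the lookup involves no iteration, it trades accuracy for a fixed, short latency; software that needs more precision can refine the result afterwards.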
650 MHz IA-32 Low Power Usage
  1.4 V ~ 2.2 V at 650 MHz, close to room temperature
650 MHz IA-32 Performance
  1.5X to 2.0X performance boost for 3-D transform and lighting kernels
  Real-time MPEG-2 video/audio encoding at 30 frames per second
    –Achieved through improvements to the SIMD unit, at a cost of only a 2% increase in unit area
D30V/MPEG
  Multimedia applications
    –Decoding MPEG-2
D30V/MPEG Processor Specialization
  2-way VLIW
  Dual-issue RISC pipeline
  2 way assigned SIMD module
  Pipeline has ability to re-route data through execution path
D30V/MPEG Instruction Set Specialization
  Saturate and Add
  DSP instructions built in
    –Modulo addressing
    –Block repeat
    –Multiply accumulate
  Half-word instructions
    –Effectively double the number of usable registers
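The effect of a saturating add can be shown by comparing it with ordinary wrap-around addition. The sketch below assumes signed 16-bit operands (matching the half-word instructions mentioned above); the function names are hypothetical.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def saturating_add16(a, b):
    """Add two signed 16-bit values, clamping the result to the int16 range."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def wrapping_add16(a, b):
    """Add two signed 16-bit values with two's-complement wrap-around."""
    total = (a + b) & 0xFFFF
    return total - 0x10000 if total >= 0x8000 else total


print(saturating_add16(30000, 10000))  # clamps at the top of the range
print(wrapping_add16(30000, 10000))    # wraps to a negative value
```

For pixel and audio data, clamping keeps an overflow looking like a bright or loud sample instead of flipping its sign, which is why media instruction sets provide it directly.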
D30V/MPEG Interconnect Specialization
  Chip layout specialized for decoding streaming MPEG data
D30V/MPEG Memory Specialization
  32 Kbyte data RAM
  64 Kbyte instruction RAM
  4 Kbyte RAM for Variable Length Encoder/Decoder (VLC/VLD) tables
  Special registers
    –MOD_S & MOD_E for modulo addressing
    –RPT_S, RPT_E, and RPT_C for looping
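Modulo addressing with a start and end register can be sketched as a pointer update that wraps inside a fixed window, which is how circular buffers are walked without explicit bounds checks. In the sketch below the parameters `mod_s` and `mod_e` stand in for the MOD_S and MOD_E registers. Treating the end address as inclusive is an assumption, and the function name is hypothetical.

```python
def modulo_increment(addr, step, mod_s, mod_e):
    """Advance addr by step, wrapping inside the window [mod_s, mod_e].

    mod_e is assumed to be the last valid address (inclusive).
    """
    length = mod_e - mod_s + 1
    return mod_s + (addr - mod_s + step) % length


# Walk a 4-entry circular buffer at addresses 0x100..0x103 for six steps.
addr = 0x102
visited = []
for _ in range(6):
    visited.append(addr)
    addr = modulo_increment(addr, 1, 0x100, 0x103)
print([hex(a) for a in visited])
```

In hardware the wrap happens as part of the address update, so a filter loop driven by a repeat counter needs no extra compare-and-branch instructions.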
D30V/MPEG Functional Specialization
  VLC/VLD: Variable Length Encoding/Decoding units
D30V/MPEG Low Power Usage
  2.5 Volts at 243 MHz
  Uses 2.0 Watts
D30V/MPEG Performance
  12% speedup from inter-pipe bypasses
  Special VLC/VLD functional blocks speed up MPEG decoding
1.3 GOPS Parallel DSP
  Achieve real-time image processing capability
  Employ data parallelism to achieve goal
    –High level algorithms, non-parallelizable
      Arithmetic encoding
    –Medium level algorithms, medium parallelizable
      Contour tracking of binary images
    –Low level algorithms, highly parallelizable
      Filters and transforms
      Data independent control and data flow
      80% of MPEG-2, 60% of MPEG-4
1.3 GOPS Parallel DSP Processor Specialization
  Central control unit
    –RISC based
    –Controls multiple SIMD units
1.3 GOPS Parallel DSP Instruction Set Specialization
  VLIW instructions
    –3 instructions per issue
      1 load/store on 16-bit data
      2 arithmetic operations on 16/32-bit data
1.3 GOPS Parallel DSP Interconnect Specialization
  DMA/MCU (Direct Memory Access/Memory Control Unit)
    –Handles cache misses
    –Performs prefetch operations from matrix memory
    –Interfaces with external 64-bit data bus and 32-bit address bus for SRAM and DRAM modules
1.3 GOPS Parallel DSP Memory Specialization
  Memory tailored to image processing needs
    –Provides parallel high-bandwidth access to shared data with matrix-shaped access patterns
  Individual cache memory
    –Services irregular memory requests
1.3 GOPS Parallel DSP Functional Specialization
  Multiple SIMD units
    –Currently 4 units for prototype
    –16 units planned for future versions
    –SIMD approach has been extended with ASIMD, an autonomous instruction selection capability
      Improves handling of conditional branches
1.3 GOPS Parallel DSP Low Power Usage
  3.3 Volts
  Uses 650 milliwatts
1.3 GOPS Summary
  Sustained performance of 380 MIPS
    –Around 90% utilization
SuperENC
  MPEG-2 video encoder
SuperENC Processor Specialization
  Software implemented RISC architecture
    –5-stage pipeline
    –81 MHz, 32-bit wide data/instruction path
  Software implemented SIMD/SDIF (SDRAM Interface) modules
SuperENC Instruction Set Specialization
  There is no instruction set specialization mentioned in the paper.
SuperENC Interconnect Specialization
  SDIF
    –All memory access goes through SDIF
    –Relays data between modules without going to external memory
      Reduces memory bandwidth and power consumption
SuperENC Memory Specialization
  Uses external RAM
    –Can access two 16 Mbit SDRAMs or one 64 Mbit SDRAM
SuperENC Functional Specialization
  MPEG algorithm is broken up into hardware functional blocks
    –Examples
      DCT, Discrete Cosine Transform
      IDCT, Inverse Discrete Cosine Transform
      ME, Motion Estimation
      MC, Motion Compensation
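To give a sense of what the DCT and IDCT blocks compute, the sketch below implements a one-dimensional orthonormal DCT-II and its inverse in software. MPEG-2 applies the transform to 8x8 blocks (rows, then columns); this 1-D version and its function names are illustrative only and say nothing about how the hardware blocks are built.

```python
import math


def _scale(k, n):
    """Orthonormal scaling factor for coefficient k of an n-point DCT."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_1d(samples):
    """Orthonormal DCT-II of a list of samples."""
    n = len(samples)
    return [
        _scale(k, n) * sum(
            samples[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        for k in range(n)
    ]


def idct_1d(coeffs):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * coeffs[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)
        )
        for i in range(n)
    ]


row = [52, 55, 61, 66, 70, 61, 64, 73]  # arbitrary sample values
restored = idct_1d(dct_1d(row))
print([round(v) for v in restored])
```

The nested multiply-accumulate loops make the cost visible: every output needs a full pass over the inputs, which is why encoders move this step into a dedicated block.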
SuperENC Low Power Usage
  2.5 Volts internal
  3.3 Volts I/O
  1.5 Watts
SuperENC Summary
  SuperENC makes use of many hardware functional blocks to implement the MPEG-2 encoding algorithm
Metric Results
  D30V/MPEG highest rated