Published by Martin West. Modified over 9 years ago.
1
PROCESSOR ARCHITECTURES FOR MULTIMEDIA APPLICATIONS
Oguz Karacuka
2
What Is Multimedia Processing?
Desktop:
– 3D graphics (games)
– Speech recognition (voice input)
– Video/audio decoding (MPEG, MP3 playback)
Servers:
– Video/audio encoding (video servers, IP telephony)
– Digital libraries and media mining (video servers)
– Computer animation, 3D modeling & rendering (movies)
Embedded:
– 3D graphics (game consoles)
– Video/audio decoding & encoding (set-top boxes, PVR...)
– Image processing (digital cameras)
– Signal processing (cellular phones)
3
Characteristics Of Multimedia Apps.
Requirement for real-time response
– “Incorrect” result often preferred to slow result
– Unpredictability can be bad (e.g. dynamic execution)
Narrow data types
– Typical width of data in memory: 8 to 16 bits
– Typical width of data during computation: 16 to 32 bits
– 64-bit data types rarely needed
– Fixed-point arithmetic often replaces floating-point
Fine-grain (data) parallelism
– Identical operation applied on streams of input data
– Branches have high predictability
– High instruction locality in small loops or kernels
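The narrow data-type points above can be made concrete with a small fixed-point sketch. It assumes the Q15 format (16-bit signed values with 15 fractional bits), which the slides do not name; the helper names `to_q15`, `q15_mul` and `from_q15` are hypothetical. It shows 16-bit operands, a 32-bit intermediate product, and saturation instead of overflow.

```python
Q = 15                 # fractional bits; Q15 is an assumed format, not from the slides
ONE = 1 << Q           # 32768 represents 1.0

def to_q15(x):
    """Convert a float in [-1, 1) to a 16-bit Q15 integer, saturating at the limits."""
    v = int(round(x * ONE))
    return max(-ONE, min(ONE - 1, v))

def q15_mul(a, b):
    """Multiply two Q15 values through a 32-bit product, then round, rescale and saturate."""
    prod = a * b                  # fits in 32 bits for 16-bit inputs
    prod += 1 << (Q - 1)          # add half an LSB so the shift rounds to nearest
    res = prod >> Q               # back to 15 fractional bits
    return max(-ONE, min(ONE - 1, res))

def from_q15(v):
    """Convert a Q15 integer back to a float."""
    return v / ONE
```

For instance, `from_q15(q15_mul(to_q15(0.5), to_q15(0.5)))` gives 0.25, and multiplying the most negative value by itself saturates to the largest positive value rather than wrapping around.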
4
Characteristics Of Multimedia Apps. cont.
Coarse-grain parallelism
– Most apps organized as a pipeline of functions
– Multiple threads of execution can be used
Memory requirements
– High bandwidth requirements but can tolerate high latency
– High spatial locality (predictable pattern) but low temporal locality
– Cache bypassing and prefetching can be crucial
5
Examples of Media Functions
Matrix transpose/multiply (3D graphics)
DCT/FFT (Video, audio, communications)
Motion estimation (Video encoding, deinterlacing)
Gamma correction (3D graphics)
Haar transform (Media mining)
Median filter (Image processing)
Separable convolution (Image processing)
Viterbi decode (Communications, speech)
Bit packing (Communications, cryptography)
…
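As an illustration of one kernel from the list above, here is a minimal sketch of block-matching motion estimation using the sum of absolute differences (SAD) as the cost. The exhaustive full search and the function names `sad`, `extract` and `full_search` are assumptions for illustration; the slides do not specify a search strategy.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def extract(frame, y, x, n):
    """Return the n x n block whose top-left corner is at (y, x)."""
    return [row[x:x + n] for row in frame[y:y + n]]

def full_search(ref_frame, cur_block, y0, x0, radius):
    """Try every displacement within +/- radius of (y0, x0) in the reference
    frame and return (cost, dy, dx) for the candidate with the lowest SAD."""
    n = len(cur_block)
    h, w = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = y0 + dy, x0 + dx
            if 0 <= y <= h - n and 0 <= x <= w - n:   # skip candidates outside the frame
                cost = sad(cur_block, extract(ref_frame, y, x, n))
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best
```

The inner loop applies the same subtract, absolute-value and accumulate steps to every pixel pair, which is the kind of regular, data-parallel work the earlier slides describe.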
6
Approaches to Media Processing
VLIW with SIMD extensions (aka mediaprocessors, Adapted Programmable Architectures)
ASICs/FPGAs (Dedicated/Function Specific Architectures)
Multimedia Processing DSPs (Flexible Programmable Architectures)
Vector Processors
General-purpose processors with SIMD extensions
coldfire: Dedicated multimedia processors are typically custom-designed architectures intended to perform specific multimedia functions. These functions usually include video and audio compression and decompression, in which case the processors are referred to as video codecs. In addition to support for compression, some advanced multimedia processors provide support for 2D and 3D graphics applications. Designs of dedicated multimedia processors range from fully custom architectures with minimal programmability, referred to as function specific architectures, to fully programmable architectures. Programmable architectures can be further classified into flexible programmable architectures, which provide moderate to high flexibility, and adapted programmable architectures, which provide increased efficiency and less flexibility [1]. Dedicated multimedia processors use a variety of architectural schemes, from multiple functional units with a RISC or DSP (digital signal processor) core to multiple-processor schemes. The latest dedicated processors use single-instruction-multiple-data (SIMD) and very-long-instruction-word (VLIW) architectures, as well as some hybrid schemes. These architectures are presented in Section 3.
General-purpose (GP) processors provide support for multimedia by including multimedia instructions in the instruction set. Instead of performing specific multimedia functions (such as compression and 2D/3D graphics), GP processors provide instructions specifically created to support generic operations in video processing. For example, these instructions include support for 8-bit data types (pixels), efficient data addressing and I/O instructions, and even instructions to support motion estimation. The latest instruction-set extensions, such as MMX (Intel), VIS (Sun) and MAX-2 (HP), incorporate some form of SIMD architecture, which performs the same operation in parallel on multiple data elements.
7
Application Example: MPEG Dec.
8
MPEG Encoder & Decoder Complexity
9
Function Specific Architectures
Limited (if any) programmability
DSP or RISC core processor for main control
Special hardware accelerators for the DCT, quantization, entropy encoding, motion estimation...
High efficiency and speed: typically better compared to programmable architectures.
The silicon area optimization achieved by function-specific architectures allows lower production cost.
10
Function Specific Architectures
11
Programmable Dedicated Architectures
Increased flexibility: enables the processing of different tasks under software control.
Higher cost for design and manufacturing: additional hardware for program control is required.
Require software development for the application: parallelization strategies have to be applied.
12
Flexible Programmable Architectures
TI’s Multimedia Video Processor (MVP) TMS320C80
coldfire: The MVP combines a RISC master processor and four DSP processors in a crossbar-based SIMD shared-memory architecture, as shown. The master processor can be used for control, floating-point operations, audio processing, or 3D graphics transformations. Each DSP performs all the typical operations of a general-purpose DSP and can also perform bit-field and multiple-pixel operations. Each DSP has multiple functional elements (multiplier, ALU, local registers, a barrel shifter, address generators, and a program-control flow unit), all controlled by very long 64-bit instruction words (VLIW concept). The RISC processor, DSP processors, and the memory modules are fully interconnected through the global crossbar network, which can be switched every instruction clock cycle (20 ns). A 50 MHz MVP executes more than 2 GOPS.
13
Adapted Programmable Architectures
C-Cube’s VRP – VRP2
coldfire: The VRP2 processor consists of a 32-bit RISC processor and two special functional units for variable length coding and motion estimation, as shown in the block diagram in Figure 7. Specially designed instructions in the RISC processor provide an efficient implementation of the DCT and other video-related operations.
14
VLIW Advanced Architectures
Reduce the number of cycles per instruction required for execution of highly complex and parallel algorithms
Multiple independent functional units that are directly controlled by long instruction words.
Inefficient use of silicon: requires a giant routing network of buses and crossbar switches.
All functional units share a common large register file
Code compaction is typically done by a special compiler, which can predict branch outcomes by applying an algorithm known as trace scheduling
Can be combined with SIMD architectures for increased parallelism, e.g. Mitsubishi D30V and Philips Semiconductors’ TriMedia
coldfire: The VLIW architectural model is used in the latest dedicated multimedia processors. A typical VLIW architecture uses long instruction words that can be hundreds of bits in length. The idea behind the VLIW concept is to reduce the number of cycles per instruction required for execution of highly complex and parallel algorithms by using multiple independent functional units that are directly controlled by long instruction words. This concept is illustrated in Figure 10, where multiple functional units operate in parallel under control of a long instruction. All functional units share a common large register file [11]. Different fields of the long instruction word contain opcodes to activate different functional units. Programs written for conventional computers with 32-bit instruction words must be compacted to fit the VLIW instructions. This code compaction is typically done by a special compiler, which can predict branch outcomes by applying an algorithm known as trace scheduling.
15
Philips TriMedia CPU64 Arch.
16
Philips TriMedia CPU64 Arch.
5-slot VLIW architecture with a 64-bit word size;
27 functional units, offering a choice of operation types in each slot of the instruction;
any operation can be guarded to provide conditional execution without branching;
all functional units provide vector-style subword parallelism on byte, half-word, or word entities;
instruction set and functional units optimized with respect to media processing;
a single multi-ported register file with bypass network, allowing 1-cycle latency operations;
32 kB, 8-way instruction cache;
16 kB, 8-way, quasi-dual ported, data cache;
a variable-length (compressed) instruction set design.
coldfire: The TriMedia CPU64 architecture is a 5-slot VLIW machine, in principle launching a long instruction every clock cycle. It has a uniform 64-bit word size through all functional units, the register file, load/store units, on-chip highway and external memory. The 5 operations in a single instruction can in principle each read 2 register arguments and write one register result every clock cycle. In addition, each operation can be guarded with an optional (4th) register for conditional execution without branch penalty. All functional units provide vector-style subword parallelism on byte, half-word, or word entities. This SIMD-style operation in each of the 5 slots in parallel allows for a very high media processing throughput. There is almost no support for arithmetic on 64-bit integers, 64-bit (double precision) floating point numbers, or 64-bit address ranges, since this was not considered important for the intended application area. With the exception of floating point divide and square root, all functional units are pipelined, allowing a restart every cycle. The latencies vary from 1 (for operations like add, compare, bitand, bitshift, byteshuffle) to 4 (word multiply with round). A register-file bypass allows an operation result to be used as an argument for a next operation without having to wait for register-file storage and retrieval.
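The idea of guarded (predicated) operations described above can be sketched in software. This is only an emulation of the concept, not TriMedia code: the 32-bit width, the use of the guard's least significant bit, and the names `guarded` and `clamp_high` are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF    # assumed 32-bit lane width for this sketch

def guarded(guard, result, old):
    """Commit `result` when the guard's low bit is 1, otherwise keep `old`.
    The selection is done with bit masks, so no branch is taken."""
    m = -(guard & 1) & MASK32            # all ones if guard is set, else all zeros
    return (result & m) | (old & ~m & MASK32)

def clamp_high(x, limit):
    """Compute x > limit ? limit : x with a compare followed by a guarded move."""
    g = int(x > limit)                   # the compare produces the guard value
    return guarded(g, limit, x)
```

Because both outcomes are computed and one is selected by the guard, short conditionals like this clamp avoid a branch and its penalty entirely.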
17
Multiple-instruction, multiple-data (MIMD) architectures
Offer 10 to 100 times more throughput than existing VLIW and SIMD architectures
Multiple instructions are executed in parallel on multiple data: a control unit for each data path.
Asynchronous nature increases the complexity of software development.
18
SIMD Extensions to General Purp. Processors
WHY?
Performance
– A 1.2GHz Athlon can do MPEG-4 encoding at 6.4fps
– One 384Kbps W-CDMA channel requires 6.9 GOPS
Power consumption
– A 1.2GHz Athlon consumes ~60W
– Power consumption increases with clock frequency and complexity
Cost
– A 1.2GHz Athlon costs ~$62 to manufacture and has a list price of ~$600 (module) (year 2000)
– Cost increases with complexity
coldfire: Real-time multimedia processing on PCs and workstations is still handled by dedicated multimedia processors. However, advanced GP processors provide efficient support for certain multimedia applications. These processors can provide software-only solutions for many multimedia functions, which may significantly reduce the cost of the system. GP processors apply the SIMD approach, described in the previous section, by sharing their existing integer or floating-point data paths with a SIMD coprocessor. Many microprocessor instruction sets include instructions for accelerating multimedia applications such as DVD playback, speech recognition and 3D graphics. All leading processor vendors have recently designed GP processors that support multimedia, as shown in Figure 1. The main differences among these processors are in the way they reconfigure the internal register file structure to accommodate SIMD operations, and the multimedia instructions they choose to add.
19
SIMD Extensions to General Purp. Processors
Motivation
– Low media-processing performance of GPPs
– Cost and lack of flexibility of specialized ASICs for graphics/video
– Underutilized datapaths and registers
Basic idea: sub-word parallelism
– The mismatch between wide data paths and the relatively short data types found in multimedia applications
– Treat a 64-bit register as a vector of 2 32-bit or 4 16-bit or 8 8-bit values (short vectors)
– Partition 64-bit datapaths to handle multiple narrow operations in parallel
Initial constraints
– No additional architecture state (registers)
– No additional exceptions
– Minimum area overhead
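The sub-word parallelism idea above can be sketched by treating one 64-bit integer as eight independent 8-bit lanes. The masking trick below is a common software emulation of a partitioned adder, offered here as an illustration under that assumption; the names `pack8`, `unpack8` and `padd8` are hypothetical and do not correspond to any particular instruction set.

```python
MASK64 = (1 << 64) - 1
HIGH = 0x8080808080808080    # top bit of each 8-bit lane
LOW = 0x7F7F7F7F7F7F7F7F     # low seven bits of each lane

def pack8(values):
    """Pack eight byte values into one 64-bit word, lane 0 in the lowest byte."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word

def unpack8(word):
    """Split a 64-bit word back into its eight byte lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]

def padd8(a, b):
    """Lane-wise wraparound add. The low seven bits of each lane are added
    normally (the sum cannot spill into the next lane), and the top bit is
    patched in with XOR so that no carry ever crosses a lane boundary."""
    low = (a & LOW) + (b & LOW)
    return (low ^ ((a ^ b) & HIGH)) & MASK64
```

A single call to `padd8` performs eight byte additions at once, which is the effect a partitioned 64-bit datapath achieves in hardware.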
20
Overview of SIMD Extensions
21
Intel’s MMX Example
Targeted to accelerate multimedia and communications applications, especially on the Internet.
MMX extends the basic integer instructions (add, subtract, multiply, compare, and shift) into SIMD versions.
Added DCT / IDCT kernels
MPEG-1 video decompression speed-up with MMX is about 80%, while some other applications, such as image filtering, speed up by as much as 370%.
22
Summary of SIMD Instructions
Integer arithmetic
– Addition and subtraction with saturation
– Fixed-point rounding modes for multiply and shift
– Sum of absolute differences
– Multiply-add, multiplication with reduction
– Min, max
Floating-point arithmetic
– Packed floating-point operations
– Square root, reciprocal
– Exception masks
Data communication
– Merge, insert, extract
– Pack, unpack (width conversion)
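Two of the integer operations listed above can be sketched element by element. The sketch models lanes as Python lists rather than packed registers, and the function names are hypothetical; `pmadd16` follows the general shape of a pairwise multiply-add such as MMX's PMADDWD, without claiming to match its exact overflow behavior.

```python
def padd_wrap_u8(a, b):
    """Ordinary modular add on unsigned bytes: overflow wraps around."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]

def padd_sat_u8(a, b):
    """Saturating add on unsigned bytes: overflow clamps to 255."""
    return [min(x + y, 255) for x, y in zip(a, b)]

def pmadd16(a, b):
    """Multiply corresponding 16-bit elements and sum each adjacent pair of
    products, halving the element count and widening the results."""
    prods = [x * y for x, y in zip(a, b)]
    return [prods[i] + prods[i + 1] for i in range(0, len(prods), 2)]
```

Saturation matters for pixels: adding 100 to a pixel value of 200 should give white (255), not wrap to a dark value of 44. The multiply-add form maps directly onto dot products, the core of filters and transforms such as the DCT.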
23
Summary of SIMD Instructions
Comparisons
– Integer and FP packed comparison
– Compare absolute values
– Element masks and bit vectors
Memory
– No new load-store instructions for short vectors
– No support for strides or indexing
– Short vectors handled with 64b load and store instructions
– Pack, unpack, shift, rotate, shuffle to handle alignment of narrow data-types within a wider one
– Prefetch instructions for utilizing temporal locality
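The pack and unpack operations mentioned above handle width conversion: bytes are widened to 16 bits for computation and narrowed again afterwards. The following is a minimal sketch on 64-bit words; the lane order, the unsigned interpretation of the 16-bit lanes, and the names `unpack_lo_u8_to_u16` and `pack_u16_to_u8_sat` are assumptions for illustration rather than the behavior of a specific instruction.

```python
def unpack_lo_u8_to_u16(word):
    """Widen the four low bytes of a 64-bit word into four zero-extended
    16-bit lanes, leaving headroom for intermediate results."""
    out = 0
    for i in range(4):
        out |= ((word >> (8 * i)) & 0xFF) << (16 * i)
    return out

def pack_u16_to_u8_sat(lo, hi):
    """Narrow eight unsigned 16-bit lanes (four from `lo`, four from `hi`)
    into eight bytes, clamping any value above 255."""
    out = 0
    for i in range(4):
        a = (lo >> (16 * i)) & 0xFFFF
        b = (hi >> (16 * i)) & 0xFFFF
        out |= min(a, 255) << (8 * i)          # lanes from lo fill bytes 0..3
        out |= min(b, 255) << (8 * (i + 4))    # lanes from hi fill bytes 4..7
    return out
```

This widen, compute, narrow pattern is the data-width adjustment overhead that the slides count among the costs of short-vector extensions.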
24
SIMD Ext. for GPP Summary
Narrow vector extensions for GPPs
– 64b or 128b registers as vectors of 32b, 16b, and 8b elements
Based on sub-word parallelism and partitioned datapaths
Instructions
– Packed fixed- and floating-point, multiply-add, reductions
– Pack, unpack, permutations
2x to 4x performance improvement over base architecture
– Limited by memory bandwidth
Difficult to use (no compilers)
Overhead of handling alignment and datawidth adjustment
Optimized shared libraries
– Written in assembly, distributed by vendor
– Need well defined API for data format and use
25
SUMMARY
Computationally intensive multimedia functions, such as MPEG encoding, HDTV codecs, 3D processing, and virtual reality, will still require dedicated processors.
We should expect new generations of GP processors to devote more and more of the available chip real estate to multimedia support.