1
Architectures for Video Signal Processing
Wayne Wolf, Dept. of EE, Princeton University
2
Outline
- Multimedia requirements
- Architectural styles
- Jason Fritts' PhD work: programmable VSPs
3
Multimedia requirements
- Today, compression is the dominant application.
- Tomorrow, analysis will be as important:
  - object recognition;
  - summarization;
  - analysis of situations.
4
Storyboard made of keyframes
For political ads, see www.ee.princeton.edu/~caeti
5
Key frame analysis algorithm
- Compute optical flow.
- Compute the sum of the magnitudes of the optical flow vectors per frame.
- Select key frames at local minima; the min/max ratio is a user parameter.
[Figure: motion magnitude over time, with keyframe 1 and keyframe 2 marked at local minima]
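To make the selection step concrete, here is a minimal C sketch that picks key frames at local minima of a per-frame motion signal, with the min/max ratio as the user parameter. It assumes the optical-flow magnitudes have already been summed per frame; the function name, the interpretation of the min/max test, and the example data are assumptions for illustration, not the original implementation.

```c
/* Key-frame selection sketch: pick frames at local minima of motion,
 * accepting a minimum only if it is small relative to the largest
 * motion seen since the previous key frame (the "min/max ratio").
 * Hypothetical interface; not the original implementation. */
#include <stdio.h>

int select_keyframes(const double *motion, int n_frames,
                     double minmax_ratio, int *keyframes, int max_keys)
{
    int n_keys = 0;
    double running_max = motion[0];

    for (int i = 1; i + 1 < n_frames; i++) {
        if (motion[i] > running_max)
            running_max = motion[i];

        int is_local_min = motion[i] <= motion[i - 1] &&
                           motion[i] <= motion[i + 1];

        if (is_local_min && running_max > 0.0 &&
            motion[i] / running_max <= minmax_ratio &&
            n_keys < max_keys) {
            keyframes[n_keys++] = i;      /* frame i becomes a key frame */
            running_max = motion[i];      /* restart the max for the next segment */
        }
    }
    return n_keys;
}

int main(void)
{
    double motion[] = { 5.0, 8.0, 2.0, 7.0, 9.0, 1.5, 6.0 };
    int keys[8];
    int n = select_keyframes(motion, 7, 0.5, keys, 8);
    for (int i = 0; i < n; i++)
        printf("key frame at index %d\n", keys[i]);
    return 0;
}
```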
6
The multimedia processing funnel
[Figure: a funnel from high data volume to high data abstraction: pixel processing at the wide end, principal component analysis and hidden Markov models at the narrow end]
7
Styles of video processing
- Single-instruction multiple-data (SIMD).
- Heterogeneous multiprocessors.
- Instruction set architecture (ISA) extensions.
- Very long instruction word (VLIW) processors.
8
SIMD processing
- Broadcast an operation to an array of processing elements, each of which has its own data.
- Well suited to regular, data-oriented operations.
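As a toy illustration of the SIMD model, the sketch below simulates a small array of processing elements in software: one controller broadcasts a single operation and each PE applies it to its own local data. The PE count and the broadcast operation are invented for illustration.

```c
/* SIMD sketch: one control unit broadcasts the same operation to every
 * processing element (PE); each PE applies it to its own local data.
 * The PE array is simulated in software purely for illustration. */
#include <stdio.h>

#define N_PES 8

typedef struct {
    int local_data;   /* each PE holds its own operand */
} pe_t;

/* The "broadcast": every PE executes the same operation in lockstep. */
static void broadcast_add(pe_t pes[], int n, int operand)
{
    for (int i = 0; i < n; i++)
        pes[i].local_data += operand;   /* same instruction, different data */
}

int main(void)
{
    pe_t pes[N_PES];
    for (int i = 0; i < N_PES; i++)
        pes[i].local_data = i * 10;

    broadcast_add(pes, N_PES, 3);       /* controller issues one instruction */

    for (int i = 0; i < N_PES; i++)
        printf("PE%d: %d\n", i, pes[i].local_data);
    return 0;
}
```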
9
A block correlation architecture
[Figure: block correlation architecture]
10
Heterogeneous multiprocessor design
- Accelerators will be needed for quite some time to come:
  - power;
  - performance.
- Candidates for acceleration:
  - complex coding and error correction;
  - motion estimation.
11
Expensive operations
- Expensive operations can be sped up by special-purpose units:
  - specialized memory accesses;
  - specialized datapath operations.
- Special-purpose units may be useful only for certain parameters:
  - block size;
  - search region size.
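To show where those parameters enter, here is a minimal software sketch of block-matching motion estimation, the kind of kernel a special-purpose unit would accelerate. The SAD metric, frame layout, and parameter names are illustrative assumptions, and bounds checking is omitted; a fixed-function unit would hard-wire the block size and search range.

```c
/* Block-matching motion estimation sketch: for one block in the current
 * frame, search a window in the reference frame for the best match by
 * sum of absolute differences (SAD). Block size and search range are the
 * parameters a fixed-function unit would be specialized for.
 * Bounds checks are omitted; the block and search window are assumed to
 * stay inside the frame. */
#include <stdio.h>
#include <stdlib.h>

static int sad(const unsigned char *cur, const unsigned char *ref,
               int stride, int block)
{
    int sum = 0;
    for (int y = 0; y < block; y++)
        for (int x = 0; x < block; x++)
            sum += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sum;
}

/* Finds the best motion vector (dx, dy) for the block at (bx, by). */
static void motion_search(const unsigned char *cur, const unsigned char *ref,
                          int width, int bx, int by,
                          int block, int range, int *best_dx, int *best_dy)
{
    int best = -1;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            const unsigned char *c = cur + by * width + bx;
            const unsigned char *r = ref + (by + dy) * width + (bx + dx);
            int cost = sad(c, r, width, block);
            if (best < 0 || cost < best) {
                best = cost;
                *best_dx = dx;
                *best_dy = dy;
            }
        }
    }
}

int main(void)
{
    enum { W = 32, H = 32 };
    static unsigned char cur[W * H], ref[W * H];
    /* reference frame content is the current frame shifted right by 2 pixels */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            cur[y * W + x] = (unsigned char)((x * 7 + y * 3) & 0xFF);
            ref[y * W + x] = (unsigned char)(((x - 2) * 7 + y * 3) & 0xFF);
        }
    int dx, dy;
    motion_search(cur, ref, W, 12, 12, 8, 3, &dx, &dy);
    printf("best motion vector: (%d, %d)\n", dx, dy);   /* expect (2, 0) */
    return 0;
}
```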
12
Communication bandwidth
- Performance is often limited by communication bandwidth:
  - internal;
  - external.
- Specialized communication topologies can make more efficient use of available bandwidth.
13
ISA extensions
- Augment the instruction set of a traditional microprocessor to provide media-processing instructions:
  - smaller word sizes;
  - operations particular to multimedia (e.g., saturation arithmetic).
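For example, saturation arithmetic clamps at the representable maximum instead of wrapping around. The sketch below shows a scalar 8-bit saturating add in plain C; a media ISA would provide this (usually on packed subwords) as a single instruction.

```c
/* Saturating 8-bit add: results above 255 clamp to 255 instead of
 * wrapping, which avoids visible artifacts in pixel arithmetic.
 * Plain-C illustration of what a media ISA provides as one instruction. */
#include <stdio.h>

static unsigned char add_sat_u8(unsigned char a, unsigned char b)
{
    unsigned int sum = (unsigned int)a + (unsigned int)b;
    return (unsigned char)(sum > 255 ? 255 : sum);
}

int main(void)
{
    printf("200 + 100 wraps to %d, saturates to %d\n",
           (unsigned char)(200 + 100),   /* 44: modular wrap-around */
           add_sat_u8(200, 100));        /* 255: clamped */
    return 0;
}
```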
14
Why ISA extensions?
- Easy: provide significant parallelism with small changes to the architecture.
- Cheap: low implementation cost.
- Effective: provide 2x-4x speedups.
15
Basic principles of ISA extensions
- Split the data word into subwords to provide single-instruction multiple-data (SIMD) parallelism.
- Assemble the CPU word from pixels.
[Figure: a 64-bit CPU word packed with four 16-bit pixels (pixel 1 through pixel 4)]
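A plain-C sketch of the idea: four 16-bit pixels packed into one 64-bit word, with a lane-wise add that keeps carries from spilling between pixels. A real ISA extension performs this as one partitioned-add instruction; the helper names here are invented for illustration.

```c
/* Subword SIMD sketch: four 16-bit pixels share one 64-bit word.
 * A partitioned add processes all four lanes with one word-wide
 * operation; here each lane is added separately so the carry from
 * one pixel cannot spill into the next. */
#include <stdint.h>
#include <stdio.h>

static uint64_t pack4(uint16_t p0, uint16_t p1, uint16_t p2, uint16_t p3)
{
    return (uint64_t)p0 | ((uint64_t)p1 << 16) |
           ((uint64_t)p2 << 32) | ((uint64_t)p3 << 48);
}

/* Lane-wise 16-bit add with wrap-around in each lane (no saturation). */
static uint64_t padd16(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t la = (uint16_t)(a >> (16 * lane));
        uint16_t lb = (uint16_t)(b >> (16 * lane));
        result |= (uint64_t)(uint16_t)(la + lb) << (16 * lane);
    }
    return result;
}

int main(void)
{
    uint64_t x = pack4(10, 20, 30, 40);
    uint64_t y = pack4(1, 2, 3, 4);
    uint64_t z = padd16(x, y);
    for (int lane = 0; lane < 4; lane++)
        printf("lane %d: %u\n", lane, (unsigned)(uint16_t)(z >> (16 * lane)));
    return 0;
}
```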
16
Packed compare instruction
- Used for chromakey: the subwords of the image word (wa, wb, wc, wd) are compared in parallel against the key value (xa, xb, xc, xd); the result feeds a packed logical operation that composites the logo onto the image.
[Figure: packed compare producing a per-subword mask that selects between logo and image pixels]
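The sketch below shows how a packed compare plus packed logical operations can implement chromakey on a 64-bit word of four 16-bit pixels: lanes of the image that match the key color take the logo pixel, the rest keep the image pixel. The key value, mask convention, and function names are assumptions, not a specific ISA's instructions.

```c
/* Chromakey sketch with subword operations on a 64-bit word holding
 * four 16-bit pixels. A packed compare builds an all-ones/all-zeros
 * mask per lane; packed AND/AND-NOT/OR then select the logo pixel where
 * the image matches the key color and the image pixel elsewhere. */
#include <stdint.h>
#include <stdio.h>

/* Packed compare-equal: each 16-bit lane becomes 0xFFFF if equal, 0 if not. */
static uint64_t pcmpeq16(uint64_t a, uint64_t b)
{
    uint64_t mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t la = (uint16_t)(a >> (16 * lane));
        uint16_t lb = (uint16_t)(b >> (16 * lane));
        if (la == lb)
            mask |= (uint64_t)0xFFFF << (16 * lane);
    }
    return mask;
}

int main(void)
{
    const uint16_t key = 0x00FF;                 /* assumed chromakey color */
    uint64_t key4 = 0;
    for (int lane = 0; lane < 4; lane++)
        key4 |= (uint64_t)key << (16 * lane);

    /* lanes 1 and 3 of the image match the key color */
    uint64_t image = (uint64_t)0x1111 | ((uint64_t)key << 16) |
                     ((uint64_t)0x3333 << 32) | ((uint64_t)key << 48);
    uint64_t logo  = (uint64_t)0xAAAA | ((uint64_t)0xBBBB << 16) |
                     ((uint64_t)0xCCCC << 32) | ((uint64_t)0xDDDD << 48);

    uint64_t mask = pcmpeq16(image, key4);            /* packed compare */
    uint64_t out  = (logo & mask) | (image & ~mask);  /* packed select  */

    printf("composited word: %016llx\n", (unsigned long long)out);
    return 0;
}
```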
17
VLIW architectures?
- Parallel function units, shared register file, static scheduling of operations.
[Figure: multiple function units sharing one register file, fed by instruction decode and memory]
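A toy model of VLIW execution, assuming a machine with four issue slots: the compiler packs independent operations into each long instruction word, and the hardware executes every slot of a word in the same cycle with no dynamic scheduling. The encoding and slot count are invented for illustration and do not reflect any particular VLIW machine.

```c
/* Toy VLIW model: each long instruction word carries one operation per
 * issue slot, chosen by the compiler (static scheduling). The hardware
 * executes all slots of a word in the same cycle (modeled sequentially
 * here for simplicity). */
#include <stdio.h>

#define N_SLOTS 4
#define N_REGS  16

typedef enum { OP_NOP, OP_ADD, OP_MUL } opcode_t;

typedef struct { opcode_t op; int dst, src1, src2; } operation_t;
typedef struct { operation_t slot[N_SLOTS]; } vliw_word_t;

static void execute(const vliw_word_t *program, int n_words, int regs[N_REGS])
{
    for (int w = 0; w < n_words; w++) {           /* one word per cycle */
        for (int s = 0; s < N_SLOTS; s++) {       /* every slot of the word */
            const operation_t *o = &program[w].slot[s];
            switch (o->op) {
            case OP_ADD: regs[o->dst] = regs[o->src1] + regs[o->src2]; break;
            case OP_MUL: regs[o->dst] = regs[o->src1] * regs[o->src2]; break;
            case OP_NOP: break;
            }
        }
    }
}

int main(void)
{
    int regs[N_REGS] = { 0, 2, 3, 4, 5 };   /* r1..r4 preloaded */
    /* One word: the compiler packed two independent operations; slots 3-4 are NOPs. */
    vliw_word_t prog[1] = {{ .slot = {
        { OP_ADD, 5, 1, 2 },    /* r5 = r1 + r2 */
        { OP_MUL, 6, 3, 4 },    /* r6 = r3 * r4 */
        { OP_NOP, 0, 0, 0 },
        { OP_NOP, 0, 0, 0 } } }};
    execute(prog, 1, regs);
    printf("r5 = %d, r6 = %d\n", regs[5], regs[6]);   /* 5, 20 */
    return 0;
}
```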
18
VLIW’s popularity
- Invented 20 years ago, popular today:
  - Good compiler technology.
  - Low control overhead.
  - Systems-on-silicon eliminate pinout problems.
- Advantages for video:
  - Embarrassing parallelism with static scheduling opportunities.
  - Fewer problems with code compatibility.
19
Trimedia TM-1
[Figure: TM-1 block diagram: VLIW CPU, memory interface, video in, video out, audio in, audio out, I2C, timers, serial, PCI, image coprocessor, and VLD coprocessor]
20
TM-1 VLIW CPU
[Figure: register file and read/write crossbar feeding function units FU1 through FU27, organized into issue slots 1 through 5]
21
Workload characteristics experiments
- Goal: compare media workload characteristics to general-purpose workloads.
- Used the MediaBench benchmarks.
- Compiled with the Impact compiler; measured with the Impact simulator.
22
Basic characteristics
- Comparison of operation frequencies with SPEC:
  - (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
  - Lower frequency of memory and floating-point operations.
  - More arithmetic operations.
  - Larger variation in memory usage.
- Basic block statistics:
  - Average of 5.5 operations per basic block.
  - Need global scheduling techniques to extract ILP.
23
Basic characteristics, cont’d
- Static branch prediction:
  - Average of 89.5% static branch prediction accuracy on the training input.
  - Average of 85.9% static branch prediction accuracy on the evaluation input.
- Data types and sizes:
  - Nearly 70% of all instructions require only 8- or 16-bit data types.
24
Breakdown of data types by media type
25
Memory experiment setup
- Spatial locality experiment:
  - cache regression: line sizes 8 to 1024 bytes;
  - assumed cache size of 64 KB;
  - measure read and write miss ratios.
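A minimal sketch of this kind of regression: a direct-mapped 64 KB cache is simulated over an address trace, and the miss ratio is reported for each line size from 8 to 1024 bytes. The synthetic trace and direct-mapped organization are assumptions for illustration, not the original experimental setup.

```c
/* Cache line-size sweep: simulate a direct-mapped 64 KB cache over an
 * address trace and report the miss ratio for each line size.
 * Illustrative setup; not the original experimental framework. */
#include <stdio.h>
#include <stdlib.h>

#define CACHE_BYTES (64 * 1024)

static double miss_ratio(const unsigned long *trace, int n, int line_bytes)
{
    int n_lines = CACHE_BYTES / line_bytes;
    unsigned long *tags = malloc(n_lines * sizeof *tags);
    for (int i = 0; i < n_lines; i++)
        tags[i] = (unsigned long)-1;          /* empty line */

    int misses = 0;
    for (int i = 0; i < n; i++) {
        unsigned long block = trace[i] / line_bytes;
        int set = (int)(block % n_lines);     /* direct-mapped index */
        if (tags[set] != block) {
            misses++;
            tags[set] = block;
        }
    }
    free(tags);
    return (double)misses / n;
}

int main(void)
{
    enum { N = 100000 };
    static unsigned long trace[N];
    for (int i = 0; i < N; i++)               /* synthetic, mostly sequential trace */
        trace[i] = (unsigned long)(i * 4 + (i % 7) * 64);

    for (int line = 8; line <= 1024; line *= 2)
        printf("line size %4d B: miss ratio %.4f\n",
               line, miss_ratio(trace, N, line));
    return 0;
}
```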
26
Data spatial locality
27
Multimedia looping characteristics
- Highly loop-centric:
  - 95% of CPU time is spent in the two innermost loop levels.
  - Significant processing regularity.
  - About 10 iterations per loop on average.
- Complex loop control:
  - Path ratio = average number of instructions executed per loop invocation / total number of loop instructions.
  - Average path ratio of 78%, indicating high control complexity.
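A small sketch of the path-ratio arithmetic as defined above; the instruction counts are made-up numbers chosen only to reproduce a 78% ratio.

```c
/* Path ratio sketch: average instructions executed per loop invocation
 * divided by the total number of instructions in the loop body.
 * The counts below are invented purely to illustrate the arithmetic. */
#include <stdio.h>

int main(void)
{
    int loop_static_instrs = 50;                         /* instructions in the loop body */
    int executed_per_invocation[] = { 42, 35, 40, 39 };  /* profiled counts */
    int n = 4;

    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += executed_per_invocation[i];
    double path_ratio = (total / n) / loop_static_instrs;

    printf("path ratio = %.2f\n", path_ratio);   /* 0.78: most of the body runs */
    return 0;
}
```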
28
Average iterations per loop and path ratio
[Figure: per-benchmark average number of loop iterations and average path ratio]
29
Instruction level parallelism
- Instruction-level parallelism experiment:
  - base model: single issue using classical optimizations only;
  - parallel model: 8-issue.
- Explores only parallel scheduling performance:
  - assumes an ideal processor model;
  - no performance penalties from branches, cache misses, etc.
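One way to estimate ILP under such an ideal model: schedule each operation of a dependence DAG as early as its predecessors allow, limited only by the issue width, and divide the operation count by the cycle count. The tiny DAG, unit latencies, and topologically ordered operation list below are illustrative assumptions.

```c
/* Ideal-model ILP sketch: schedule a dependence DAG as early as possible,
 * limited only by issue width (no branch or memory penalties, unit
 * latency). ILP = operations / cycles. The tiny DAG is illustrative. */
#include <stdio.h>

#define N_OPS 6
#define ISSUE_WIDTH 8

int main(void)
{
    /* preds[i][j] = 1 if op j must finish before op i starts;
     * operations are listed in topological order. */
    int preds[N_OPS][N_OPS] = {
        {0},                      /* op 0: no predecessors */
        {0},                      /* op 1: no predecessors */
        {1, 1, 0},                /* op 2 depends on ops 0 and 1 */
        {0, 0, 1, 0},             /* op 3 depends on op 2 */
        {0},                      /* op 4: independent */
        {0, 0, 0, 1, 1, 0},       /* op 5 depends on ops 3 and 4 */
    };

    int cycle_of[N_OPS];
    int issued_in_cycle[N_OPS + 1] = { 0 };
    int last_cycle = 0;

    for (int i = 0; i < N_OPS; i++) {
        int earliest = 0;
        for (int j = 0; j < i; j++)
            if (preds[i][j] && cycle_of[j] + 1 > earliest)
                earliest = cycle_of[j] + 1;      /* unit latency */
        while (issued_in_cycle[earliest] >= ISSUE_WIDTH)
            earliest++;                          /* issue-width limit */
        cycle_of[i] = earliest;
        issued_in_cycle[earliest]++;
        if (earliest > last_cycle)
            last_cycle = earliest;
    }

    printf("ILP = %.2f ops/cycle\n", (double)N_OPS / (last_cycle + 1));
    return 0;
}
```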
30
ILP results
31
Workload evaluation conclusions
- Operation characteristics:
  - More arithmetic, less memory and floating-point.
  - Large variation in memory usage.
  - (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
- Good static branch prediction:
  - Multimedia: 10-15% average miss ratio.
  - General-purpose: 20-30% average miss ratio.
  - Similar basic block sizes (about 5 instructions per basic block).
32
Workload evaluation conclusions, cont’d
- Primarily small data types (8 or 16 bits):
  - Nearly 70% of instructions require 16-bit or smaller data types.
  - Significant opportunity for subword parallelism or narrower datapaths.
- Memory:
  - Typically small data and instruction working-set sizes.
  - High data and instruction spatial locality.
33
Workload evaluation conclusions, cont’d
- Loop-centric:
  - Majority of execution time spent in the two innermost loops.
  - Average of 10 iterations per loop invocation.
  - Path ratio indicates greater control complexity than expected.
34
VSP architecture evaluation
- Determine the fundamental architecture style:
  - statically scheduled => very long instruction word (VLIW);
  - dynamically scheduled => superscalar.
- Examine a variety of architecture parameters:
  - fundamental architecture style;
  - instruction fetch architecture;
  - high-frequency effects;
  - cache memory hierarchy.
35
Fundamental architecture evaluation
- Major issues:
  - static vs. dynamic scheduling;
  - issue width.
- Focused on non-memory-limited applications.
36
Architectural model
- 8-issue processor.
- Operation latencies targeted for 500 MHz to 1 GHz.
- 64 integer and floating-point registers.
- Pipeline: 1 fetch stage, 2 decode stages, 1 write-back stage, and a variable number of execute stages.
37
Architectural model, cont’d
- 32 KB direct-mapped L1 data cache with 64-byte lines.
- 16 KB direct-mapped L1 instruction cache with 256-byte lines.
- 256 KB 4-way set-associative on-chip L2 cache.
- 4:1 processor-to-external-bus frequency ratio.
38
Static versus Dynamic Scheduling
39
Increasing issue width
40
Dynamic branch prediction comparison
41
Impact of higher processor frequencies
- Increased wire delay at higher frequencies may cause:
  - longer operation latencies;
  - delayed bypassing.
42
Processor frequency models
- Three processor models with different operation latencies:
  - 250 MHz – 500 MHz: stores – 1, loads – 2, FP – 3, mult – 3, div – 10
  - 500 MHz – 1 GHz: stores – 2, loads – 3, FP – 4, mult – 5, div – 20
  - 1 GHz – 2 GHz: stores – 3, loads – 4, FP – 5, mult – 7, div – 30
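These latencies are the kind of per-model table a cycle-level simulator would consult; a minimal sketch, assuming a simple struct layout (the field names and lookup are not from the original tools).

```c
/* Operation-latency tables for the three frequency models on the slide.
 * A cycle-level simulator would index one row per configuration; the
 * struct layout is illustrative, not the original tool's. */
#include <stdio.h>

typedef struct {
    const char *model;
    int store, load, fp, mult, div;   /* latencies in cycles */
} latency_model_t;

static const latency_model_t models[] = {
    { "250 MHz - 500 MHz", 1, 2, 3, 3, 10 },
    { "500 MHz - 1 GHz",   2, 3, 4, 5, 20 },
    { "1 GHz - 2 GHz",     3, 4, 5, 7, 30 },
};

int main(void)
{
    for (int i = 0; i < 3; i++)
        printf("%-18s load latency: %d cycles, divide: %d cycles\n",
               models[i].model, models[i].load, models[i].div);
    return 0;
}
```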
43
Processor frequency results
- 10% performance difference between the processor models.
- 35% performance degradation for delayed bypassing.
- Out-of-order scheduling and superscalar compilation are least susceptible to high-frequency effects:
  - 20-30% less performance degradation.
44
Cache evaluation
45
Evaluation of cache memory hierarchy
- Conclusions:
  - The L2 cache has little impact on performance:
    - useful for storing state during context switches.
  - External memory miss latency is the primary memory problem:
    - streaming data structures will help alleviate this.
  - External memory bandwidth is the second-most significant problem.
46
Architecture evaluation conclusions
- Fundamental architecture style:
  - VLIW and in-order superscalar are comparable.
  - Out-of-order superscalar has 70% better performance.
  - Hyperblock is the most effective compilation technique.
  - Issue widths of 3-4 are sufficient.
47
Architecture conclusions, cont’d.
- Instruction fetch architecture:
  - A small dynamic branch predictor provides good performance.
  - Aggressive fetch provides little benefit.
  - 2% performance degradation for additional pre-execute pipeline stages.
  - Instruction fetch is not critical in media processors.
48
Architecture conclusions, cont’d.
- High-frequency effects:
  - 10% performance difference between processors with varying operation latencies.
  - 35% performance degradation from delayed bypassing.
  - Out-of-order superscalar and superscalar compilation are least affected.
- Cache memory hierarchy:
  - L1 cache size has little effect on media processing.
  - External memory latency and bandwidth are the primary bottlenecks.
49
Summary
- Multimedia applications are already complex and will become more so.
- Programmable video architectures enable sophisticated applications.
- Video architectures must be sophisticated enough to handle modern video applications.