
1 Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf

2 Outline
- Fritts: compiler studies.
- Lv: compiler studies.
- Memory system optimizations.

3 Basic Characteristics
- Comparison of operation frequencies with SPEC
  - (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
  - Lower frequency of memory and floating-point operations
  - More arithmetic operations
  - Larger variation in memory usage
- Basic block statistics
  - Average of 5.5 operations per basic block
  - Need global scheduling techniques to extract ILP
- Static branch prediction
  - Average of 89.5% static branch prediction accuracy on training input
  - Average of 85.9% static branch prediction accuracy on evaluation input
- Data types and sizes
  - Nearly 70% of all instructions require only 8- or 16-bit data types
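The training-versus-evaluation numbers above come from profile-based static prediction: each branch is assigned one fixed direction (its majority outcome in a training run) and that fixed prediction is then scored on a separate evaluation run. A minimal sketch, with hypothetical (branch PC, taken) traces:

```python
# Static branch prediction sketch: learn one fixed direction per branch
# from a training trace, then measure accuracy on an evaluation trace.
# The traces below are hypothetical (branch_pc, taken) pairs.
from collections import defaultdict

def train_static_predictor(training_trace):
    counts = defaultdict(lambda: [0, 0])  # pc -> [not-taken count, taken count]
    for pc, taken in training_trace:
        counts[pc][int(taken)] += 1
    # Predict the majority direction observed during training.
    return {pc: c[1] >= c[0] for pc, c in counts.items()}

def accuracy(predictions, eval_trace):
    hits = sum(predictions.get(pc, True) == taken for pc, taken in eval_trace)
    return hits / len(eval_trace)

training = [(0x40, True)] * 9 + [(0x40, False)] + [(0x80, False)] * 4
evaluation = [(0x40, True)] * 8 + [(0x40, False)] * 2 + [(0x80, False)] * 2

pred = train_static_predictor(training)
print(accuracy(pred, evaluation))  # 10 of 12 branches predicted correctly
```

As on the slide, accuracy on the evaluation input is lower than on the training input whenever branch behavior shifts between the two runs.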

4 Breakdown of Data Types by Media Type

5 Memory Statistics
- Working set size
  - Cache regression: cache sizes 1 KB to 4 MB
  - Assumed line size of 64 bytes
  - Measured read and write miss ratios
- Spatial locality
  - Cache regression: line sizes 8 to 1024 bytes
  - Assumed cache size of 64 KB
  - Measured read and write miss ratios
- Memory results
  - Data memory: 32 KB working set and 60.8% spatial locality (up to 128 bytes)
  - Instruction memory: 8 KB working set and 84.8% spatial locality (up to 256 bytes)
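The cache regression described above amounts to replaying an address trace through a simple cache model while sweeping one parameter. A minimal sketch for a direct-mapped cache, using a hypothetical sequential trace:

```python
# Cache regression sketch: replay an address trace through a direct-mapped
# cache model and report the miss ratio. Sweeping cache_size gives the
# working-set curve; sweeping line_size gives the spatial-locality curve.
def miss_ratio(trace, cache_size, line_size):
    num_lines = cache_size // line_size
    tags = [None] * num_lines          # one resident block per set
    misses = 0
    for addr in trace:
        block = addr // line_size      # block (line) address
        index = block % num_lines      # direct-mapped set index
        if tags[index] != block:
            misses += 1
            tags[index] = block
    return misses / len(trace)

# Hypothetical sequential trace of 4-byte word accesses: high spatial
# locality, so doubling the line size halves the miss ratio.
trace = list(range(0, 4096, 4))
print(miss_ratio(trace, 1024, 64))     # one miss per 64-byte line
print(miss_ratio(trace, 1024, 128))
```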

6 Data Spatial Locality

7 Multimedia Looping Characteristics
- Highly loop-centric
  - Nearly 95% of execution time spent within the two innermost loop levels
- Large number of iterations
  - Significant processing regularity
  - About 10 iterations per loop on average
- Path ratio indicates intra-loop complexity
  - Computed as the ratio of the average number of instructions executed per loop invocation to the total number of instructions in the loop
  - Average path ratio of 78%
  - Indicates greater control complexity than expected
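The path ratio defined above is a straightforward quotient; a small sketch with hypothetical per-invocation instruction counts:

```python
# Path ratio sketch: average instructions executed per loop invocation
# divided by the total instructions in the loop body. Ratios below 100%
# mean conditional paths skip part of the body on some invocations.
def path_ratio(instrs_per_invocation, total_loop_instrs):
    avg = sum(instrs_per_invocation) / len(instrs_per_invocation)
    return avg / total_loop_instrs

# Hypothetical 50-instruction loop body with branchy control flow:
executed = [50, 40, 35, 45, 30]
print(path_ratio(executed, 50))  # 0.8, i.e. an 80% path ratio
```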

8 Average Iterations per Loop and Path Ratio - average number of loop iterations - average path ratio

9 Instruction Level Parallelism
- Instruction-level parallelism
  - Base model: single issue using classical optimizations only
  - Parallel model: 8-issue
- Explores only parallel scheduling performance
  - Assumes an ideal processor model
  - No performance penalties from branches, cache misses, etc.

10 Workload Evaluation Conclusions
- Operation characteristics
  - More arithmetic operations; less memory and floating-point usage
  - Large variation in memory usage
  - (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
- Good static branch prediction
  - Multimedia: 10-15% average miss ratio
  - General-purpose: 20-30% average miss ratio
  - Similar basic block sizes (5 instructions per basic block)
- Primarily small data types (8 or 16 bits)
  - Nearly 70% of instructions require 16-bit or smaller data types
  - Significant opportunity for subword parallelism or narrower datapaths
- Memory
  - Typically small data and instruction working set sizes
  - High data and instruction spatial locality
- Loop-centric
  - Majority of execution time spent in the two innermost loops
  - Average of 10 iterations per loop invocation
  - Path ratio indicates greater control complexity than expected
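The subword-parallelism opportunity noted above can be illustrated in software: four 8-bit lanes packed into one 32-bit word are added in a single pass, with carries masked so they cannot cross lane boundaries. This is the SWAR idiom that media ISAs provide in hardware as partitioned-add instructions; the packed values below are hypothetical.

```python
# SWAR sketch: four packed 8-bit additions in one 32-bit operation.
def pack4(bytes4):
    return sum(b << (8 * i) for i, b in enumerate(bytes4))

def unpack4(word):
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def swar_add8(a, b):
    # Add the low 7 bits of each lane, then patch the top bit in via XOR,
    # so lane overflow wraps mod 256 instead of spilling into a neighbor.
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    return low ^ ((a ^ b) & 0x80808080)

a = pack4([1, 2, 200, 4])
b = pack4([10, 20, 100, 40])
print(unpack4(swar_add8(a, b)))  # [11, 22, 44, 44] -- lane 2 wraps mod 256
```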

11 Architecture Evaluation

12
- Determine fundamental architecture style
  - Statically scheduled => Very Long Instruction Word (VLIW)
    - Allows wider issue
    - Simple hardware => potentially higher frequencies
  - Dynamically scheduled => superscalar
    - Allows decoupled data memory accesses
    - Effective at reducing penalties from stalls
- Examine a variety of architecture parameters
  - Fundamental architecture style
  - Instruction fetch architecture
  - High-frequency effects
  - Cache memory hierarchy
- Related work
  - [Lee98] "Media Architecture: General Purpose vs. Multiple Application-Specific Programmable Processors," DAC-35, 1998.
  - [PChang91] "Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors," MICRO-24, 1991.
  - [ZWu99] "Architecture Evaluation of Multi-Cluster Wide-Issue Video Signal Processors," Ph.D. Thesis, Princeton University, 1999.
  - [DZucker95] "A comparison of hardware prefetching techniques for multimedia benchmarks," Technical Report CSL-TR-95-683, Stanford University, 1995.

13 Fundamental Architecture Evaluation
- Fundamental architecture evaluation included:
  - Static vs. dynamic scheduling
  - Issue width
- Focused on non-memory-limited applications
  - Determine the impact of datapath features independent of memory
  - Assume memory techniques can solve the memory bottleneck
- Architecture model
  - 8-issue processor
  - Operation latencies targeted for 500 MHz to 1 GHz
  - 64 integer and floating-point registers
  - Pipeline: 1 fetch, 2 decode, 1 write-back, variable execute stages
  - 32 KB direct-mapped L1 data cache with 64-byte lines
  - 16 KB direct-mapped L1 instruction cache with 256-byte lines
  - 256 KB 4-way set-associative on-chip L2 cache
  - 4:1 processor-to-external-bus frequency ratio

14 Static versus Dynamic Scheduling
- Static versus dynamic scheduling for various compiler methods
- Result of increasing issue width for the given architecture and compiler methods

15 Instruction Fetch Architecture
- Aggressive versus conservative fetch methods
- Comparison of dynamic branch prediction schemes

16 Experimental Configuration - Single-Issue Processor
- SimpleScalar sim-outorder
- Single-issue configuration
- RISC

17 Experimental Configuration - Benchmarks
- Selected from different areas of MediaBench
- Additional real-world applications

18 Baseline Benchmark Characteristics
- Measured on the single-issue processor
- Execution time closely related to dynamic instruction count

19 VLIW vs. Single Issue
- Static code size
- Dynamic operation count
- Execution speed
- Basic block size

20 Static Code Size Results

21 Static Code Size Analysis
- Similar static code sizes
- On average, TM1300 requires 17% more space

22 Dynamic Operation Count Results

23 Dynamic Operation Count Analysis
- Dynamic instruction counts are similar for the two types of processors
- On average, TM1300 needs 20% more operations
- The execution time difference resulting from the ISA difference is small

24 Execution Speed Results

25 Execution Speed Analysis
- TM1300 executes all benchmarks faster than the single-issue processor
- On average, the speedup is 3.4x
  - Partly a result of the wide-issue capability
  - Partly a result of other architectural features

26 Unoptimized Basic Block Size Results

27 Unoptimized Basic Block Size Analysis
- The Trimedia compiler produces code with larger basic blocks
- On average, a basic block on TM1300 is twice as large

28 Exploiting Special Features
- Methods
  - Using custom instructions
  - Loop transformations
- Metrics
  - Execution speed
  - Memory access count
  - Basic block size
  - Operation-level parallelism

29 Execution Time Results

30 Execution Time Analysis
- 1.5x average speedup
- Data-transfer-intensive, floating-point-intensive, and table-lookup-intensive applications see less speedup

31 Memory Access Count Results

32 Memory Access Count Analysis
- Reduced average memory access count
- Memory access can be a bottleneck (MPEG)

33 Optimized Basic Block Size

34 Optimized Basic Block Size Analysis
- Significant basic block size increases yield performance gains (Region)

35 Operation Level Parallelism Results

36 Operation Level Parallelism Analysis
- Operations per cycle (OPC) close to 2
- Memory access can be a bottleneck
- A wider bus and super-word parallelism could help

37 Overall Performance Change Results

38 Overall Performance Change Analysis
- TM1300 exhibits significant performance gains over the single-issue processor
- 5x speedup on average
- 10x best-case speedup

39 What type of memory system?
- Cache:
  - Size, number of sets, block size.
- On-chip main memory:
  - Amount, type, banking, network to PEs.
- Off-chip main memory:
  - Type, organization.

40 Memory system optimizations
- Strictly software:
  - Effectively using the cache and partitioned memory.
- Hardware + software:
  - Scratch-pad memories.
  - Custom memory hierarchies.

41 Taxonomy of memory optimizations (Wolf/Kandemir)
- Data vs. code.
- Array/buffer vs. non-array.
- Cache/scratch pad vs. main memory.
- Code size vs. data size.
- Program vs. process.
- Languages.

42 Software performance analysis
- Worst-case execution time (WCET) analysis (Li/Malik):
  - Find the longest path through the CDFG.
  - Can use annotations of branch probabilities.
  - Can be mapped onto cache lines.
  - Difficult in practice---must analyze optimized code.
- Trace-driven analysis:
  - Well understood.
  - Requires code, input vectors.
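The longest-path step of WCET analysis can be sketched over an acyclic CDFG (loops already bounded or unrolled), with each basic block weighted by its worst-case cost. The graph, block names, and cycle counts below are hypothetical:

```python
# WCET sketch: longest path through an acyclic control/data flow graph,
# where each node carries a worst-case block cost in cycles.
from functools import lru_cache

# block -> (worst-case cost in cycles, successor blocks); hypothetical CDFG
cdfg = {
    "entry": (5, ["if"]),
    "if":    (2, ["then", "else"]),
    "then":  (20, ["join"]),
    "else":  (8, ["join"]),
    "join":  (3, []),
}

@lru_cache(maxsize=None)
def wcet(block):
    cost, succs = cdfg[block]
    # Worst case = this block's cost plus the costliest successor path.
    return cost + max((wcet(s) for s in succs), default=0)

print(wcet("entry"))  # 5 + 2 + 20 + 3 = 30 cycles via entry->if->then->join
```

Branch-probability annotations, as noted on the slide, would prune paths that can never execute together; this sketch conservatively takes the maximum everywhere.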

43 Software energy/power analysis
- Analytical models of cache (Su/Despain, Kamble/Ghose, etc.):
  - Decoding, memory core, I/O path, etc.
- System-level models (Li/Henkel).
- Power simulators (Vijaykrishnan et al., Brooks et al.).

44 Power-optimizing transformations
- Kandemir et al.:
  - Most energy is consumed by the memory system, not the CPU core.
  - Performance-oriented optimizations reduce memory system energy but increase datapath energy consumption.
  - Larger caches increase cache energy consumption but reduce overall memory system energy.

45 Scratch pad memories
- Explicitly managed local memory.
- Panda et al. used a static management scheme.
  - Data structures assigned to off-chip memory or scratch pad at compile time.
  - Put scalars in the scratch pad, arrays in main memory.
- May want to manage the scratch pad at run time.
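A compile-time scratch-pad assignment in the spirit of the static scheme above can be sketched as a greedy packing: rank each data structure by accesses per byte and fill the pad, leaving the rest in off-chip memory. The variable names, sizes, and access counts are hypothetical, and Panda et al.'s actual scheme uses a more detailed cost model.

```python
# Static scratch-pad allocation sketch: greedily place the data structures
# with the highest access density (accesses per byte) into the pad.
def assign_scratch_pad(variables, pad_bytes):
    # variables: list of (name, size_bytes, access_count) tuples
    ranked = sorted(variables, key=lambda v: v[2] / v[1], reverse=True)
    pad, main, free = [], [], pad_bytes
    for name, size, _ in ranked:
        if size <= free:
            pad.append(name)      # fits in the remaining scratch pad
            free -= size
        else:
            main.append(name)     # falls back to off-chip main memory
    return pad, main

variables = [("coef", 64, 5000),     # small, hot coefficient table
             ("frame", 65536, 9000), # large array, cold per byte
             ("state", 16, 3000)]    # tiny, very hot scalars
pad, main = assign_scratch_pad(variables, 1024)
print(pad, main)  # small hot structures go to the pad; the frame stays out
```

This matches the slide's heuristic outcome: scalars and small hot tables end up in the scratch pad, large arrays in main memory.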

46 Reconfigurable caches
- Use the compiler to determine the best cache configuration for various program regions.
  - Must be able to quickly reconfigure the cache.
  - Must be able to identify where program behavior changes.

47 Software methods for cache placement
- McFarling analyzed inter-function dependencies.
- Tomiyama and Yasuura used ILP.
- Li and Wolf used a process-level model.
- Kirovski et al. use profiling information plus a graph model.
- Dwyer/Fernando use bit vectors to construct bounds in instruction caches.
- Parameswaran and Henkel use heuristics.

48 Addressing optimizations
- Addressing can be expensive:
  - 55% of DSP56000 instructions performed addressing operations in MediaBench.
- Utilize specialized addressing registers, pre/post-increment/decrement, etc.
  - Place variables in the proper order in memory so that simpler operations can be used to calculate the next address from the previous one.
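The variable-placement idea above is the classic offset-assignment problem: count how often each pair of variables is accessed back to back, then lay variables out so frequent neighbors sit at adjacent addresses, where a single post-increment or post-decrement step of the address register reaches them for free. A greedy sketch over a hypothetical access sequence (real offset-assignment algorithms solve this more carefully):

```python
# Offset-assignment sketch: order variables in memory so that variables
# accessed consecutively tend to land at adjacent addresses.
from collections import Counter

def layout(access_seq):
    # Count unordered adjacencies in the access sequence.
    pairs = Counter(frozenset(p) for p in zip(access_seq, access_seq[1:])
                    if p[0] != p[1])
    order = [access_seq[0]]
    remaining = set(access_seq) - {order[0]}
    while remaining:
        # Extend the layout with the variable most often adjacent to an end.
        best = max(remaining,
                   key=lambda v: max(pairs[frozenset((v, order[0]))],
                                     pairs[frozenset((v, order[-1]))]))
        if pairs[frozenset((best, order[-1]))] >= pairs[frozenset((best, order[0]))]:
            order.append(best)
        else:
            order.insert(0, best)
        remaining.discard(best)
    return order

seq = ["a", "b", "a", "b", "c", "d", "c", "a"]
print(layout(seq))  # frequently-paired variables end up next to each other
```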

49 Hardware methods for cache optimization
- Kirk and Strosnider divided the cache into sections and allocated timing-critical code to its own section.

