Multimedia Characteristics and Optimizations
Marilyn Wolf, Dept. of EE, Princeton University
© 2004 Marilyn Wolf
Outline
- Fritts: compiler studies.
- Lv: compiler studies.
- Memory system optimizations.
Basic Characteristics
- Comparison of operation frequencies with SPEC
  - (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
  - Lower frequency of memory and floating-point operations
  - More arithmetic operations
  - Larger variation in memory usage
- Basic block statistics
  - Average of 5.5 operations per basic block
  - Global scheduling techniques needed to extract ILP
- Static branch prediction
  - Average of 89.5% static branch prediction accuracy on training input
  - Average of 85.9% static branch prediction accuracy on evaluation input
- Data types and sizes
  - Nearly 70% of all instructions require only 8- or 16-bit data types
Breakdown of Data Types by Media Type
Memory Statistics
- Working set size
  - Cache regression: cache sizes 1 KB to 4 MB
  - Assumed line size of 64 bytes
  - Measured read and write miss ratios
- Spatial locality
  - Cache regression: line sizes 8 to 1024 bytes
  - Assumed cache size of 64 KB
  - Measured read and write miss ratios
- Memory results
  - Data memory: 32 KB working set, 60.8% spatial locality (up to 128-byte lines)
  - Instruction memory: 8 KB working set, 84.8% spatial locality (up to 256-byte lines)
Data Spatial Locality
Multimedia Looping Characteristics
- Highly loop-centric
  - Nearly 95% of execution time spent within the two innermost loop levels
- Large number of iterations
  - Significant processing regularity
  - About 10 iterations per loop on average
- Path ratio indicates intra-loop complexity
  - Computed as the ratio of the average number of instructions executed per loop invocation to the total number of instructions in the loop
  - Average path ratio of 78%
  - Indicates greater control complexity than expected
Average Iterations per Loop and Path Ratio
- Average number of loop iterations
- Average path ratio
Instruction-Level Parallelism
- Instruction-level parallelism
  - Base model: single issue using classical optimizations only
  - Parallel model: 8-issue
- Explores only parallel scheduling performance
  - Assumes an ideal processor model
  - No performance penalties from branches, cache misses, etc.
Workload Evaluation Conclusions
- Operation characteristics
  - More arithmetic operations; less memory and floating-point usage
  - Large variation in memory usage
  - (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
- Good static branch prediction
  - Multimedia: 10-15% average miss ratio
  - General-purpose: 20-30% average miss ratio
  - Similar basic block sizes (about 5 instructions per basic block)
- Primarily small data types (8 or 16 bits)
  - Nearly 70% of instructions require 16-bit or smaller data types
  - Significant opportunity for subword parallelism or narrower datapaths
- Memory
  - Typically small data and instruction working set sizes
  - High data and instruction spatial locality
- Loop-centric
  - Majority of execution time spent in the two innermost loops
  - Average of 10 iterations per loop invocation
  - Path ratio indicates greater control complexity than expected
Architecture Evaluation
- Determine fundamental architecture style
  - Statically scheduled => Very Long Instruction Word (VLIW)
    - Allows wider issue
    - Simple hardware => potentially higher frequencies
  - Dynamically scheduled => superscalar
    - Allows decoupled data memory accesses
    - Effective at reducing penalties from stalls
- Examine a variety of architecture parameters
  - Fundamental architecture style
  - Instruction fetch architecture
  - High-frequency effects
  - Cache memory hierarchy
- Related work
  - [Lee98] "Media Architecture: General Purpose vs. Multiple Application-Specific Programmable Processors," DAC-35, 1998.
  - [PChang91] "Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors," MICRO-24, 1991.
  - [ZWu99] "Architecture Evaluation of Multi-Cluster Wide-Issue Video Signal Processors," Ph.D. thesis, Princeton University, 1999.
  - [DZucker95] "A Comparison of Hardware Prefetching Techniques for Multimedia Benchmarks," Technical Report CSL-TR-95-683, Stanford University, 1995.
Fundamental Architecture Evaluation
- Fundamental architecture evaluation included:
  - Static vs. dynamic scheduling
  - Issue width
- Focused on non-memory-limited applications
  - Determine the impact of datapath features independent of memory
  - Assume memory techniques can solve the memory bottleneck
- Architecture model
  - 8-issue processor
  - Operation latencies targeted for 500 MHz to 1 GHz
  - 64 integer and floating-point registers
  - Pipeline: 1 fetch, 2 decode, 1 writeback, variable execute stages
  - 32 KB direct-mapped L1 data cache with 64-byte lines
  - 16 KB direct-mapped L1 instruction cache with 256-byte lines
  - 256 KB 4-way set-associative on-chip L2 cache
  - 4:1 processor-to-external-bus frequency ratio
Static versus Dynamic Scheduling
- Static versus dynamic scheduling for various compiler methods
- Result of increasing issue width for the given architecture and compiler methods
Instruction Fetch Architecture
- Aggressive versus conservative fetch methods
- Comparison of dynamic branch prediction schemes
Experimental Configuration - Single-Issue Processor
- SimpleScalar sim-outorder
- Single-issue configuration
- RISC
Experimental Configuration - Benchmarks
- Selected from different areas of MediaBench
- Additional real-world applications
Baseline Benchmark Characteristics
- Measured on the single-issue processor
- Execution time closely related to dynamic instruction count
VLIW vs. Single Issue
- Static code size
- Dynamic operation count
- Execution speed
- Basic block size
Static Code Size Results
Static Code Size Analysis
- Similar static code sizes
- On average, the TM1300 requires 17% more space
Dynamic Operation Count Results
Dynamic Operation Count Analysis
- Dynamic instruction counts are similar for the two types of processors
- On average, the TM1300 needs 20% more operations
- The execution-time difference due to the ISA is therefore small
Execution Speed Results
Execution Speed Analysis
- The TM1300 executes all benchmarks faster than the single-issue processor
- On average, the speedup is 3.4x
  - Partly the result of the wide-issue capability
  - Partly the result of other architectural features
Unoptimized Basic Block Size Results
Unoptimized Basic Block Size Analysis
- The Trimedia compiler produces code with larger basic blocks
- On average, a basic block on the TM1300 is twice as large
Exploiting Special Features
- Methods
  - Using custom instructions
  - Loop transformations
- Metrics
  - Execution speed
  - Memory access count
  - Basic block size
  - Operation-level parallelism
Execution Time Results
Execution Time Analysis
- 1.5x average speedup
- Data-transfer-intensive, floating-point-intensive, and table-lookup-intensive applications show less speedup
Memory Access Count Results
Memory Access Count Analysis
- Reduced average memory access count
- Memory access can be a bottleneck (MPEG)
Optimized Basic Block Size Results
Optimized Basic Block Size Analysis
- A significant change in basic block size results in a performance gain (region)
Operation Level Parallelism Results
Operation Level Parallelism Analysis
- Operations per cycle (OPC) close to 2
- Memory access can be a bottleneck
- A wider bus and super-word parallelism could help
Overall Performance Change Results
Overall Performance Change Analysis
- The TM1300 exhibits a significant performance gain over the single-issue processor
- 5x speedup on average
- 10x best-case speedup
What type of memory system?
- Cache:
  - Size, number of sets, block size.
- On-chip main memory:
  - Amount, type, banking, network to PEs.
- Off-chip main memory:
  - Type, organization.
Memory system optimizations
- Strictly software:
  - Effectively using the cache and partitioned memory.
- Hardware + software:
  - Scratch-pad memories.
  - Custom memory hierarchies.
Taxonomy of memory optimizations (Wolf/Kandemir)
- Data vs. code.
- Array/buffer vs. non-array.
- Cache/scratch pad vs. main memory.
- Code size vs. data size.
- Program vs. process.
- Languages.
Software performance analysis
- Worst-case execution time (WCET) analysis (Li/Malik):
  - Find the longest path through the CDFG.
  - Can use annotations of branch probabilities.
  - Can be mapped onto cache lines.
  - Difficult in practice---must analyze optimized code.
- Trace-driven analysis:
  - Well understood.
  - Requires code and input vectors.
Software energy/power analysis
- Analytical models of caches (Su/Despain, Kamble/Ghose, etc.):
  - Decoding, memory core, I/O path, etc.
- System-level models (Li/Henkel).
- Power simulators (Vijaykrishnan et al., Brooks et al.).
Power-optimizing transformations
- Kandemir et al.:
  - Most energy is consumed by the memory system, not the CPU core.
  - Performance-oriented optimizations reduce memory system energy but increase datapath energy consumption.
  - Larger caches increase cache energy consumption but reduce overall memory system energy.
Scratch pad memories
- Explicitly managed local memory.
- Panda et al. used a static management scheme:
  - Data structures are assigned to off-chip memory or the scratch pad at compile time.
  - Scalars go in the scratch pad, arrays in main memory.
- May want to manage the scratch pad at run time.
Reconfigurable caches
- Use the compiler to determine the best cache configuration for various program regions.
  - Must be able to quickly reconfigure the cache.
  - Must be able to identify where program behavior changes.
Software methods for cache placement
- McFarling analyzed inter-function dependencies.
- Tomiyama and Yasuura used ILP.
- Li and Wolf used a process-level model.
- Kirovski et al. use profiling information plus a graph model.
- Dwyer/Fernando use bit vectors to construct bounds in instruction caches.
- Parameswaran and Henkel use heuristics.
Addressing optimizations
- Addressing can be expensive:
  - 55% of DSP56000 instructions in MediaBench performed addressing operations.
- Utilize specialized addressing registers, pre/post-increment/decrement, etc.
  - Place variables in the proper order in memory so that simpler operations can be used to calculate the next address from the previous one.
Hardware methods for cache optimization
- Kirk and Strosnider divided the cache into sections and allocated timing-critical code to its own section.