
1 Techniques de compilation pour la gestion et l'optimisation de la consommation d'énergie des architectures VLIW (Compilation techniques for managing and optimizing the energy consumption of VLIW architectures). Thèse de doctorat (PhD thesis), Gilles POKAM*, 15 July 2004. *CIFRE funding from STMicroelectronics

2 Low Power Compilation Techniques on VLIW Architectures. Ph.D. Thesis, Gilles POKAM*, July 15, 2004. *Thesis funded by STMicroelectronics

3 Motivation
- Root causes of performance increase:
  - higher clock frequency: growing at a rate of ~30% every two years, making programs run faster
  - higher integration density: process scaling following Moore's law, increasing architecture complexity
- Power consumption is quickly becoming a limiting factor.

4 Illustration of power density growth for general-purpose systems [chart: power density (W/cm²) vs. year, 1970-2010, for Intel processors from the 4004 through the 8008, 8080, 8085, 8086, 286, 386, 486, Pentium and P6; the curve passes the "hot plate" level and heads toward "nuclear reactor" densities around today, 2004]

5 Power as a design-cost constraint in embedded systems
- Embedded-system examples: PDAs, cell phones, set-top boxes, etc.
- Key points affecting design cost include:
  - average energy (battery autonomy)
  - heat dissipation (packaging cost)
  - peak power (component reliability)
- In this thesis we are concerned with total power consumption.

6 Agenda: Motivation, Thesis objectives, Program analysis, Power consumption, ILP compilation analysis, Adaptive cache strategy, Adaptive processor data-path, Conclusions

7 The goals of this thesis
- to understand the energy issues involved when compiling for performance on VLIW architectures
- to propose hardware/software solutions that improve energy efficiency

8 Why VLIW architectures?
- Popular in embedded systems: Philips TriMedia processor, Texas Instruments TMS320C62xx, Lx processor (HP/STMicroelectronics).
- Provide a power/performance alternative to general-purpose systems: statically scheduled processors, where the compiler is responsible for extracting instruction-level parallelism (ILP).

9 Research methodology
- Our analysis standpoint lies in the compiler; we therefore consider program analysis as a basis for exploring energy-reduction techniques.
- Power also depends on the underlying micro-architecture; we therefore also consider matching the hardware to the software to reduce energy consumption.

10 Thesis contributions
1. Program analysis: a methodology for characterizing the dynamic behavior of programs at static (compile) time
2. VLIW energy issues: a heuristic for understanding the energy issues involved when compiling for ILP
3. Hardware/software matching: adaptive compilation schemes targeting (1) the cache subsystem and (2) the processor data-path

11 Thesis experimental environment
- Lx VLIW processor: 4-issue width; 64 GPR, 8 CBR; 4 ALUs, 2 MULs, 1 LSU, 1 BU; 32KB 4-way data cache with 32B blocks; 32KB 1-way instruction cache with 64B lines
- Power model provided by STMicroelectronics
- Benchmarks: MiBench suite (e.g. fft, gsm, susan), MediaBench suite (e.g. mpeg, epic), PowerStone suite (e.g. summin, whetstone, v42bis)

12 Agenda: Motivation, Thesis objectives, Program analysis, Power consumption, ILP compilation analysis, Adaptive cache strategy, Adaptive processor data-path, Conclusions

13 Why do we need to analyze programs?
- Knowledge of the dynamic behavior of a program is essential to determine which program region may benefit most from an optimization.
- Programs tend to execute as a series of phases, each phase having a varying dynamic behavior [Sherwood and Calder, 1999].
- A phase can be assimilated to a program path that occurs repeatedly.
- Exposing the most frequently executed program paths, i.e. hot paths, to the compiler may help discriminate among power/performance optimizations.

14 Our approach to program-path analysis
- whole-program instrumentation ([Larus, PLDI 2000]) with the main focus on basic-block regions
- a signature to differentiate among dynamic instances of the same region
- program paths processed with a suffix array to detect all occurrences of repeated sub-paths
- heuristics to select hot paths among the sub-paths that appear repeatedly in the trace

15 Approach overview: detecting occurrences of repeated sub-paths
- a dynamic signature per region
- a suffix array built with a suffix-sorting algorithm based on KMR [Karp, Miller and Rosenberg, 1972] to detect all occurrences of repeated sub-paths
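The suffix-array step above can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: it sorts the suffixes of a basic-block trace directly (the thesis uses a KMR-based sort) and reads repeated sub-paths off the common prefixes of adjacent sorted suffixes.

```python
# Illustrative sketch: find repeated sub-paths in a basic-block trace
# using a suffix array. Any sub-path occurring twice shows up as a
# common prefix of two adjacent suffixes in sorted order.

def repeated_subpaths(trace, min_len=2):
    """Return the set of sub-paths (tuples of block IDs) of length
    >= min_len that occur at least twice in the trace."""
    suffixes = sorted(range(len(trace)), key=lambda i: trace[i:])
    found = set()
    for a, b in zip(suffixes, suffixes[1:]):
        # Longest common prefix of the two adjacent suffixes.
        lcp = 0
        while (a + lcp < len(trace) and b + lcp < len(trace)
               and trace[a + lcp] == trace[b + lcp]):
            lcp += 1
        for k in range(min_len, lcp + 1):
            found.add(tuple(trace[a:a + k]))
    return found

# A toy trace of basic-block IDs: the sub-path (1, 2, 3) repeats.
trace = [1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 4]
print(sorted(repeated_subpaths(trace)))
# -> [(1, 2), (1, 2, 3), (1, 2, 3, 4), (2, 3), (2, 3, 4), (3, 4)]
```

Sorting suffixes with Python's built-in sort is O(n² log n) in the worst case; the KMR scheme cited on the slide achieves O(n log n), which matters for whole-program traces.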

16 Hot-path selection
Not all repeated sub-paths are of interest. Selection criteria:
- Local coverage: captures the local behavior of a region
- Global coverage: gives the weight of a region in the whole program
- Reuse distance: average distance between consecutive accesses to a region
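The three criteria can be sketched on a region trace. The definitions below (in particular the window used for local coverage) are our simplified reading of the slide, not the thesis's exact formulas:

```python
def hot_path_metrics(trace, region):
    """trace: list of (region_id, instr_count) records in execution order.
    Returns (local_coverage %, global_coverage %, reuse_distance) for
    `region`. Definitions are one plausible reading of the slide."""
    total = sum(n for _, n in trace)
    idx = [i for i, (r, _) in enumerate(trace) if r == region]
    region_instr = sum(trace[i][1] for i in idx)
    # Global coverage: weight of the region in the whole program.
    global_cov = 100.0 * region_instr / total
    # Local coverage: weight of the region within the window spanned by
    # its first and last occurrence (its "local" neighbourhood).
    window = sum(n for _, n in trace[idx[0]:idx[-1] + 1])
    local_cov = 100.0 * region_instr / window
    # Reuse distance: average number of other blocks executed between
    # two consecutive occurrences of the region.
    gaps = [b - a - 1 for a, b in zip(idx, idx[1:])]
    reuse = sum(gaps) / len(gaps) if gaps else 0.0
    return local_cov, global_cov, reuse

# Hypothetical trace: region 0 executes 3 times, 10 instructions each.
trace = [(0, 10), (1, 5), (0, 10), (2, 8), (2, 8), (0, 10), (3, 4)]
print(hot_path_metrics(trace, 0))
```

A region with high global coverage but a large reuse distance is a poor reconfiguration target, which is why the slide combines all three criteria.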

17 Results summary

Bench      Hot paths (%)   Local cov. (% exec instr.)   Glob. cov. (% exec instr.)   Reuse dist. (# of BB)
dijkstra   2.81            0.09                         47                           1.74
adpcm      5.88            < 0.005                      90                           0.00
blowfish   27.01           0.06                         24                           85.00
fft        11.7            < 0.005                      7                            4.21
sha        20.0            0.06                         7                            20.75
bmath      15.22           0.05                         37                           19.21
patricia   5.85            0.15                         65                           24.84

18 Agenda: Motivation, Thesis objectives, Program analysis, Power consumption, ILP compilation analysis, Adaptive cache strategy, Adaptive processor data-path, Conclusions

19 Back to basics ...

  Power = 1/2 * C_L * V_DD^2 * a * f  +  V_DD * I_leakage
          (dynamic power)                (static power)

- current technology: dynamic power ~90% of the total, static power ~10%
- future technology trend: ~50% each [SIA, 1999]
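As a sanity check of the formula, a small script with made-up parameter values (they are illustrative, not STMicroelectronics figures) reproduces a roughly 90/10 dynamic/static split:

```python
# Illustrative evaluation of P = 1/2 * C_L * Vdd^2 * a * f + Vdd * I_leak.
# All parameter values below are hypothetical.

def total_power(c_load, vdd, activity, freq, i_leak):
    """Return (dynamic, static) power in watts."""
    dynamic = 0.5 * c_load * vdd**2 * activity * freq
    static = vdd * i_leak
    return dynamic, static

# e.g. 1 nF effective switched capacitance, 1.2 V supply, 25% switching
# activity, 400 MHz clock, 10 mA leakage (a hypothetical embedded core).
dyn, sta = total_power(1e-9, 1.2, 0.25, 400e6, 10e-3)
print(f"dynamic = {dyn:.3f} W, static = {sta:.3f} W")
```

With these numbers dynamic power dominates (72 mW vs. 12 mW), matching the slide's claim for current technology; shrinking feature sizes mainly raise I_leakage, shifting the split toward static power.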

20 Software opportunities for power reduction
- Dynamic power, common techniques: clock-gating for activity reduction, power-supply voltage scaling, frequency scaling
- Static power, common techniques: power-supply voltage scaling

21 Agenda: Motivation, Thesis objectives, Program analysis, Power consumption, ILP compilation analysis, Adaptive cache strategy, Adaptive processor data-path, Conclusions

22 Problem summary
- We want to understand under which conditions compiling for ILP may degrade energy.
- The main motivation comes from the relation between power growth and the architecture complexity exploited by the ILP compiler.
- For the rest of this study, we assume the VLIW micro-architecture is fixed and cannot be modified.

23 Metric used
- Energy and performance must be considered jointly [Horowitz] to balance program slowdown against energy reduction.
- We use a performance-to-energy ratio (PTE). Goals:
  - compare two instances of the same program at the software level
  - emphasize the range of performance values (IPC) that may degrade energy
- For a given ILP transformation, if the energy growth outweighs the obtained performance improvement, the resulting PTE is degraded.
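The PTE comparison logic can be sketched as follows; the ratio IPC/energy below is an illustrative stand-in for the thesis's exact PTE formulation:

```python
# Illustrative sketch of the performance-to-energy (PTE) comparison.
# A transformation is only worthwhile if performance grows faster than
# energy does; otherwise the ratio degrades.

def pte(ipc, energy):
    return ipc / energy

def transformation_helps(before, after):
    """before/after: (ipc, energy) of the same program region."""
    return pte(*after) > pte(*before)

# Hypothetical numbers: +20% IPC but +35% energy degrades PTE.
print(transformation_helps((1.0, 10.0), (1.2, 13.5)))   # -> False
# +20% IPC for only +10% energy improves it.
print(transformation_helps((1.0, 10.0), (1.2, 11.0)))   # -> True
```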

24 Energy model
- The execution of a bundle dissipates an energy
  E_bundle = E_base + E_execution + E_Dcache-miss + E_Icache-miss
  (a base cost, plus the energy due to executing the bundle, plus the energy due to data-cache and instruction-cache misses).
- We consider loop-intensive kernels.
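A minimal sketch of this decomposition, with hypothetical per-event costs (the thesis uses a power model provided by STMicroelectronics):

```python
# Illustrative per-bundle energy model following the slide's
# decomposition. All cost constants are made up for illustration.

E_BASE = 0.5     # nJ, fixed cost of issuing a bundle
E_OP = 0.2       # nJ per operation in the bundle
E_DMISS = 6.0    # nJ per data-cache miss
E_IMISS = 8.0    # nJ per instruction-cache miss

def bundle_energy(n_ops, d_misses, i_misses):
    return E_BASE + E_OP * n_ops + E_DMISS * d_misses + E_IMISS * i_misses

def kernel_energy(bundles):
    """bundles: iterable of (n_ops, d_misses, i_misses), e.g. one loop body."""
    return sum(bundle_energy(*b) for b in bundles)

print(kernel_energy([(4, 0, 0), (3, 1, 0), (2, 0, 1)]))  # nJ
```

For loop-intensive kernels the per-iteration bundle energies are simply multiplied by the trip count, which is why the tradeoff analysis on the next slides weights each region by its execution frequency f.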

25 We consider the hyperblock transformation
- What is a hyperblock? A predicated basic block constructed out of a region of basic blocks (e.g. a hammock region R turned into a hyperblock H); the effect of eliminating branch instructions is corrected by adding compensation code.
- Why hyperblocks? Most optimizations do not generate extra work, so optimizing for performance = optimizing for power; but hyperblock formation increases the instruction count. How does this affect energy?

26 Tradeoff analysis
- a transformation heuristic for turning a hammock region R into a hyperblock H
- impact due to the added instructions, characterized by a quantity c:
  - c < 0: extra work due to compensation code
  - c = 0: no degradation, no benefit
  - c > 0: optimal configuration
- notation: m is the number of BBs in R; N is the number of operations in R or H; n is the number of bundles in R or H; f is the execution frequency

27 Conclusions
- The heuristic shows a 17% improvement on a small subset of the PowerStone benchmarks.
- Improvement across all benchmarks is limited by:
  - available ILP: for a given IPC value, the ILP transformation must yield a much higher IPC (e.g. the case c < 0)
  - machine overhead: a small IPC improvement has no impact on energy whenever the machine overhead dominates (e.g. c <= 0)
- Suggested research directions:
  - better use of available ILP via knowledge of phase execution behavior (hot program paths)
  - better management of machine overhead via matching the architecture to the requirements of a program region

28 Agenda: Motivation, Thesis objectives, Program analysis, Power consumption, ILP compilation analysis, Adaptive cache strategy, Adaptive processor data-path, Conclusions

29 Why the cache?
- Caches are highly power-consuming (dynamic and static) components: typically 80% of the total transistor count, and about 50% of the total chip area.
- They usually appear in a monolithic configuration in embedded systems (one configuration per application).
- Varying program-phase behavior suggests that no single best cache size exists for a given application: the cache configuration can be matched to program behavior on a per-phase basis.
- Reducing the number of active and passive transistors reduces dynamic and static power.

30 Two major proposals
- Albonesi [MICRO'99]: selective cache ways, i.e. disable/enable individual cache ways. Problem: disabling a cache way loses its data; the previous cache-cell state cannot be recovered.
- Zhang et al. [ISCA'03]: way-concatenation, i.e. reduce cache associativity (e.g. reconfigure a 32K 4-way cache as a 32K 2-way cache by concatenating ways) while still maintaining full cache capacity. Problem: data coherency across the different cache configurations.

31 Program-region analysis
- Program regions are sensitive to cache size and associativity.
- Key idea: vary the associativity and size according to the characteristics of each program region.
[figure: per-region cache configurations for summin (MiBench), e.g. 32K 4-way, 32K 2-way, 16K 2-way, and 32K/16K/8K 1-way]

32 Solution for varying the cache size
- How do we keep the data? Unaccessed cache ways are put in a low-power mode (drowsy mode).
- Drowsy mode [Flautner ISCA'02] scales down the supply voltage while preserving the memory cells' state.
  - Advantage: static power is reduced as a by-product of the voltage scaling.
  - Disadvantage: a 1-cycle delay to wake up a drowsy cache way.

33 Solution for varying the degree of associativity
- Data coherency is maintained via cache-line invalidation: the tag array is kept active to monitor write accesses, and the cache controller invalidates cache lines holding an old copy on a write access.
- We save dynamic energy because lower-associativity caches access fewer memory cells than higher-associativity ones (a reduction of the switching activity a).

34 Results summary
- Three cache designs are compared: (1) no adaptive cache scheme; (2) adaptation on a per-application basis; (3) adaptation on a per-phase basis (our scheme).
- 6 out of 8 applications are sensitive to cache size and associativity, resulting in a dynamic power reduction of up to 12%.
- Static energy is reduced drastically: on average 80% across all benchmarks.
- Performance can suffer from the one-cycle wake-up delay: two applications show ~30% degradation, of which 65% is due to the one-cycle delay needed to wake up a drowsy cache way. A better cache-way allocation policy can improve this result.

35 Agenda: Motivation, Thesis objectives, Program analysis, Power consumption, ILP compilation analysis, Adaptive cache strategy, Adaptive processor data-path, Conclusions

36 Motivation
- 32-bit embedded processors are becoming popular, with a confluence of integer scalar programs and multimedia applications on modern embedded processors.
- Multimedia applications typically operate on 8-bit (e.g. video) or 16-bit (e.g. audio) data: such narrow-width operands account for typically 50% of instructions in MediaBench [Brooks et al., HPCA'99].
- Detecting the occurrence of these narrow-width operands on a per-region basis may allow matching the processor data-path width to the bit-width of a program region.

37 Techniques to detect narrow-width operands
- Dynamic approach: detection on a cycle-by-cycle basis by means of hardware (e.g. zero-detection logic); the insignificant bytes are clock-gated to save energy. Problem: efficient for general-purpose systems, but the required hardware cost is often not affordable for embedded systems. Related work includes Brooks et al., HPCA'99 and Canal et al., MICRO'00.
- Compiler approach: use static data-flow analysis to compute the ranges of bit-width values for program variables; re-encode the variables with a smaller bit-width to save energy. Problem: static analysis limits the opportunity for detecting more narrow-width operands, and re-encoding must preserve program correctness, so it is too conservative. Related work includes Stephenson et al., PLDI 2000.
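A profiling-style sketch of narrow-width classification: it assumes the 8/16/32-bit width classes of the adaptive data-path, and a coverage threshold we made up for illustration (this is not the thesis tooling):

```python
# Illustrative sketch: classify operand values of a region into the
# 8/16/32-bit classes an adaptive data-path could run in.

def width_class(value):
    """Smallest of the 8/16/32-bit classes holding a signed value."""
    if -(1 << 7) <= value < (1 << 7):
        return 8
    if -(1 << 15) <= value < (1 << 15):
        return 16
    return 32

def region_width(values, threshold=0.9):
    """Pick the narrowest mode covering `threshold` of the operands;
    the remaining ones would trigger the speculative recovery path."""
    for mode in (8, 16):
        covered = sum(1 for v in values if width_class(v) <= mode)
        if covered / len(values) >= threshold:
            return mode
    return 32

# Mostly 8-bit pixel-like data with one wider address-like value.
values = [12, 100, 7, -3, 90, 15, 70000, 22, 5, 101]
print(region_width(values))  # -> 8
```

The threshold captures the speculation tradeoff from the following slides: a lower threshold picks narrower modes (more dynamic energy saved) at the price of more mis-speculation recoveries.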

38 Program-region analysis
[figure: occurrence of dynamic narrow-width operands in adpcm at basic-block granularity]
- The occurrence of dynamic narrow-width operands at the basic-block level can be high.
- Key idea: adapt the underlying processor data-path width to the dynamic bit-width of the region.

39 Our approach: speculative narrow-width execution
- Avoid relying on hardware support (the dynamic approach) to detect occurrences of narrow-width operands, and avoid relying on static data-flow analysis (the compiler approach) to discover bit-width ranges, which is too conservative.
- Instead, take advantage of runtime information to expose dynamic narrow-width operands to the compiler, and use the compiler to decide when to switch from normal to narrow-width mode and vice versa (via reconfiguration instructions).

40 Speculative narrow-width execution: micro-architecture
- Recovery scheme: simple comparison logic at the execute stage; upon a miss, the pipeline is flushed and the instruction is replayed with the correct mode. The recovery scheme may impact both performance and energy.
- Static energy saving: an adaptive register file that can be viewed as an 8-, 16- or 32-bit register file; unused register-file slices are put in a low-power (drowsy) mode to reduce static energy.
- Dynamic energy saving: data-path clock-gating (pipeline latches, ALU) when a narrow execution mode is encountered.
[figure: register-file slices with slice-enable signals for the 8/16/32-bit modes, and the write-back and bypass paths]

41 Speculative narrow-width execution: compiler support
- Regions are rarely composed of narrow-width operands only.
- Address instructions (AI) usually require a larger bit-width; we split an AI into an address calculation and a memory access via an accumulator register.
- Instructions within a region are scheduled such that those having one 32-bit-wide operand are moved around.
- Reconfiguration instructions are inserted at each region boundary.

42 Results summary
- The impact of the recovery scheme varies with the mis-speculation penalty and the availability of narrow-width operands:
  - with a 5-cycle penalty and 80% narrow-width availability, programs show no performance degradation
  - with a 25-cycle penalty and 60% narrow-width availability, the IPC degradation reaches 30%
- Overall, on the 13 applications from PowerStone, the data-path dynamic energy is reduced by 17% on average.
- We achieve a 22% reduction of the register-file static energy.

43 Agenda: Motivation, Thesis objectives, Program analysis, Power consumption, ILP compilation analysis, Adaptive cache strategy, Adaptive processor data-path, Conclusions

44 Conclusions
- Power consumption is a matter of both software and hardware: software because program execution causes switching transitions (dynamic power), hardware because power consumption grows with architecture complexity.
- Hardware and software techniques must be used jointly to provide an effective basis for reducing power consumption.
- This thesis has provided arguments in favor of profile-driven, compiler-architecture symbiosis approaches that reduce power consumption by:
  - detecting the occurrences of program phases/regions
  - discriminating the optimizations that best benefit a phase/region
  - adapting the micro-architecture to the behavior of a phase/region

45 Future work
- Analogy between ILP and DLP: investigate the energy issues involved in SIMD compilation; this requires a SIMD energy model and measuring the impact of overhead instructions (pack/unpack).
- Catching different program behaviors with a hot-path signature: this will allow us to study the interplay of different reconfiguration techniques for saving energy.
- Energy impact of SIMD compilation with an adaptive i-cache.
- Effectiveness of SIMD compilation in exploiting narrow-width operands (speculative vectorization techniques?).

