1
ECE 260C – VLSI Advanced Topics
Term Paper Presentation, May 27, 2014
Keyuan Huang, Ngoc Luong
Low Power Processor Architectures and Software Optimization Techniques
2
Motivation
- ~10 billion mobile devices expected by 2018 (global mobile devices and connections growth trend)
- Moore's law is slowing down, while power dissipation per gate remains unchanged
- How to reduce power?
  - Circuit-level optimizations (DVFS, power gating, clock gating)
  - Microarchitecture optimization techniques
  - Compiler optimization techniques
- More innovation is needed in architectural and software techniques to optimize power consumption
3
Low Power Architectures Overview
- Asynchronous processors
  - Eliminate the clock and use handshake protocols
  - Save clock power at the cost of higher area
  - Examples: SNAP, ARM996HS, Sun's Sproull counterflow pipeline
- Application-specific instruction set processors (ASIPs)
  - Applications: cryptography, signal processing, vector processing, physical simulation, computer graphics
  - Combine basic instructions with custom instructions tailored to the application
  - Examples: Tensilica's Xtensa, Altera's NIOS, Xilinx MicroBlaze, Sony's Cell, IRAM, Intel's EXOCHI
- Reconfigurable instruction set processors
  - Combine a fixed core with reconfigurable logic (FPGA)
  - Low NRE cost vs. ASIPs
  - Examples: Chimaera, GARP, PRISC, Warp, Tensilica's Stenos, OptimoDE, PICO
- No-instruction-set computers (NISC)
  - Build a custom datapath from the application code
  - The compiler has low-level control of hardware resources
  - Example: WISHBONE system
Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009).
4
- Combine a GP processor with ASIP-like cores, focusing on reducing energy and energy-delay for a range of applications
- Broader range of applications compared to accelerators
- Reconfigurable via a patching mechanism
- Automatically synthesizable by a toolchain from C source code
- Energy consumption is reduced by up to 16x for individual functions and 2.1x for whole applications
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
5
C-core organization
- Datapath (functional units, muxes, registers)
- Control unit (state machine)
- Cache interface (loads, stores)
- Scan chain (CPU interface)
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
6
C-core execution
- The compiler inserts stubs into code that is compatible with a c-core
- At run time, the stub chooses between the c-core and the CPU: if a matching c-core is available it executes there, otherwise the GP processor runs the original code
- The c-core raises an exception when it finishes executing and returns the result to the CPU
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
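The dispatch just described can be sketched in C. This is a minimal illustration, not the paper's code: the names `sum_sw`, `ccore_available`, and `ccore_run_sum` are hypothetical stand-ins for the scan-chain CPU interface.

```c
#include <stdbool.h>

/* Software version: the original function compiled for the GP processor. */
static int sum_sw(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical stand-ins for the c-core interface; in this sketch no
   c-core is present, so the run stub is a placeholder that is never reached. */
static bool ccore_available(void) { return false; }
static int  ccore_run_sum(const int *a, int n) { (void)a; (void)n; return 0; }

/* The dispatch stub the compiler substitutes at each call site:
   use the c-core if one matches, otherwise fall back to the CPU. */
static int sum(const int *a, int n) {
    if (ccore_available())
        return ccore_run_sum(a, n);   /* offload; the c-core signals
                                         completion via an exception */
    return sum_sw(a, n);              /* software fallback */
}
```

Because the stub preserves the software fallback, a chip whose c-cores no longer match the application still runs correctly, just less efficiently.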
7
Patching support
- Basic block mapping
- Control flow mapping
- Register mapping
- Patch generation
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
8
Patching Example
- Configurable constants
- Generalized single-cycle datapath operators
- Control flow changes
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
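The first two patching mechanisms can be illustrated with a C model of a generalized datapath operator. The struct and names below are assumed for illustration: the opcode and immediate live in patchable configuration registers, so hardware synthesized for, say, `x + 10` can later be retargeted by software.

```c
#include <stdint.h>

enum patch_op { OP_ADD, OP_SUB, OP_AND, OP_OR };

/* Patchable configuration for one generalized single-cycle operator. */
struct patch_regs {
    enum patch_op op;   /* which operation the generalized unit performs */
    int32_t       imm;  /* configurable constant */
};

/* Behavioral model of the generalized operator: instead of a hardwired
   operation and constant, both are read from the patch registers. */
static int32_t generalized_alu(int32_t x, const struct patch_regs *p) {
    switch (p->op) {
    case OP_ADD: return x + p->imm;
    case OP_SUB: return x - p->imm;
    case OP_AND: return x & p->imm;
    default:     return x | p->imm;
    }
}
```

Patching the constant from 10 to 12, or the opcode from `OP_ADD` to `OP_SUB`, changes the c-core's behavior without respinning silicon.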
9
Results
- 18 fully placed-and-routed c-cores vs. a MIPS core: 3.3x–16x energy-efficiency improvement
- Reduce system energy consumption by up to 47%
- Reduce energy-delay by up to 55% at the full-application level
- Even higher energy savings without patching support
Venkatesh, Ganesh, et al. "Conservation cores: reducing the energy of mature computations." ACM SIGARCH Computer Architecture News. Vol. 38. No. 1. ACM, 2010.
10
Software Optimization Techniques
- The memory system consumes 1/10 to 1/4 of the power in portable computers
- System bus switching activity can be controlled by software
- ALU and FPU datapaths need good scheduling to avoid pipeline stalls
- Control logic and clock power are reduced by using the shortest possible program for the computation
K. Roy and M. C. Johnson, "Software Design for Low Power," NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996
11
General categories of software optimization
- Minimizing memory accesses
  - Minimize the accesses needed by the algorithm
  - Minimize the total memory size needed by the algorithm
  - Use multiple-word parallel loads, not single-word loads
- Optimal selection and sequencing of machine instructions
- Instruction packing
- Minimizing circuit-state effects (e.g., operand swapping)
K. Roy and M. C. Johnson, "Software Design for Low Power," NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996
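A small instance of "minimizing memory accesses" is hoisting a loop-invariant load, sketched below with illustrative function names. In the naive version the compiler may re-load `*scale` every iteration, since it cannot always prove that `out` and `scale` do not alias; the hoisted version pays for one load and then uses a register operand.

```c
/* Naive: up to two loads per iteration (in[i] and *scale). */
static void scale_naive(int *out, const int *in, const int *scale, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * *scale;
}

/* Hoisted: *scale is loaded once and kept in a register, leaving one
   load per iteration inside the loop. */
static void scale_hoisted(int *out, const int *in, const int *scale, int n) {
    int s = *scale;                 /* single load, register-allocated */
    for (int i = 0; i < n; i++)
        out[i] = in[i] * s;
}
```

The transformation is behavior-preserving as long as the loop body does not write through `out` into `*scale`, which is exactly the aliasing fact the programmer knows and the compiler may not.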
12
Compiler Managed Partitioned Data Caches for Low Power (Rajiv Ravindran, Michael Chu, Scott Mahlke)
Hardware techniques: banking, dynamic voltage/frequency scaling, dynamic resizing
+ Transparent to the user
+ Handle arbitrary instruction/data accesses
- Limited program information
Software techniques: software-controlled scratch-pads, data/code reorganization
+ Whole-program information
+ Proactive
- Conservative
Basic idea: compiler managed, hardware assisted, combining global program knowledge and proactive optimizations with efficient execution
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)
13
Traditional Cache Architecture
(Figure: address split into tag/set/offset; four ways of tag/data/LRU arrays compared in parallel, with a 4:1 mux selecting the hit way)
- Lookup: activate all ways on every access
- Replacement: choose among all the ways
Disadvantages
- Fixed replacement policy
- Set index ignores program locality
- Set-associativity has high overhead
- Multiple data/tag arrays activated per access
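The tag/set/offset split behind that lookup can be written out directly. The constants here are illustrative, assuming 32-byte blocks and a power-of-two number of sets.

```c
#include <stdint.h>

#define BLOCK_BITS 5u   /* log2 of the 32-byte block size */

/* Integer log2 for a power-of-two set count. */
static unsigned log2u(uint32_t x) {
    unsigned b = 0;
    while (x >>= 1) b++;
    return b;
}

/* Byte offset within the cache block. */
static uint32_t cache_offset(uint32_t addr) {
    return addr & ((1u << BLOCK_BITS) - 1);
}

/* Set index: the bits just above the offset. */
static uint32_t cache_set(uint32_t addr, uint32_t sets) {
    return (addr >> BLOCK_BITS) & (sets - 1);
}

/* Tag: everything above the set index. */
static uint32_t cache_tag(uint32_t addr, uint32_t sets) {
    return addr >> (BLOCK_BITS + log2u(sets));
}
```

The "set index ignores program locality" complaint is visible here: the set is a fixed bit slice of the address, with no input from the compiler.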
14
Partitioned Cache Architecture
(Figure: same tag/set/offset split over four partitions P0–P3; each load/store carries a k-bit partition vector and an R/U flag)
- Lookup: restricted to the partitions specified in the bit-vector if 'R', else defaults to all partitions
- Replacement: restricted to the partitions specified in the bit-vector
Advantages
+ Improve performance by controlling replacement
+ Reduce cache access power by restricting the number of accesses
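A behavioral sketch of the restricted lookup, with illustrative sizes and data layout (4 partitions, 8 sets, direct-mapped per partition): only the partitions named in the bit-vector are probed, which models the reduced tag/data-array activations.

```c
#include <stdbool.h>
#include <stdint.h>

#define NPART 4
#define SETS  8

struct line { uint32_t tag; bool valid; };
static struct line cache[NPART][SETS];

/* 'restricted' models the R/U flag: when false (U), default to probing
   all partitions, as in a conventional set-associative lookup. */
static bool lookup(uint32_t tag, uint32_t set, uint8_t bitvec, bool restricted) {
    for (int p = 0; p < NPART; p++) {
        if (restricted && !(bitvec & (1u << p)))
            continue;                          /* partition not probed */
        if (cache[p][set].valid && cache[p][set].tag == tag)
            return true;                       /* hit */
    }
    return false;                              /* miss */
}
```

A replacement routine would apply the same mask, so victims are only chosen from the instruction's own partitions.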
15
Partitioned Caches: Example
(a) Annotated code segment:

    for (i = 0; i < N1; i++) {
        …
        for (j = 0; j < N2; j++)
            y[i + j] += *w1++ + x[i + j];
        for (k = 0; k < N3; k++)
            y[i + k] += *w2++ + x[i + k];
    }

(b) Fused load/store instructions: ld1 [100], R; ld5 [010], R; ld3 [001], R
(c) Trace of the array references (y, w1/w2, x), their cache blocks, and the loads/stores ld1/st1, ld2/st2, ld3–ld6
(d) Cache partition assignment per instruction: ld1, st1, ld2, st2 (y) → part-0; ld5, ld6 (w1/w2) → part-1; ld3, ld4 (x) → part-3
16
Compiler Controlled Data Partitioning
Goal: place loads/stores into cache partitions
- Analyze the application's memory characteristics
  - Cache requirements: number of partitions per load/store
  - Predict conflicts
- Place loads/stores into different partitions
  - Satisfy each instruction's caching needs
  - Avoid conflicts; overlap where possible
17
Cache Analysis: Estimating Number of Partitions
(Figure: block-access trace X W1 Y Y repeated over the j-loop, then X W2 Y Y over the k-loop; the marked block M has reuse distance = 1)
- Find the minimal number of partitions that avoids conflict/capacity misses
- Probabilistic hit-rate estimate
- Use the reuse distance to compute the number of partitions
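The reuse distance used here can be computed from a block-address trace. The compiler derives it statically from loop analysis, but the definition is the same; below is a minimal O(n²) sketch, fine for small traces.

```c
#include <stdbool.h>

/* Reuse distance of the access at index pos in a block trace: the number
   of DISTINCT other blocks touched since the previous access to the same
   block, or -1 for a first access. */
static int reuse_distance(const int *trace, int pos) {
    int distinct = 0;
    for (int i = pos - 1; i >= 0; i--) {
        if (trace[i] == trace[pos])
            return distinct;                    /* previous access found */
        bool counted = false;                   /* count each block once */
        for (int j = i + 1; j < pos; j++)
            if (trace[j] == trace[i]) { counted = true; break; }
        if (!counted)
            distinct++;
    }
    return -1;  /* first access to this block */
}
```

On the slide's X W1 Y Y pattern, the back-to-back Y accesses have reuse distance 0, and a repeated X sees the two distinct intervening blocks W1 and Y.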
18
Cache Analysis: Estimating Number of Partitions
(Figure: estimated hit rate vs. number of cache blocks, 8–32, for several reuse distances D)
- Avoid conflict/capacity misses for each instruction
- Estimate the hit rate from the reuse distance (D), the total number of cache blocks (B), and the associativity (A) (Brehob et al., '99)
- In reality, compute energy matrices and pick the most energy-efficient configuration per instruction
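One common binomial form of such a hit-rate estimate is sketched below. This is an assumption for illustration, not necessarily the exact model of Brehob et al.: with S = B/A sets, each of the D intervening distinct blocks lands in the reference's set with probability 1/S, and the access hits if fewer than A of them do.

```c
/* Small integer power, to keep the sketch self-contained. */
static double powd(double x, int n) {
    double r = 1.0;
    while (n-- > 0) r *= x;
    return r;
}

/* Binomial coefficient C(n, k) as a double. */
static double binom(int n, int k) {
    double c = 1.0;
    for (int i = 1; i <= k; i++)
        c = c * (n - k + i) / i;
    return c;
}

/* P(hit) = sum_{i=0}^{A-1} C(D, i) * p^i * (1-p)^(D-i), with p = 1/S. */
static double hit_rate(int D, int B, int A) {
    double p = (double)A / B;      /* 1/S, since S = B/A */
    double hr = 0.0;
    for (int i = 0; i < A && i <= D; i++)
        hr += binom(D, i) * powd(p, i) * powd(1.0 - p, D - i);
    return hr;
}
```

Under this model any reference with D < A always hits (the LRU intuition), and the hit rate decays as D grows past the associativity, which is what drives the per-instruction partition-count choice.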
19
Cache Analysis: Computing Interferences
- Avoid conflicts among temporally co-located references
- Model conflicts using an interference graph
(Figure: references M1–M4, each with D = 1, drawn from the X W1 Y Y / X W2 Y Y trace; edges connect references that are live together)
20
Partition Assignment
- The placement phase can overlap references: compute the combined working set
- Use the graph-theoretic notion of a clique
  - For each clique, new D = Σ of the D of each node
  - Combined D over all overlaps = max over all cliques
Example (M1–M4, each with D = 1):
- Clique 1: M1, M2, M4 → new reuse distance D = 3
- Clique 2: M1, M3, M4 → new reuse distance D = 3
- Combined reuse distance = max(3, 3) = 3
Actual cache partition assignment for each instruction: ld1, st1, ld2, st2 (y) → part-0; ld5, ld6 (w1/w2) → part-2; ld3, ld4 (x) → part-1; bit-vectors ld1 [100] R, ld5 [010] R, ld3 [001] R
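The clique computation above is a straightforward sum-then-max, sketched here with illustrative array layouts: each clique sums its members' reuse distances, and the combined reuse distance is the maximum over all cliques.

```c
#define MAXCLIQUE 8

/* D[i] is the reuse distance of reference i; cliques[c] lists the
   reference indices in clique c, clique_len[c] its size. */
static int combined_reuse_distance(const int D[],
                                   int cliques[][MAXCLIQUE],
                                   const int clique_len[],
                                   int ncliques) {
    int best = 0;
    for (int c = 0; c < ncliques; c++) {
        int sum = 0;                  /* new D for this clique */
        for (int i = 0; i < clique_len[c]; i++)
            sum += D[cliques[c][i]];
        if (sum > best)
            best = sum;               /* combined D = max over cliques */
    }
    return best;
}
```

The slide's example falls out directly: cliques {M1, M2, M4} and {M1, M3, M4} with unit distances both sum to 3, so the combined reuse distance is 3.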
21
Experimental Setup
- Trimaran compiler and simulator infrastructure, ARM9 processor model
- Cache configurations: 1 KB to 32 KB, 32-byte block size
- 2, 4, 8 partitions vs. 2-, 4-, 8-way set-associative caches
- Mediabench suite
- CACTI for cache energy modeling
22
Reduction in Tag & Data-Array Checks
(Chart: average way accesses for 1 KB–32 KB caches with 2, 4, and 8 partitions)
25%, 30%, and 36% access reduction on a 2-, 4-, and 8-partition cache, respectively
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)
23
Improvement in Fetch Energy
(Chart: percentage energy improvement for the Mediabench programs on a 16 KB cache, comparing 2-part vs. 2-way, 4-part vs. 4-way, and 8-part vs. 8-way)
8%, 16%, and 25% energy reduction on a 2-, 4-, and 8-partition cache, respectively
Ravindran, R., Chu, M., Mahlke, S.: Compiler Managed Partitioned Data Caches for Low Power. In: LCTES 2007 (2007)
24
Summary
- Maintains the advantages of a hardware cache
- Exposes placement and lookup decisions to the compiler
- Avoids conflicts and eliminates redundancies
- Achieves higher performance and lower power consumption
25
Future Work
- Hybrid scratch-pads and caches
- Develop an advanced toolchain for newer technology nodes such as 28 nm
- Incorporate data-cache partitioning into the ASIP toolchain's compiler
26
References
1. Qadri, Muhammad Yasir, Hemal S. Gujarathi, and Klaus D. McDonald-Maier. "Low Power Processor Architectures and Contemporary Techniques for Power Optimization--A Review." Journal of Computers 4.10 (2009).
2. Venkatesh, Ganesh, et al. "Conservation Cores: Reducing the Energy of Mature Computations." ACM SIGARCH Computer Architecture News, Vol. 38, No. 1. ACM, 2010.
3. Ravindran, R., Chu, M., Mahlke, S. "Compiler Managed Partitioned Data Caches for Low Power." LCTES 2007 (2007).
4. K. Roy and M. C. Johnson. "Software Design for Low Power." NATO Advanced Study Institute on Low Power Design in Deep Submicron Electronics, 1996.