
1 Dynamically Trading Frequency for Complexity in a GALS Microprocessor Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott University of Rochester

2 The gist of the paper… Radical idea: Trade off frequency and hardware complexity dynamically at runtime rather than statically at design time The new twist: A Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile

3 Application phase behavior
Varying behavior over time [Sherwood, Sair, Calder, ISCA 2003]
Can exploit to save power, e.g., an adaptive issue queue [Buyuktosunoglu, et al., GLSVLSI 2001]
[Figure: gcc over time — IPC, L1I misses, L1D misses, L2 misses, branch mispredictions, and energy per interval]

4 What about performance? Lower power and faster access time! [Buyuktosunoglu, GLSVLSI 2001]
RAM delay (entries → relative delay): 32 → 1.0, 24 → 0.77, 16 → 0.52, 8 → 0.31
CAM delay (entries → relative delay): 32 → 1.0, 24 → 0.77, 16 → 0.55, 8 → 0.34

5 What about performance? How do we exploit the faster speed? Variable latency Increase frequency when downsizing Decrease frequency when upsizing
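The trade-off on slides 4–5 can be sketched numerically. A minimal model, assuming the resized structure sets the domain's critical path (the delay tables are from slide 4; the function name and the critical-path assumption are mine):

```python
# Relative access delays from slide 4 [Buyuktosunoglu, GLSVLSI 2001].
RAM_DELAY = {32: 1.00, 24: 0.77, 16: 0.52, 8: 0.31}
CAM_DELAY = {32: 1.00, 24: 0.77, 16: 0.55, 8: 0.34}

def freq_scale(entries, delay_table):
    """Relative clock frequency attainable if this structure limits the
    domain's cycle time: shrinking the structure shortens its delay,
    so the domain clock can run proportionally faster."""
    return 1.0 / delay_table[entries]
```

For example, halving a CAM-based issue queue from 32 to 16 entries cuts its delay to 0.55x, permitting up to a 1/0.55 ≈ 1.8x faster domain clock; upsizing forces the clock back down.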

6 What about performance? [Albonesi, ISCA 1998]
[Diagram: fully synchronous pipeline — Fetch Unit with Br Pred and L1 I-Cache; Dispatch, Rename, ROB; integer and FP Issue Queues with ALUs & RF; Ld/St Unit with L1 D-Cache; L2 Cache; Main Memory — all driven by a single clock]

7 What about performance? [Albonesi, ISCA 1998]

8 Enter GALS… [Semeraro et al., HPCA 2002] [Iyer and Marculescu, ISCA 2002]
[Diagram: pipeline partitioned into five clock domains — Front-end domain (Fetch Unit, Br Pred, L1 I-Cache, Dispatch/Rename/ROB), Integer domain (Issue Queue, ALUs & RF), FP domain (Issue Queue, ALUs & RF), Memory domain (Ld/St Unit, L1 D-Cache, L2 Cache), and External domain (Main Memory)]

9 Outline  Motivation and background  Adaptive GALS microarchitecture  Control mechanisms  Evaluation methodology  Results  Conclusions and future work

10 Adaptive GALS microarchitecture
[Diagram: the five-domain GALS pipeline with resizable structures — L1 I-Cache and Br Pred in the front-end domain, Issue Queues in the integer and FP domains, L1 D-Cache and L2 Cache in the memory domain]

11 Adaptive GALS operation
[Diagram: the same five-domain pipeline in operation, with the resizable structures adapted to the current phase]

12 Resizable cache organization  Access the A part first, then the B part on a miss  Swap A and B blocks on an A miss, B hit  Select the A/B split according to application phase behavior
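The A/B lookup policy above can be sketched for one cache set. A minimal model (class and method names are mine; true-LRU ordering stands in for the hardware's MRU bookkeeping, and promoting a B hit to MRU models the A/B block swap):

```python
class ResizableSet:
    """One set of a resizable cache, split into a fast A part and a
    slower B part (slide 12).  Blocks are kept in true-LRU order:
    index 0 is most recently used."""

    def __init__(self, ways, ways_a):
        self.ways = ways        # total associativity of the set
        self.ways_a = ways_a    # current A/B split point
        self.order = []         # tags in MRU -> LRU order

    def access(self, tag):
        if tag in self.order:
            pos = self.order.index(tag)
            # Positions below the split are the fast A part.
            hit = "A" if pos < self.ways_a else "B"
            # Move the hit block to MRU; on a B hit this swaps it into
            # the A part, pushing A's LRU block into the B positions.
            self.order.remove(tag)
            self.order.insert(0, tag)
            return hit
        # Miss: fill into the A part (MRU), evicting the true-LRU block.
        if len(self.order) == self.ways:
            self.order.pop()
        self.order.insert(0, tag)
        return "miss"
```

With a 4-way set split A2/B2, a block that has aged into the B part still hits (more slowly) and is swapped back into A on that hit.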

13 Resizable cache control
Track hits by MRU position: on each hit, increment MRU[i], where i is the recency position of the block hit (MRU[0] = most recent, MRU[3] = least recent)
Hits for every candidate split follow directly:
Config A1/B3: hits A = MRU[0]; hits B = MRU[1] + MRU[2] + MRU[3]
Config A2/B2: hits A = MRU[0] + MRU[1]; hits B = MRU[2] + MRU[3]
Config A3/B1: hits A = MRU[0] + MRU[1] + MRU[2]; hits B = MRU[3]
Config A4/B0: hits A = MRU[0] + MRU[1] + MRU[2] + MRU[3]; hits B = 0
Calculate the cost for each possible configuration:
A access cost = (hits A + hits B + misses) × Cost A
B access cost = (hits B + misses) × Cost B
Miss access cost = misses × Cost Miss
Total access cost = A + B + Miss (normalized to frequency)
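The cost calculation above can be sketched as follows (the function name and the per-split cost tables are illustrative; in the real design the per-configuration costs come from circuit timing and the domain frequency):

```python
def best_config(mru, misses, cost_a, cost_b, cost_miss):
    """Choose the A/B split with the lowest total access cost (slide 13).
    mru[i]    = hits at MRU position i over the last interval
    cost_a[s] = per-access cost of the A part when it holds s ways
    cost_b[s] = per-access cost of the B part for that split
    All costs are assumed already normalized to each configuration's
    clock frequency."""
    best_split, best_cost = None, float("inf")
    for split in range(1, len(mru) + 1):       # A1/B3, A2/B2, A3/B1, A4/B0
        hits_a = sum(mru[:split])
        hits_b = sum(mru[split:])
        total = ((hits_a + hits_b + misses) * cost_a[split]  # every access probes A
                 + (hits_b + misses) * cost_b[split]         # A misses also probe B
                 + misses * cost_miss)                       # true misses go off-chip
        if total < best_cost:
            best_split, best_cost = split, total
    return best_split, best_cost
```

With mostly-MRU[0] hits and a small A part that is much faster than a big one, the search correctly favors the smallest split, since every access pays the A-part latency.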

14 Resizable issue queue control  Measures the exploitable ILP for each queue size  A timestamp counter is reset at the start of each interval and incremented every cycle  During rename, a destination register is given a timestamp equal to the timestamp of its slowest source operand plus its execution latency  The maximum timestamp, MAX_N, is maintained for each of the four possible queue sizes over N fetched instructions (N = 16, 32, 48, 64)  ILP is estimated as N / MAX_N  The queue size with the highest ILP (normalized to frequency) is selected Read the paper
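The timestamp scheme above amounts to measuring dataflow height over a window. A software sketch (the instruction encoding, names, and the interpretation that the destination timestamp is max source timestamp plus instruction latency are mine; the real mechanism runs in hardware at rename):

```python
def estimate_ilp(insts, sizes=(16, 32, 48, 64)):
    """Estimate exploitable ILP for each candidate queue size (slide 14).
    insts: instructions in rename order, each a (dest, sources, latency)
    tuple.  The dataflow height MAX_N through the first N instructions
    bounds how fast a size-N window could drain them: ILP ~ N / MAX_N."""
    ts = {}                # register -> completion timestamp
    max_ts = 0             # MAX_N, the dataflow critical-path length so far
    ilp = {}
    for n, (dest, srcs, lat) in enumerate(insts, start=1):
        # Destination becomes ready when its slowest source finishes,
        # plus this instruction's execution latency.
        ready = max((ts.get(s, 0) for s in srcs), default=0)
        ts[dest] = ready + lat
        max_ts = max(max_ts, ts[dest])
        if n in sizes:     # record the estimate at each queue size
            ilp[n] = n / max_ts
    return ilp
```

A fully parallel stream gets ILP equal to the window size, so the largest queue wins; a serial dependence chain gets ILP of 1 at every size, so the smallest (fastest-clocked) queue wins once the estimates are normalized to frequency.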

15 Resizable hardware – some details
Front-end domain: Icache "A": 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way; branch predictor sized with Icache – gshare PHT: 16KB–64KB – local BHT: 2KB–8KB – local PHT: 1024 entries – meta: 16KB–64KB
Load/store domain: Dcache "A": 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way; L2 cache "A" sized with Dcache – 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
Integer and floating-point domains: issue queue: 16, 32, 48, or 64 entries

16 Evaluation methodology  SimpleScalar and Cacti  40 benchmarks from SPEC, Mediabench, and Olden  Baseline: the best-performing fully synchronous 21264-like design out of 1,024 simulated options  Adaptive MCD costs imposed: an additional branch penalty of 2 integer-domain cycles and 1 front-end-domain cycle (due to overpipelining); a frequency penalty of up to 31%  Mean PLL locking time of 15 µsec  Program-Adaptive: profile the application and pick the best adaptive configuration for the whole program  Phase-Adaptive: use the online cache and issue queue control mechanisms

17 Performance improvement
[Chart: performance improvement across the Mediabench, Olden, and SPEC benchmark suites]

18 Phase behavior – art
[Chart: issue queue entries over a 100 million instruction window]

19 Phase behavior – apsi
[Chart: Dcache "A" size (32KB, 64KB, 128KB, 256KB) over a 100 million instruction window]

20 Performance summary  Program-Adaptive: 17% performance improvement  Phase-Adaptive: 20% performance improvement Automatic Never degrades performance for the 40 applications Few phases in the chosen application windows – could perhaps do better  Distribution of chosen configurations for Program-Adaptive:
Integer IQ: 16 entries 85%, 32 entries 5%, 48 entries 5%, 64 entries 5%
FP IQ: 16 entries 73%, 32 entries 15%, 48 entries 8%, 64 entries 5%
D/L2 cache: 32KB/256KB 50%, 64KB/512KB 18%, 128KB/1MB 23%, 256KB/2MB 10%
Icache: 16KB 55%, 32KB 18%, 48KB 8%, 64KB 20%

21 Domain frequency versus IQ size

22 Conclusions  Application phase behavior can be exploited to improve performance in addition to power savings  GALS approach is key to localizing the impact of slowing the clock  Cache and queue control mechanisms can evaluate all possible configurations within a single interval  Phase adaptive approach improves performance by as much as 48% and by an average of 20%

23 Future work  Explore multiple adaptive structures in each domain  Better take into account the branch predictor  Resize the instruction cache by sets rather than ways  Explore better issue queue design alternatives  Build circuits  Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores

24 Dynamically Trading Frequency for Complexity in a GALS Microprocessor Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott University of Rochester

