1
Dynamically Trading Frequency for Complexity in a GALS Microprocessor
Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott
University of Rochester
2
The gist of the paper…
Radical idea: trade off frequency and hardware complexity dynamically at runtime rather than statically at design time
The new twist: a Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile
3
Application phase behavior
Varying behavior over time [Sherwood, Sair, Calder, ISCA 2003]
Can exploit to save power, e.g., an adaptive issue queue [Buyuktosunoglu, et al., GLSVLSI 2001]
Figure: gcc behavior per interval (IPC, L1I misses, L1D misses, L2 misses, branch mispredictions, energy per interval)
4
What about performance? Lower power and faster access time! [Buyuktosunoglu, GLSVLSI 2001]

Entries   RAM relative delay   CAM relative delay
32        1.00                 1.00
24        0.77                 0.77
16        0.52                 0.55
 8        0.31                 0.34
5
What about performance? How do we exploit the faster speed?
– Variable latency
– Increase frequency when downsizing (rough arithmetic example below)
– Decrease frequency when upsizing
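As a rough worked example (my own arithmetic, not a figure from the paper, and it assumes the resized RAM structure alone sets the domain's cycle time), shrinking a 32-entry structure to 16 entries cuts its relative delay to 0.52, so the clock could in principle be raised by about

\[ \frac{f_{16}}{f_{32}} = \frac{t_{32}}{t_{16}} = \frac{1.0}{0.52} \approx 1.9 \]

In practice the attainable gain is smaller, since other structures and wires in the domain also constrain the clock.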
6
What about performance? [Albonesi, ISCA 1998]
Figure: fully synchronous pipeline driven by a single clock (fetch unit, branch predictor, L1 I-cache, dispatch/rename/ROB, integer and FP issue queues, ALUs & RF, load/store unit, L1 D-cache, L2 cache, main memory)
7
What about performance? [Albonesi, ISCA 1998]
8
Enter GALS… [Semeraro et al., HPCA 2002] [Iyer and Marculescu, ISCA 2002]
Figure: the pipeline partitioned into five clock domains: front-end domain (fetch unit, branch predictor, L1 I-cache, dispatch/rename/ROB), integer domain (issue queue, ALUs & RF), FP domain (issue queue, ALUs & RF), memory domain (load/store unit, L1 D-cache, L2 cache), and external domain (main memory)
9
Outline
Motivation and background
Adaptive GALS microarchitecture
Control mechanisms
Evaluation methodology
Results
Conclusions and future work
10
Adaptive GALS microarchitecture
Figure: the GALS pipeline with resizable structures in each domain: L1 I-cache and branch predictor in the front-end domain, issue queues in the integer and FP domains, and L1 D-cache and L2 cache in the memory domain
11
Adaptive GALS operation
Figure: the same pipeline during operation; each domain selects a size for its resizable structures (I-cache and branch predictor, issue queues, D-cache/L2) and adjusts its own clock frequency to match
12
Resizable cache organization
Access the A part first, then the B part on a miss
Swap the A and B blocks on an A miss, B hit
Select the A/B split according to application phase behavior (sketch below)
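A minimal C sketch of this A/B access policy, assuming a single set with four ways and simplified victim selection (structure and names are illustrative, not the paper's implementation):

```c
#include <stdbool.h>

/* Ways 0..a_ways-1 form the fast A part; the remaining ways form B. */
#define NUM_WAYS 4

typedef struct {
    unsigned long tag[NUM_WAYS];
    bool          valid[NUM_WAYS];
    int           a_ways;            /* current A/B split, chosen per phase */
} cache_set_t;

typedef enum { HIT_A, HIT_B, MISS } access_result_t;

static access_result_t access_set(cache_set_t *set, unsigned long tag)
{
    /* Access the A part first. */
    for (int w = 0; w < set->a_ways; w++)
        if (set->valid[w] && set->tag[w] == tag)
            return HIT_A;

    /* On an A miss, access the B part. */
    for (int w = set->a_ways; w < NUM_WAYS; w++)
        if (set->valid[w] && set->tag[w] == tag) {
            /* A miss, B hit: swap blocks so the hot line moves into A
             * (way 0 stands in for the A victim; the real design uses LRU). */
            unsigned long t = set->tag[0];
            bool          v = set->valid[0];
            set->tag[0] = set->tag[w];  set->valid[0] = true;
            set->tag[w] = t;            set->valid[w] = v;
            return HIT_B;
        }

    /* True miss: fill into the A part (victim selection simplified). */
    set->tag[0]   = tag;
    set->valid[0] = true;
    return MISS;
}
```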
13
Resizable cache control
Track MRU-position hit counters: on each hit, increment MRU[i], where i is the MRU position (0 = MRU, 3 = LRU) of the way that hit
Per-configuration hit estimates:
– Config A1 B3: hits A = MRU[0]; hits B = MRU[1] + MRU[2] + MRU[3]
– Config A2 B2: hits A = MRU[0] + MRU[1]; hits B = MRU[2] + MRU[3]
– Config A3 B1: hits A = MRU[0] + MRU[1] + MRU[2]; hits B = MRU[3]
– Config A4 B0: hits A = MRU[0] + MRU[1] + MRU[2] + MRU[3]; hits B = 0
Calculate the cost for each possible configuration:
– A access cost = (hits A + hits B + misses) × Cost A
– B access cost = (hits B + misses) × Cost B
– Miss access cost = misses × Cost Miss
– Total access cost = A + B + Miss (normalized to frequency)
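The interval-end selection can be summarized in a short sketch, assuming per-interval MRU-position hit counters and per-configuration access costs already normalized to the frequency each configuration allows (names and cost inputs are illustrative assumptions):

```c
#define NUM_CONFIGS 4   /* A1/B3, A2/B2, A3/B1, A4/B0 */

/* mru[i] counts hits at MRU position i during the interval; cost_a/cost_b
 * give each configuration's per-access cost for the A and B parts, and
 * cost_miss the cost of going to the next level. Returns 0..3. */
int pick_best_config(const unsigned mru[4], unsigned misses,
                     const double cost_a[NUM_CONFIGS],
                     const double cost_b[NUM_CONFIGS],
                     double cost_miss)
{
    int best = 0;
    double best_cost = 0.0;

    for (int cfg = 0; cfg < NUM_CONFIGS; cfg++) {
        /* Config cfg keeps the (cfg+1) most-recently-used ways in A. */
        unsigned hits_a = 0, hits_b = 0;
        for (int i = 0; i < 4; i++) {
            if (i <= cfg) hits_a += mru[i];
            else          hits_b += mru[i];
        }

        double total = (hits_a + hits_b + misses) * cost_a[cfg]  /* every access probes A */
                     + (hits_b + misses)          * cost_b[cfg]  /* A misses also probe B */
                     + misses                     * cost_miss;   /* true misses go below  */

        if (cfg == 0 || total < best_cost) {
            best_cost = total;
            best = cfg;
        }
    }
    return best;
}
```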
14
Resizable issue queue control
Measures the exploitable ILP for each queue size
A timestamp counter is reset at the start of an interval and incremented each cycle
During rename, a destination register is given a timestamp based on the timestamp plus execution latency of its slowest source operand
The maximum timestamp, MAX_N, is maintained for each of the four possible queue sizes over N fetched instructions (N = 16, 32, 48, 64)
ILP is estimated as N / MAX_N
The queue size with the highest ILP (normalized to frequency) is selected (sketch below)
Read the paper for details
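A simplified software sketch of this estimator, approximating the slide's timestamp scheme with pure dataflow timestamps over one N-instruction sample per size (register count, function names, and the frequency inputs are illustrative assumptions, not the paper's hardware):

```c
#include <stdint.h>

#define NUM_REGS   64
#define NUM_SIZES  4
static const int queue_sizes[NUM_SIZES] = { 16, 32, 48, 64 };

static uint32_t reg_ts[NUM_REGS];    /* timestamp of the producer of each register */
static uint32_t max_ts[NUM_SIZES];   /* MAX_N for N = 16, 32, 48, 64               */
static uint32_t insts_renamed;

void interval_reset(void)
{
    for (int r = 0; r < NUM_REGS; r++) reg_ts[r] = 0;
    for (int s = 0; s < NUM_SIZES; s++) max_ts[s] = 0;
    insts_renamed = 0;
}

/* Called once per renamed instruction; dest < 0 means no destination. */
void rename_observe(int src1, int src2, int dest, uint32_t exec_latency)
{
    uint32_t slowest = reg_ts[src1] > reg_ts[src2] ? reg_ts[src1] : reg_ts[src2];
    uint32_t ts = slowest + exec_latency;      /* dataflow completion estimate */
    if (dest >= 0)
        reg_ts[dest] = ts;

    insts_renamed++;
    for (int s = 0; s < NUM_SIZES; s++)
        if (insts_renamed <= (uint32_t)queue_sizes[s] && ts > max_ts[s])
            max_ts[s] = ts;                    /* MAX_N over the first N instructions */
}

/* At interval end: pick the size with the best ILP estimate, scaled by the
 * frequency that size permits (freq[] supplied by the caller). */
int pick_queue_size(const double freq[NUM_SIZES])
{
    int best = 0;
    double best_score = 0.0;
    for (int s = 0; s < NUM_SIZES; s++) {
        double ilp = max_ts[s] ? (double)queue_sizes[s] / max_ts[s] : 0.0;
        double score = ilp * freq[s];          /* normalize ILP to domain frequency */
        if (score > best_score) { best_score = score; best = s; }
    }
    return queue_sizes[best];
}
```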
15
Resizable hardware – some details
Front-end domain
– Icache "A": 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
– Branch predictor sized with Icache: gshare PHT 16KB-64KB, local BHT 2KB-8KB, local PHT 1024 entries, meta 16KB-64KB
Load/store domain
– Dcache "A": 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
– L2 cache "A" sized with Dcache: 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
Integer and floating point domains
– Issue queue: 16, 32, 48, or 64 entries
16
Evaluation methodology
SimpleScalar and Cacti
40 benchmarks from SPEC, Mediabench, and Olden
Baseline: best overall performing fully synchronous 21264-like design found out of 1,024 simulated options
Adaptive MCD costs imposed:
– Additional branch penalty of 2 integer domain cycles and 1 front-end domain cycle (overpipelined)
– Frequency penalty of as much as 31%
– Mean PLL locking time of 15 µsec
Program-Adaptive: profile the application and pick the best adaptive configuration for the whole program
Phase-Adaptive: use the online cache and issue queue control mechanisms
17
Performance improvement
Figure: per-benchmark performance improvement for the Mediabench, Olden, and SPEC suites
18
Phase behavior – art
Figure: selected issue queue entries over a 100 million instruction window
19
Phase behavior – apsi
Figure: selected Dcache "A" size (32KB, 64KB, 128KB, 256KB) over a 100 million instruction window
20
Performance summary
Program-Adaptive: 17% performance improvement
Phase-Adaptive: 20% performance improvement
Automatic; never degrades performance for the 40 applications
Few phases in the chosen application windows – could perhaps do better
Distribution of chosen configurations for Program-Adaptive:
– Integer IQ: 16 entries 85%, 32 entries 5%, 48 entries 5%, 64 entries 5%
– FP IQ: 16 entries 73%, 32 entries 15%, 48 entries 8%, 64 entries 5%
– Dcache/L2 cache: 32KB/256KB 50%, 64KB/512KB 18%, 128KB/1MB 23%, 256KB/2MB 10%
– Icache: 16KB 55%, 32KB 18%, 48KB 8%, 64KB 20%
21
Domain frequency versus IQ size
22
Conclusions
Application phase behavior can be exploited to improve performance in addition to power savings
GALS approach is key to localizing the impact of slowing the clock
Cache and queue control mechanisms can evaluate all possible configurations within a single interval
Phase adaptive approach improves performance by as much as 48% and by an average of 20%
23
Future work
Explore multiple adaptive structures in each domain
Better account for the branch predictor
Resize the instruction cache by sets rather than ways
Explore better issue queue design alternatives
Build circuits
Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores