Dynamically Trading Frequency for Complexity in a GALS Microprocessor
Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott
University of Rochester
The gist of the paper…
- Radical idea: trade off frequency and hardware complexity dynamically at runtime rather than statically at design time
- The new twist: a Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile
Application phase behavior
- Applications exhibit varying behavior over time [Sherwood, Sair, Calder, ISCA 2003]
- This can be exploited to save power, e.g., with an adaptive issue queue [Buyuktosunoglu et al., GLSVLSI 2001]
(figure: gcc traces over time of L2 misses, IPC, L1 I-cache misses, L1 D-cache misses, branch mispredictions, and energy per interval)
What about performance?
- Downsizing gives lower power and faster access time!
(figure: relative delay versus number of entries for RAM and CAM structures [Buyuktosunoglu, GLSVLSI 2001])
What about performance?
- How do we exploit the faster speed?
  - Variable latency
  - Increase frequency when downsizing; decrease frequency when upsizing
What about performance? [Albonesi, ISCA 1998]
(figure: a fully synchronous pipeline — fetch unit, branch predictor, L1 I-cache, dispatch/rename/ROB, integer and FP issue queues with ALUs & register files, load/store unit, L1 D-cache, L2 cache, main memory — all driven by a single global clock)
Enter GALS…
(figure: the same pipeline partitioned into five clock domains — front-end, integer, floating-point, memory, and external — each with its own clock) [Semeraro et al., HPCA 2002] [Iyer and Marculescu, ISCA 2002]
Outline
- Motivation and background
- Adaptive GALS microarchitecture
- Control mechanisms
- Evaluation methodology
- Results
- Conclusions and future work
Adaptive GALS microarchitecture
(figure: the MCD pipeline with resizable structures — branch predictor, L1 I-cache, issue queues, L1 D-cache, and L2 cache — within the front-end, integer, floating-point, memory, and external domains)
Adaptive GALS operation
(figure: the same pipeline, shown as a structure such as the L1 I-cache resizes while its domain's clock frequency changes to match)
Resizable cache organization
- Access the A part first, then the B part on a miss
- Swap the A and B blocks on an A miss that hits in B
- Select the A/B split according to application phase behavior (a sketch of this access protocol follows)
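A minimal Python sketch of the A/B access protocol above, assuming the set's ways are kept in MRU order; the ABCache name and interface are ours, for illustration only.

```python
class ABCache:
    """Toy model of one set of the resizable cache.

    The set's ways are split into a primary A partition (probed first)
    and a secondary B partition (probed only on an A miss). Ways are
    kept in MRU order, so ways[0] is the most recently used block.
    """

    def __init__(self, ways, a_size):
        self.ways = list(ways)   # block tags, index 0 = MRU
        self.a_size = a_size     # number of ways in the A partition

    def access(self, tag):
        """Return 'A', 'B', or 'miss' for this reference."""
        if tag in self.ways:
            pos = self.ways.index(tag)
            hit = 'A' if pos < self.a_size else 'B'
            # Moving the hit block to the MRU slot realizes the slide's
            # swap rule: a B hit migrates into A, displacing an A block
            # toward B.
            self.ways.pop(pos)
            self.ways.insert(0, tag)
            return hit
        # Miss: fill into the MRU slot, evicting the LRU block.
        self.ways.pop()
        self.ways.insert(0, tag)
        return 'miss'

# e.g. ABCache("ABCD", a_size=1).access("C") returns 'B',
# and C then occupies the single A way.
```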
Resizable cache control
- Maintain MRU-position hit counters MRU[0..3], ordered from MRU to LRU; each hit increments the counter for the hit block's current MRU position (e.g., a hit on the MRU block increments MRU[0])
- The counters give the A and B hits for every possible split:
  - Config A1/B3: hits A = MRU[0]; hits B = MRU[1] + MRU[2] + MRU[3]
  - Config A2/B2: hits A = MRU[0] + MRU[1]; hits B = MRU[2] + MRU[3]
  - Config A3/B1: hits A = MRU[0] + MRU[1] + MRU[2]; hits B = MRU[3]
  - Config A4/B0: hits A = MRU[0] + MRU[1] + MRU[2] + MRU[3]; hits B = 0
- Calculate the cost of each possible configuration:
  - A access cost = (hits A + hits B + misses) * Cost A  (every access probes A first)
  - B access cost = (hits B + misses) * Cost B
  - Miss access cost = misses * Cost Miss
  - Total access cost = A + B + Miss (normalized to frequency)
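This per-interval bookkeeping is straightforward to express in code. Below is a minimal Python sketch of the selection step, assuming cost_a[k] and cost_b[k] are per-access latencies indexed by the number of A ways and already normalized to the frequency each split permits; the function name and argument layout are ours, not the paper's hardware.

```python
def best_split(mru, misses, cost_a, cost_b, cost_miss):
    """Choose the A partition size (in ways) with the lowest total cost.

    mru[i]  -- hits to the block in MRU position i over the interval
    cost_a[k], cost_b[k] -- per-access latency of the A and B parts
        when A holds k ways (frequency-normalized); cost_b[len(mru)]
        should be 0, since the B part is empty in the A4/B0 split.
    """
    accesses = sum(mru) + misses         # every reference probes A
    best_k, best_cost = None, None
    for k in range(1, len(mru) + 1):     # splits A1/B3 .. A4/B0
        hits_b = sum(mru[k:])            # hits that would land in B
        total = (accesses * cost_a[k]              # A probes
                 + (hits_b + misses) * cost_b[k]   # A misses probe B
                 + misses * cost_miss)             # true misses
        if best_cost is None or total < best_cost:
            best_k, best_cost = k, total
    return best_k
```

Because the MRU counters summarize every possible split at once, one pass over four candidate configurations per interval suffices; no trial reconfiguration is needed.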
Resizable issue queue control
- Measures the exploitable ILP for each possible queue size
- A timestamp counter is reset at the start of each interval and incremented every cycle
- During rename, a destination register is given a timestamp equal to the timestamp of its slowest source operand plus the execution latency
- The maximum timestamp MAX_N is maintained for each of the four possible queue sizes, over N fetched instructions (N = 16, 32, 48, 64)
- ILP is estimated as N / MAX_N
- The queue size with the highest ILP (normalized to frequency) is selected
- Read the paper for the details
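A schematic Python version of this estimate, assuming instructions arrive in fetch order as (dest, sources, latency) tuples; that encoding and the function name are our illustration of the mechanism, not the paper's hardware.

```python
def estimate_ilp(instrs, sizes=(16, 32, 48, 64)):
    """Estimate exploitable ILP for each candidate queue size.

    instrs -- iterable of (dest_reg, src_regs, latency) in fetch order.
    A destination's timestamp is its slowest source's timestamp plus
    the execution latency, i.e., the dataflow critical-path height.
    After N instructions, the maximum timestamp MAX_N bounds how long
    any schedule must take, so ILP is estimated as N / MAX_N.
    """
    ts = {}                          # register -> ready timestamp
    max_ts = 0
    ilp = {}
    for n, (dest, srcs, latency) in enumerate(instrs, start=1):
        ready = max((ts.get(s, 0) for s in srcs), default=0)
        ts[dest] = ready + latency
        max_ts = max(max_ts, ts[dest])
        if n in sizes:               # snapshot MAX_N at each queue size
            ilp[n] = n / max_ts
    return ilp  # pick the size with highest ILP, normalized to frequency
```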
Resizable hardware – some details
- Front-end domain
  - I-cache "A": 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
  - Branch predictor sized with the I-cache
    - gshare PHT: 16KB–64KB
    - Local BHT: 2KB–8KB
    - Local PHT: 1024 entries
    - Meta: 16KB–64KB
- Load/store domain
  - D-cache "A": 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
  - L2 cache "A" sized with the D-cache: 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
- Integer and floating-point domains
  - Issue queue: 16, 32, 48, or 64 entries
Evaluation methodology
- SimpleScalar and Cacti
- 40 benchmarks from SPEC, Mediabench, and Olden
- Baseline: best overall performing fully synchronous design out of 1,024 simulated configurations
- Adaptive MCD costs imposed:
  - Additional branch penalty of 2 integer domain cycles and 1 front-end domain cycle (overpipelined)
  - Frequency penalty of as much as 31%
  - Mean PLL locking time of 15 µs
- Program-Adaptive: profile the application and pick the best adaptive configuration for the whole program
- Phase-Adaptive: use the online cache and issue queue control mechanisms
Performance improvement
(figure: per-benchmark performance improvement for the Mediabench, Olden, and SPEC suites)
Phase behavior – art
(figure: issue queue entries selected over a 100 million instruction window)
Phase behavior – apsi
(figure: D-cache "A" size selected — 32KB, 64KB, 128KB, or 256KB — over a 100 million instruction window)
Performance summary
- Program-Adaptive: 17% performance improvement
- Phase-Adaptive: 20% performance improvement
  - Automatic
  - Never degrades performance across the 40 applications
  - Few phases in the chosen application windows – could perhaps do better
- Distribution of chosen configurations for Program-Adaptive:
  - Integer IQ: 16 entries 85%, 32 entries 5%, 48 entries 5%, 64 entries 5%
  - FP IQ: 16 entries 73%, 32 entries 15%, 48 entries 8%, 64 entries 5%
  - D/L2 cache: 32KB/256KB 50%, 64KB/512KB 18%, 128KB/1MB 23%, 256KB/2MB 10%
  - I-cache: 16KB 55%, 32KB 18%, 48KB 8%, 64KB 20%
Domain frequency versus IQ size
(figure: domain clock frequency as a function of issue queue size)
Conclusions
- Application phase behavior can be exploited to improve performance in addition to saving power
- The GALS approach is key to localizing the impact of slowing the clock
- The cache and queue control mechanisms can evaluate all possible configurations within a single interval
- The phase-adaptive approach improves performance by as much as 48% and by an average of 20%
Future work
- Explore multiple adaptive structures in each domain
- Better account for the branch predictor
- Resize the instruction cache by sets rather than ways
- Explore better issue queue design alternatives
- Build circuits
- Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores