MicroFix: Exploiting Path-grained Timing Adaptability for Improving Power-Performance Efficiency
Guihai Yan, Yinhe Han, Hui Liu, Xiaoyao Liang, Xiaowei Li
Key Laboratory of Computer System and Architecture, ICT (Institute of Computing Technology), CAS, Beijing, P.R. China
NVIDIA Corporation, USA
Outline What’s Path-grained Timing Adaptability (PTA) Potential of PTA for Efficiency Improvement How to Exploit PTA Case Study Results Conclusions
Impact of DVFS on Path Delay
Traditionally, if scaling down the voltage makes paths P1 and P2 timing-critical, the response is to scale down the frequency for all stages of the pipeline.
Question: can these emerging critical paths be salvaged to allow further voltage scaling?
Maybe yes, by fine-grained time stealing!
Timing Imbalance
Generous Flip-flop (GFF): Slack_up > TH, Slack_dn > TH
Backward Adaptable Flip-flop (BAFF): Slack_up > TH, Slack_dn ≤ TH
Forward Adaptable Flip-flop (FAFF): Slack_up ≤ TH, Slack_dn > TH
Unadaptable Flip-flop (UAFF): Slack_up ≤ TH, Slack_dn ≤ TH
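The four flip-flop classes above can be sketched as a small classifier. This is a minimal illustration only: the function and enum names are hypothetical, and it assumes that both slacks and TH are expressed in the same (for example, cycle-normalized) units.

```python
from enum import Enum


class FlipFlopKind(Enum):
    GFF = "generous"
    BAFF = "backward adaptable"
    FAFF = "forward adaptable"
    UAFF = "unadaptable"


def classify_flip_flop(slack_up, slack_dn, th):
    """Classify a flip-flop by its upstream and downstream slack.

    Assumption: slack_up, slack_dn, and th share the same units.
    """
    up_ok = slack_up > th   # upstream path has slack to spare
    dn_ok = slack_dn > th   # downstream path has slack to spare
    if up_ok and dn_ok:
        return FlipFlopKind.GFF
    if up_ok:
        return FlipFlopKind.BAFF
    if dn_ok:
        return FlipFlopKind.FAFF
    return FlipFlopKind.UAFF
```

Only flip-flops classified as UAFF have no slack to trade on either side; the other three classes are the ones fine-grained time stealing can exploit.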
Intrinsic Timing Imbalance
Case study: the FPU adopted by OpenSPARC T1
Supports all IEEE 754 floating-point data types
Synthesized with Synopsys Design Compiler in UMC 0.18um technology
Cycle period: $(1 + 10\%) \times T_{critical}$
The GFFs, FAFFs, and BAFFs account for a considerable, even dominant, proportion: an attractive potential!
DVFS Exacerbating Imbalance
Generally, the timing margin of longer paths diminishes much faster than that of shorter ones.
Assume that the path delay is the sum of the delays of the gates on the path.
$T_G$: the gate delay
Delta: the change in gate delay caused by scaling down the voltage
Define: $\Delta S = |\mathit{Slack}_{dn} - \mathit{Slack}_{up}|$
Example: two paths of n gates and m gates on either side of a flip-flop (figure labels: Slack_dn, Slack_up)
Before voltage scaling down: $\Delta S_1 = (n - m) \times T_G$
After voltage scaling down: $\Delta S_2 = (n - m) \times (T_G + \mathit{Delta})$
Hence $\Delta S_1 < \Delta S_2$
If the Imbalance Is Utilized…
Check the lower bound of the cycle period T.
Traditionally: $T_1 = n \times (T_G + \mathit{Delta})$
From MicroFix's perspective: $T_2 = \frac{m + n}{2} \times (T_G + \mathit{Delta}) \le T_1 - TH$
Note: this analysis excludes the UAFFs.
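The slack-imbalance and cycle-period formulas from the last two slides can be checked with a few lines of arithmetic. This is a sketch under the slides' own simplification (path delay = number of gates × gate delay); the function names and the numbers in the usage note are hypothetical, and n ≥ m is assumed.

```python
def slack_imbalance(n, m, gate_delay):
    """Delta S = (n - m) * gate delay, for paths of n and m gates (n >= m)."""
    return (n - m) * gate_delay


def cycle_lower_bounds(n, m, gate_delay, delta):
    """Return (T1, T2) after voltage scaling.

    T1: traditional bound, set by the longer n-gate path.
    T2: bound when the two paths share their slack evenly.
    """
    scaled_delay = gate_delay + delta
    t1 = n * scaled_delay
    t2 = (m + n) / 2 * scaled_delay
    return t1, t2
```

For example, with n = 10, m = 6, a gate delay of 1.0, and Delta = 0.25 (arbitrary units), the imbalance grows from 4.0 before scaling to 5.0 after it, while the cycle-period bound drops from T1 = 12.5 to T2 = 10.0.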
How to deal with UAFFs?
Two-supply voltage scheme [Usami, JSSC’98] [Ghosh, TCAD’07]
Critical Isolation: the critical paths that result in UAFFs
The supply voltage of the Critical Isolation is more conservative than that of the portion outside the Critical Isolation.
Figure labels: Critical Isolation, powered by the conservative voltage; the exploitable scope of MicroFix, powered by the aggressive voltage
How to “Fix”?
Two-supply voltage scheme
Timing sensors [Yan, DATE’09] [Agarwal, VTS’07]
Multiple-phase clocks (generated by a DLL)
Operational Principles
Ensure that the restored margins ‘v’ and ‘f’ can guard safe voltage and frequency tuning.
Experimental Setup
Gate level: studied the adaptability and overhead with a synthesized FPU (timing information from PrimeTime)
Transistor level: investigated the power-performance tradeoffs with Hspice simulations (32nm PTM models for HP and LP applications, respectively)
Exploring Design Tradeoffs
‘TH’ plays a critical role in determining the ultimate efficiency.
Smaller ‘TH’: smaller Critical Isolation (CI), but less aggressive voltage reduction!
Larger ‘TH’: larger CI, but more aggressive voltage reduction!
What ‘TH’ is optimal?
(Figure labels: Critical Isolation; the exploitable scope of MicroFix)
Exploring Design Tradeoffs /2 Percentage of Cells in Critical Isolation
Exploring Design Tradeoffs /3
Sensor Area Overhead
The area of a sensor is about 8x that of a pipeline flip-flop (based on the number of transistors) [Yan, DATE09].
The paths in the critical isolation and those with ‘over-large’ slack (i.e., slack > $T \times TH + t_{margin}$) do not need to be monitored by sensors.
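The monitoring rule on this slide can be written as a simple predicate. This is a hedged sketch: the function name and argument names are hypothetical, and it assumes slack, the cycle period T, and t_margin are in the same time units, with TH a fraction of T.

```python
def needs_sensor(slack, in_critical_isolation, cycle_period, th, t_margin):
    """Decide whether a path should be monitored by a timing sensor.

    Paths inside the critical isolation are never monitored; paths whose
    slack exceeds T * TH + t_margin are considered safe and are skipped.
    """
    if in_critical_isolation:
        return False
    return slack <= cycle_period * th + t_margin
```

Because each sensor costs roughly eight flip-flops of area, skipping both groups of paths directly reduces the area overhead.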
Exploring Design Tradeoffs /4
Sensor Power Overhead
In the most pessimistic case (TH=0.3, all sensors simultaneously flag timing errors): 14%
However, such a worst-case power overhead can hardly happen, for three reasons:
1) Sensors do not need to be always on.
2) It is almost impossible for all sensors to flag impending timing errors simultaneously.
3) TH=0.3 is actually not an optimal configuration.
Therefore, the pessimistic power overhead won’t offset much of MicroFix’s efficiency gain!
Hspice Simulations
Objective: investigate the detailed delay-power relation of the target pipeline.
Ideally, the transistor-level model of the target pipeline would be simulated directly with Hspice; however, that is very labor-intensive and time-consuming.
So we took an indirect approach to the Hspice simulations:
$P_{total}(V,F) = P_{comb}(V,F) + P_{ff}(V,F)$
$1/F = T = t_c + t_{setup} + t_{c\text{-to-}q}$
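The decomposition on this slide amounts to two small formulas, sketched below. The function names are hypothetical, and reading t_c as the combinational-logic delay is an assumption; the power and delay values themselves would come from the separate combinational and flip-flop simulations.

```python
def total_power(p_comb, p_ff):
    """P_total = P_comb + P_ff, both evaluated at the same (V, F) point."""
    return p_comb + p_ff


def max_frequency(t_c, t_setup, t_c_to_q):
    """F = 1 / T, where T = t_c + t_setup + t_c-to-q.

    Assumption: t_c is the combinational delay; all times share one unit.
    """
    cycle_period = t_c + t_setup + t_c_to_q
    return 1.0 / cycle_period
```

With times in nanoseconds, the returned frequency is in GHz.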
Combinational Component
ISCAS85 benchmarks (c432, c499, c880, c1355, c1908, c2670)
32nm PTM models (HP and LP versions)
The normalized V-D and V-P relations agree well across all of the simulated benchmarks!
Sequential Component V-D V-P
Efficiency Comparison
TH = 0.2 is an optimal choice!
Efficiency improvement: 35% EDP, 28% PDP
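For reference, the two metrics are commonly defined as PDP = power × delay and EDP = energy × delay = power × delay². The sketch below uses these standard definitions with hypothetical helper names; it is not the authors' evaluation flow, and any numbers fed to it are the caller's own.

```python
def pdp(power, delay):
    """Power-delay product: energy spent per operation."""
    return power * delay


def edp(power, delay):
    """Energy-delay product: (power * delay) * delay."""
    return power * delay * delay


def improvement(baseline, improved):
    """Relative reduction of a metric with respect to its baseline."""
    return 1.0 - improved / baseline
```

An improvement of 0.25, for instance, means the metric dropped to 75% of its baseline value.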
Conclusion
MicroFix can improve DVFS efficiency by exploiting path-grained timing adaptability.
The timing imbalance threshold, TH, implies a critical design tradeoff.
EDP improves by up to 35% for HP applications and PDP by up to 28% for LP applications, at the expense of only 7% area overhead.
Thanks! Q&A