1
Chapter 5b Stochastic Circuit Optimization
Prof. Lei He Electrical Engineering Department University of California, Los Angeles URL: eda.ee.ucla.edu
2
Outline
- Thermal Aware Clock Tree Routing
  - Backgrounds and Motivations
  - Modeling and Problem Formulation
  - Algorithms
  - Experimental Results
  - Conclusions
- Temperature Aware Microprocessor Floorplanning
3
Clock Tree Synthesis in Synchronous Circuits
- Clock signals synchronize data transfer between functional elements in a synchronous design.
- Different clock structures exist (tree, mesh, hybrid, etc.).
- Clock skew is the delay difference between two sinks of the clock tree.
- Clock skew has become one of the most significant concerns in clock tree synthesis for high-performance designs.

[Figure: chip block diagram (PLL, MEM-ctrl, Sys, Disp, AUDIO, VIDEO) and a plot of clock skew versus clock frequency. Source: Intel]

In synchronous designs, data transfer between functional elements is synchronized by clock signals. In terms of topology, the clock signal can be delivered by a clock tree, a clock mesh, or a hybrid clock network. This work considers clock tree synthesis only. One important issue in clock synthesis is clock skew, the maximum difference in the arrival time of the clock signal at two different components. Clock skew forces designers to use a longer clock period, which makes the system slower. So, in addition to other objectives, clock skew should be minimized during clock routing. The right diagram shows clock skew versus clock frequency. The main observation is that as the frequency increases, the skew takes up a larger share of the clock period. In fact, clock skew has become the number one concern in clock tree synthesis for high-performance designs.
4
Methodologies for Clock Skew Minimization
The sources of skew
- Unbalanced clock distribution
- Process, supply voltage and temperature (PVT) variation
- Uncertainty from loading

Methodologies
- Active de-skew circuit using a micro-controller
- Passive balanced embedding by CAD algorithms

Variation-induced skew needs to be considered!

[Figure: clock tree topology generation (Topo-Gen) followed by embedding, with nodes s0 to s4, a, b, and v]

High-performance design is achieved by deep-submicron (DSM) technology or heterogeneous integration. It has two trends. One is to design for high speed, constrained by signal, power, and thermal integrity. The other is to design for robustness under process, Vdd, and temperature variation. These trends bring new challenges for CAD. First, designing for high speed introduces strong electromagnetic coupling. Second, deep-submicron design results in a distributed circuit model with a large number of nets and ports. Moreover, the integration is usually heterogeneous and hence results in a structured model with multiple physical domains. In addition, variation and modification introduce a large number of perturbations or parameters to a nominal design. This challenges circuit-level simulation, because detailed verification and automation would never finish in practical time. A fast simulator is therefore needed.
5
Outline
- Thermal Aware Clock Tree Routing
  - Backgrounds and Motivations
  - Modeling and Problem Formulation
    - Variation Sources: Spatial & Temporal
    - Temperature Correlations
  - Algorithms
  - Experimental Results
  - Conclusions
- Temperature Aware Microprocessor Floorplanning
6
Spatial Temperature Variation Induced Skew
- Spatial variation: non-uniform power density generates an on-chip temperature gradient.
- Clock tree embedding that considers the spatial temperature variation: TACO.
- TACO ignores the time-variant temperature under different workloads.

Because functional units are distributed across the chip, the power density is non-uniform. The left figure shows the Intel dual-core architecture. The power dissipation of a core is 15 times larger than that of a cache. Such non-uniform power density may cause a significant on-chip temperature gradient, as shown in the right figure. One piece of work presented in 2005 considers this spatial temperature variation. However, it ignores the temporal variation of the temperature due to different workloads.
7
Temporal Temperature Variation Induced Skew
- Two SPEC2000 applications, Ammp and Gzip, produce significantly different temperature maps.
- Dilemma: optimizing skew for one application hurts the other.

If we run different applications on the same chip, the temperature maps may differ significantly. In the left figure we can achieve zero skew by selecting a good layout under the current temperature map: both source-to-sink path delays are 7ns. However, for the same clock tree layout, when the application changes, the on-chip temperature changes as well. The S->A delay becomes 2ns while the S->B delay becomes 6ns, so the skew becomes 4ns instead of zero. We are now in a dilemma: optimizing clock skew for one application may result in very bad skew for the other. That is exactly the problem we are trying to solve in this work.
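The two-sink example above can be checked with a few lines of Python. This is only a sketch of the metric: the delay values come from the notes, while the function names and workload keys are placeholders introduced here.

```python
def skew(sink_delays):
    """Skew under one temperature map: largest minus smallest sink delay."""
    values = list(sink_delays.values())
    return max(values) - min(values)


def worst_case_skew(delays_by_workload):
    """Largest skew seen across all workloads for one fixed embedding."""
    return max(skew(d) for d in delays_by_workload.values())


# Source-to-sink delays in ns from the two-sink example in the notes.
# The workload keys are placeholder names, not taken from the slide.
delays_by_workload = {
    "workload_1": {"A": 7.0, "B": 7.0},
    "workload_2": {"A": 2.0, "B": 6.0},
}

for name, delays in delays_by_workload.items():
    print(name, "skew =", skew(delays), "ns")
print("worst-case skew =", worst_case_skew(delays_by_workload), "ns")
```

An embedding tuned for the first workload alone reports zero skew, yet its worst-case skew over both workloads is 4ns, which is the quantity the re-embedding tries to reduce.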
8
Problem Formulation
Given:
- The source, sinks, and an initial embedding of the clock tree
- Each region modeled by the mean and variance of its temperature, and the correlation between variations

To find:
- A re-embedding of the clock tree that minimizes the worst-case skew under all temperature variations

Formally, we formulate the problem as follows. We are given the source, sinks, and an initial embedding of the clock tree. Each region is modeled by the mean and variance of its temperature, and by the correlation (covariance) between variations. We try to find a re-embedding of the clock tree that minimizes the worst-case skew under all temperature variations. The figure shows the result for one of our test designs: the black wires are the original clock embedding, and the red wires show the difference between the re-embedded tree and the original one.
9
Correlations in Temperature Variation
Spatial and temporal correlation:
- Strong correlations exist between temperatures for different workloads and different regions on the chip.
- Resource sharing between workloads causes temporal correlation.
- Considering temperature correlations during optimization can compress the search space.

[Figure: correlation map; element (i,j) is the correlation between areas i and j]

Using power-thermal simulation, we extract the correlation between temperature values for different workloads and different regions on the chip. The figure shows the correlation map extracted from a sequence of inputs from 6 SPEC2000 applications. Element (i,j) in this map denotes the correlation strength between sub-regions i and j under different workloads. We observe strong correlation between temperature values for different workloads and different regions on the chip. In fact, the correlation of the temperature variation exceeds 0.8 between most chip regions. By studying the correlations, we can reduce the search space of our algorithm, since the same rules can be applied to tree nodes with strong correlations.
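A correlation map of this kind can be computed from sampled temperature profiles. The following is a minimal sketch, assuming each region has one temperature sample per profile and that the Pearson coefficient is the correlation measure; the region traces are made-up numbers, not simulation data from the slides.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sample lists.

    Assumes neither list is constant (a zero variance would divide by zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def correlation_matrix(temps):
    """temps[r] holds the temperature of region r over all sampled profiles."""
    n = len(temps)
    return [[pearson(temps[i], temps[j]) for j in range(n)] for i in range(n)]


# Hypothetical temperatures (deg C) of three regions over four profiles.
temps = [
    [60.0, 70.0, 80.0, 90.0],  # region 0
    [62.0, 71.0, 83.0, 92.0],  # region 1: tracks region 0 closely
    [90.0, 80.0, 70.0, 60.0],  # region 2: moves opposite to region 0
]
corr = correlation_matrix(temps)
for row in corr:
    print(["%.3f" % value for value in row])
```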
11
Re-embedding Process (An example)
[Figure: clock tree topology (left) and its embedding (right), with nodes y, a, b, c, v; the legend marks the sinks, the original merging point, and the perturbation options]

Let's first look at an example of our perturbation-based algorithm. We are given a clock tree topology, shown on the left, and its embedding, shown on the right. For each merging point (x in this example) we consider several perturbation options, and for each option we calculate the skew that results from that perturbation.
12
Re-embedding Process (An example)
[Figure: the same example after re-embedding, with nodes y, a, b, c, v and the new merging point marked]
13
Delay, Skew Calculation for Clock Tree
- The clock tree is a single-input multiple-output (SIMO) linear system; we care about the impulse response at each sink.
- Perturbed Modified Nodal Analysis (MNA): x is the state for the source, sinks, and merging points; L selects the sink responses.
- A new state variable is defined that holds both the nominal (x) and the perturbed (Δx) state variables, which gives a structured and parameterized state matrix.
- The number of perturbation configurations, I = 5^N, is huge (N is the number of merging points).
14
Compressing State Matrix by Temperature Correlation
Motivations
- Spatial and temporal correlation of the temperature values removes the need to exhaustively calculate all perturbation combinations.
- Highly correlated merging points should be perturbed in the same fashion.

Solution
- Cluster merging points based on correlation strength.
- Perform the same perturbation for all points within one cluster.
15
Merging Points Clustering by Temperature Correlation
Objective
- Given the correlation matrix C of the N merging points, which is a low-rank matrix (N >> K)
- Partition the N merging points into K clusters
- Maximize the correlation strength within each of the K clusters

[Figure: correlation matrix C]
16
Merging Points Clustering by Temperature Correlation
Objective
- Given the correlation matrix C of the N merging points, which is a low-rank matrix (N >> K)
- Partition the N merging points into K clusters

Decide the number of clusters K
- Singular Value Decomposition (SVD) reveals the real rank (K) of C.

Partition the merging points into K clusters
- The K-Means clustering algorithm is employed.

[Figure: low-rank approximation with K = 4, N = 70]

The number of perturbation configurations is reduced from 5^70 to 5^4.
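The two steps above (SVD to pick K, then K-Means to group merging points) can be sketched with NumPy. This is an illustration under assumptions, not the authors' implementation: the 90% singular-value threshold, the farthest-point initialization, the function names, and the 6x6 correlation matrix are all choices made here for the example.

```python
import numpy as np


def estimate_rank(C, energy=0.9):
    """Smallest K whose leading singular values hold `energy` of the total."""
    s = np.linalg.svd(C, compute_uv=False)
    cumulative = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(cumulative, energy)) + 1


def kmeans_rows(X, k, iters=100):
    """Plain K-means on the rows of X with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels.tolist()


# Hypothetical 6x6 correlation matrix: two groups of three merging points.
C = np.full((6, 6), 0.1)
C[:3, :3] = 0.9
C[3:, 3:] = 0.9
np.fill_diagonal(C, 1.0)

K = estimate_rank(C)
labels = kmeans_rows(C, K)
print("K =", K, "labels =", labels)
print("configurations:", 5 ** len(C), "->", 5 ** K)
```

Each merging point is represented by its row of C, so points that correlate with the same neighbours land in the same cluster and share one perturbation choice.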
17
Structural Reduction & Transient Time Analysis
- Cluster-based reduction (SVD + K-Means)
- Structural reduction [Hao Yu, DAC'06]
- Transient time analysis (backward Euler)
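As a reminder of what the backward Euler step looks like, here is a small sketch on a two-segment RC ladder driven by a unit step. The circuit, the element values, and the function names are assumptions for illustration; the reduced macromodel from the slides is not reproduced here.

```python
def backward_euler_rc(r1, c1, r2, c2, h, steps, vin=1.0):
    """Step response at the far node of a two-segment RC ladder.

    Backward Euler: (C/h + G) v[n+1] = (C/h) v[n] + b * vin,
    solved with Cramer's rule because the system is only 2x2.
    """
    g1, g2 = 1.0 / r1, 1.0 / r2
    a11, a12 = c1 / h + g1 + g2, -g2
    a21, a22 = -g2, c2 / h + g2
    det = a11 * a22 - a12 * a21
    v1 = v2 = 0.0
    trace = []
    for _ in range(steps):
        rhs1 = (c1 / h) * v1 + g1 * vin
        rhs2 = (c2 / h) * v2
        v1, v2 = (rhs1 * a22 - a12 * rhs2) / det, (a11 * rhs2 - a21 * rhs1) / det
        trace.append(v2)
    return trace


def delay_50(trace, h, vin=1.0):
    """Time at which the response first reaches 50% of the input step."""
    for n, v in enumerate(trace, start=1):
        if v >= 0.5 * vin:
            return n * h
    return None


# Normalized element values (R = C = 1), chosen only for illustration.
h = 0.01
trace = backward_euler_rc(1.0, 1.0, 1.0, 1.0, h, steps=5000)
t50 = delay_50(trace, h)
print("final value:", trace[-1])
print("50% delay:", t50)
```

Backward Euler is unconditionally stable for this kind of RC system, so the step size only affects accuracy, not stability.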
19
Experimental Settings
- Temperature variation profiles are obtained by a micro-architecture level power-temperature transient simulator with 6 SPEC2000 applications.
- 100 temperature profiles are collected at intervals of 10 million clock cycles.
- Two algorithms are compared:
  - DME method: minimizes wire length for zero skew under the Elmore delay model with nominal temperature.
  - Our PECO: minimizes skew under a more accurate high-order macromodel with temperature variations.
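To see why a tree balanced at nominal temperature can lose its balance, the sketch below evaluates the Elmore delay of a single wire segment whose resistance grows linearly with temperature. Everything numeric here is an assumption for illustration: the linear resistance model, the temperature coefficient of 0.004 per degree C, the per-unit-length values, and the function names are not taken from the slides.

```python
def wire_resistance(r0_per_um, length_um, temp_c, t_ref_c=25.0, tcr=0.004):
    """Wire resistance with an assumed linear temperature coefficient."""
    return r0_per_um * length_um * (1.0 + tcr * (temp_c - t_ref_c))


def elmore_delay(r_wire, c_wire, c_load):
    """Elmore delay of one distributed RC wire driving a lumped load."""
    return r_wire * (c_wire / 2.0 + c_load)


# Two identical branches of a zero-skew tree; only the temperature differs.
length_um = 1000.0
r0_per_um = 0.1        # ohm per um (hypothetical)
c_wire = 0.2e-15 * length_um  # 0.2 fF per um (hypothetical)
c_load = 10e-15        # 10 fF sink load (hypothetical)

d_cold = elmore_delay(wire_resistance(r0_per_um, length_um, 25.0), c_wire, c_load)
d_hot = elmore_delay(wire_resistance(r0_per_um, length_um, 75.0), c_wire, c_load)
print("cold branch delay (s):", d_cold)
print("hot branch delay (s):", d_hot)
print("temperature-induced skew (s):", d_hot - d_cold)
```

Under these assumptions a 50 degree difference between the two branches makes the hot branch 20% slower, so a layout that is zero-skew at nominal temperature is no longer balanced.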
20
Skew Distribution
Under 100 temperature maps, PECO reduces both the worst-case skew and the mean skew.
21
Experimental Results (cont.)
- PECO reduces the worst-case skew by up to 5X (e.g., for net r5).
  - Skew is measured in a higher-order delay model considering temperature variations for all applications.
  - Skew reduction increases for larger clock nets.
- PECO increases wire length by less than 1%.
- Runtime
  - The optimization time of PECO is less than that of DME.
  - Model building time is still long, but the model is more accurate.

Note that the DME method achieves the optimal wire length under zero-skew constraints for the deterministic scenario.
22
Conclusions
- Studied clock optimization for workload-dependent temperature variation.
- Reduced the worst-case skew by up to 5X with only 1% wire-length overhead compared to the best existing method.
- The methodologies can be extended to handle:
  - PVT variations with spatial correlations
  - Other design freedoms such as floorplanning, power/ground optimization, etc.
23
Outline
- Thermal Aware Clock Tree Routing
- Temperature Aware Microprocessor Floorplanning
  - Motivation
  - Problem formulation and models
  - Experimental results
  - Conclusion
24
Motivation
- Ever-increasing integration level and clock rate lead to increased temperature and temperature gradient:
  - Extra clock skew and performance degradation
  - Excessive leakage
  - Increased cooling cost
- Increased clock rate needs interconnect pipelining.
- A microprocessor floorplan should smooth the temperature gradient and also take interconnect pipelining into account.
25
Existing Work
- Quick but not accurate
  - Model temperature by a deterministic heat diffusion model
  - No consideration of interconnect pipelining
- More accurate but far less efficient
  - Calculate temperature for each potential floorplan
  - No explicit interconnect pipelining
27
Problem Formulation
- Find a floorplan for the given soft modules of a microprocessor.
- Minimize [equation], where CPI is the average number of cycles per instruction.
28
CPI Model
- Pre-calculate CPI for a number of floorplans based on the predicted trajectory in the solution space.
- Use table lookup to calculate CPI for a new floorplan by interpolation, based on its distance to floorplans with known CPI.
- Less than 3% error compared to cycle-accurate uArch simulation.
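The table lookup can be pictured as distance-weighted interpolation. The sketch below assumes inverse-distance weighting and a floorplan described by a short feature vector (for instance, interconnect latencies between module pairs); neither the weighting scheme nor the feature encoding is stated in the slides, and the table entries are made-up values.

```python
from math import dist


def interpolate_cpi(query, table, eps=1e-12):
    """Inverse-distance-weighted CPI estimate from floorplans with known CPI.

    `table` is a list of (feature_vector, cpi) pairs; `query` is the feature
    vector of the new floorplan. Both encodings are hypothetical.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for features, cpi in table:
        d = dist(query, features)
        if d < eps:
            return cpi  # exact match with a pre-calculated floorplan
        weight = 1.0 / d
        weighted_sum += weight * cpi
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical pre-calculated entries: (latency features, simulated CPI).
table = [
    ((1.0, 1.0), 0.80),
    ((3.0, 1.0), 1.00),
]
print(interpolate_cpi((1.0, 1.0), table))
print(interpolate_cpi((2.0, 1.0), table))
```

A query that coincides with a stored floorplan returns its stored CPI, and a query halfway between two entries returns their average.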
29
Deterministic Heat Diffusion Model
- The heat diffusion between two modules Mi and Mj: [equation], where the two power-density terms are the average power densities over time.
- The total heat diffusion for module Mi: [equation]
- The bigger the heat diffusion is, the smaller the temperature gradient and Tmax are.

[Figure: two floorplans, (a) and (b), with heat diffusion H]
30
Recast of Problem Formulation
- Find a floorplan for the given soft modules of a microprocessor.
- Minimize [equation]
31
Primary Limitation of Deterministic Heat Diffusion
(a) Transient temperature is higher when power is positively correlated.
(b) Transient temperature is lower when power is negatively correlated.
Average power density ignores power load correlation.
32
Power Correlation of Alpha-chip in SimpleScalar
[Figure: power traces of module pairs, (a) positively correlated and (b) uncorrelated]
33
Calculation of Power Correlation
- Treat the power of each module as a stochastic process.
- Obtain samples of this stochastic process for each module as the transient power simulated over SPEC2000 benchmarks.
- Compute the power correlation between modules as the covariance between these stochastic processes.
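The procedure above amounts to estimating a covariance from sampled power traces. The following sketch uses short made-up traces (not SPEC2000 data) and placeholder function names to show why two module pairs with identical average power can behave differently.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample-mean estimate of cov(X, Y) = E(XY) - E(X)E(Y)."""
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)


# Hypothetical transient power samples (W) of two neighbouring modules.
p_a = [1.0, 3.0, 1.0, 3.0]
p_b_in_phase = [1.0, 3.0, 1.0, 3.0]    # peaks at the same time as p_a
p_b_out_of_phase = [3.0, 1.0, 3.0, 1.0]  # peaks when p_a is low

for label, p_b in [("in phase", p_b_in_phase), ("out of phase", p_b_out_of_phase)]:
    peak_total = max(a + b for a, b in zip(p_a, p_b))
    print(label, "mean:", mean(p_b), "cov:", covariance(p_a, p_b), "peak total:", peak_total)
```

Both cases have the same mean power for each module, so a model based on average power density cannot tell them apart, yet the combined peak power is 6 W in one case and 4 W in the other.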
34
Correlation between Modules
Modules: 1 Decode, 2 Branch, 3 RAT, 4 RUU, 5 LSQ, 6 IALU1, 7 IALU2, 8 IALU3, 9 IntReg, 10 IL1, 11 DL1, 12 IALU4, 13 FPAdd, 14 FPMul, 15 FPReg, 16 L2_1, 17 L2_2, 18 L2_3

[Figure: correlation matrix between modules]

The figure shows the correlation matrix for the 90nm processor. We can roughly partition all modules into three groups. The first group runs from Decode (1) to DL1 (11). The second is IALU4 (12), which does not have a strong correlation with any module. The last runs from FPAdd (13) to L2_3 (18). Modules in the same group are highly positively correlated, and modules in different groups are either uncorrelated or negatively correlated.

We use PTscalar [11] to simulate the power consumption for four integer applications (bzip2, gcc, gzip, and mcf) and three floating-point applications (art, equake, and mesa) in SPEC2000 [12]. With these power vectors, we calculate the mean power density (W/mm2) and standard deviation for each module. We use the SA-based PARQUET [10] as our base floorplan solver, combined with the CPI model [2] and our stochastic heat diffusion model, and run the experiments on a Linux workstation. After completing the whole flow with different objectives, HOTSPOT [5] is used to calculate the temperature for verification purposes only. For each objective, we run ten iterations to acquire the best case and the average case.
35
Correlation between Modules
[Figure: the correlation matrix from the previous slide with one entry highlighted]

Correlation between modules … and 10 is 0.9.
36
Other Limitations: It Ignores Dead Space
- A floorplan has dead spaces, and some modules can diffuse more heat into the dead space.
- Example: M1's temperature is lower in (a) than in (b).
- Not considering dead space may lead to a higher temperature.
37
Other Limitations: It ignores module geometry
- Power density: M1 >> M4 > M2 = M3
- M1 has a higher temperature in (a) than in (b), since M2's area is smaller than M3's area.
- Besides the shared length between modules, the depth of the adjacent module also has to be considered.
38
Other Limitations: It ignores border effect
A module can diffuse different amounts of heat to the border, depending on the package design.
39
Stochastic Heat Diffusion Model
- Given m modules, n dead spaces, and the power vector Pi = [pi1, …, piT] over T time steps for module Mi
- Mean power density for module Mi: [equation]
- Ai is the area of module Mi; PDi is the transient power density vector, which equals Pi/Ai; E(X) is the expectation value of vector X.
40
Stochastic Heat Diffusion Model (Cont.)
If the adjacent module Mj or dead space Nj is totally inside the window, we modify PDj to [equation]
41
Stochastic Heat Diffusion Model (Cont.)
- Heat diffusion to the adjacent modules: [equation], where Lij is the shared length between Mi and Mj
- Heat diffusion to the adjacent dead spaces: [equation], where Cij is the shared length between Mi and Nj
- Heat diffusion to the border: [equation], where Bi is the shared length between Mi and the border
- Con_lateral and Con_adjacent: unit thermal conductances
42
Stochastic Heat Diffusion Model (Cont.)
- Given m modules and n dead spaces
- Power density covariance between Mi and Mj: [equation], where E(PDi,PDj) is the expectation value of PDiPDj over T time steps
- The standard deviation of the total heat diffusion for module Mi: [equation]
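For reference, the standard definitions behind these two quantities are written out below. The coefficients a_j are generic placeholders for whatever geometry-dependent factors multiply each power density; they are an assumption of this note and not the coefficients used in the slide's own formula.

```latex
\operatorname{cov}(PD_i, PD_j) = E(PD_i\,PD_j) - E(PD_i)\,E(PD_j)

\sigma^2\!\Big(\sum_{j} a_j\,PD_j\Big) = \sum_{j}\sum_{k} a_j\,a_k\,\operatorname{cov}(PD_j, PD_k)
```

The second identity is why the standard deviation of a total heat diffusion term depends on the pairwise covariances and not only on each module's own variance.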
43
Stochastic Heat Diffusion Model (Cont.)
- The total stochastic heat diffusion for Mi: [equation]
- Given Z potential hottest modules, the total stochastic heat flow is [equation]
- Wi: weight proportional to [equation]
45
Implementation and Experiment
- The floorplanner uses sequence-pair-based simulated annealing.
- Experiments consider SPEC2000 benchmarks.
- One superscalar processor in 90nm technology.
- Modules are soft with an aspect ratio between 0.33 and 3, and L2 is partitioned into three modules.

Parameter             uP 90nm
Issue Width           4
Die Area (mm2)        100
Die Thickness (mm)    0.5
Heat Spreader (mm2)   900
Heat Sink (mm2)       2500
46
Comparison with HotSpot tool
Our model
- Reduces temperature by up to 3°C with a 1.34% increase in area
- Runs up to 27x faster

uP in 90nm   Tmax(°C)   Area(mm2)(WS)   Runtime(s)
[JILP'05]    93.0       119.4(4.7%)     2300
Ours         90.0       121.0(5.6%)     85
Impact       -3.2%      +1.34%          1/27x

WS: white space (dead space) percentage. For example, 217(4.3%) means the total area is 217 and the white space percentage is 4.3%.
The impact row compares total areas, not white spaces. For example, +1.34% means our area is 1.34% larger than that of [JILP'05].
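The impact row can be reproduced directly from the two rows above it. This short check uses only the numbers in the table; the helper name is introduced here.

```python
def relative_change(ours, baseline):
    """Percentage change of `ours` relative to `baseline`."""
    return (ours - baseline) / baseline * 100.0


tmax_change = relative_change(90.0, 93.0)    # peak temperature
area_change = relative_change(121.0, 119.4)  # total area
speedup = 2300 / 85                          # runtime ratio

print(f"Tmax: {tmax_change:+.1f}%  area: {area_change:+.2f}%  speedup: {speedup:.0f}x")
```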
47
Impact of Thermal Modeling
uP in 90nm

Obj.   CPI Best         CPI Avg          Tmax(°C) Best   Tmax(°C) Avg   Area(mm2)(WS%) Best    Area(mm2)(WS%) Avg
AC     0.820            0.890            97.7            96.7           118.5(3.05)            122.4(6.89)
ACHd   0.995 (+21.3%)   1.000 (+12.4%)   92.0 (-5.8%)    92.2 (-4.7%)   122.0(6.67) (+2.9%)    125.3(9.08) (+2.3%)
ACHs   0.880 (+7.3%)    0.954 (+7.2%)    88.8 (-9.1%)    88.9 (-8.1%)   121.1(6.10) (+2.2%)    123.2(7.36) (+0.6%)

Obj: A: area, C: CPI, Hd: [Han:TACS'05], Hs: ours.
For ACHd and ACHs, the percentages compare each objective against AC.

- 97.7 - 88.8 = 8.9°C
- 92.0 - 88.8 = 3.2°C
- 0.995 / 0.880 ≈ 1.13x

Our stochastic thermal model reduces temperature by up to 8.9°C compared to the thermal-oblivious floorplanner. Compared with the deterministic model, our model obtains up to 3.2°C reduction of the on-chip peak temperature and 1.13x better CPI performance.
48
Conclusions
- A stochastic heat diffusion model is developed to effectively capture the correlation between transient power over workloads.
- An efficient yet effective thermal-aware uP floorplanning method is proposed.
49
Reading Assignment
- Hao Yu, Yu Hu, Chuenchen Liu, and Lei He, "Minimal Skew Clock Embedding Considering Time Variant Temperature Variation Gradient," ACM International Symposium on Physical Design (ISPD), March 2007.
- Chun-Ta Chu, Xinyi Zhang, Lei He, and Tom Tong Jing, "Temperature Aware Microprocessor Floorplanning Considering Application Dependent Power Load," IEEE/ACM International Conf. on Computer-Aided Design (ICCAD), 2007.