Published by Theodora Hubbard. Modified over 8 years ago.
1
ELEC 5270/6270 Spring 2015
Low-Power Design of Electronic Circuits
Power Aware Microprocessors
Vishwani D. Agrawal, James J. Danaher Professor
Dept. of Electrical and Computer Engineering, Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E6270_Spr15/course.html
Copyright Agrawal, 2007. ELEC5270/6270 Spr 15, Lecture 8
2
SIA Roadmap for Processors (1999)

| Year | 1999 | 2002 | 2005 | 2008 | 2011 | 2014 |
|---|---|---|---|---|---|---|
| Feature size (nm) | 180 | 130 | 100 | 70 | 50 | 35 |
| Logic transistors/cm² | 6.2M | 18M | 39M | 84M | 180M | 390M |
| Clock (GHz) | 1.25 | 2.1 | 3.5 | 6.0 | 10.0 | 16.9 |
| Chip size (mm²) | 340 | 430 | 520 | 620 | 750 | 900 |
| Power supply (V) | 1.8 | 1.5 | 1.2 | 0.9 | 0.6 | 0.5 |
| High-perf. power (W) | 90 | 130 | 160 | 170 | 175 | 183 |

Source: http://www.semichips.org
Note: Untrue predictions.
3
Power Reduction in Processors
- Hardware methods:
  - Voltage reduction for dynamic power
  - Dual-threshold devices for leakage reduction
  - Clock gating, frequency reduction
  - Sleep mode
- Architecture:
  - Instruction set
  - Hardware organization
- Software methods
4
Performance Criteria
- Throughput: computations per unit time.
- Performance is the inverse of time: increasing CPU time indicates lower performance.
- Power: computations per watt.
- Energy efficiency: performance/joule.
5
SPEC CPU2006 Benchmarks
- Standard Performance Evaluation Corporation (SPEC), http://www.spec.org
- Twelve integer and 17 floating point programs, CINT2006 and CFP2006.
- Each program run time is normalized to obtain a SPEC ratio with respect to the run time of a Sun Ultra Enterprise 2 system with a 296 MHz UltraSPARC II processor.
- It takes about 12 days to run all benchmarks on the reference system.
- CINT2006 and CFP2006 metrics are the geometric means of SPEC ratios:
  - Peak metric: each program is individually optimized (aggressive compilation).
  - Base metric: common optimization for all programs.
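The normalization and averaging described on this slide can be sketched in a few lines. This is a minimal illustration, not SPEC's tooling: the function names and the run times in `runs` are made up for the example, and a SPEC ratio is taken as reference run time divided by measured run time.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """Ratio of reference-machine run time to measured run time (higher is faster)."""
    return reference_time_s / measured_time_s


def geometric_mean(values):
    """Geometric mean of positive numbers, computed through logarithms."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (reference, measured) run times in seconds for three programs.
runs = [(9770.0, 400.0), (9650.0, 600.0), (8050.0, 350.0)]
ratios = [spec_ratio(ref, measured) for ref, measured in runs]
overall_metric = geometric_mean(ratios)
print(f"SPEC ratios: {[round(r, 2) for r in ratios]}, metric: {overall_metric:.2f}")
```

The geometric mean is used so that the ranking of two machines does not depend on which machine is chosen as the reference.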
6
SPEC CINT2006 Results
http://www.spec.org/cpu2006/results/cint2006.html
- Dell Inc., PowerEdge R610
  - CPU: Intel Xeon X5670, 2.93 GHz
  - Number of chips 2, cores 12, threads/core 2
  - Performance metric 36.6 base, 39.4 peak
- Dell Inc., PowerEdge M905
  - CPU: AMD Opteron 8381 HE, 2.50 GHz
  - Number of chips 4, cores 16, threads/core 1
  - Performance metric 15.8 base, 19.1 peak
7
SPEC CFP2006 Results
http://www.spec.org/cpu2006/results/cfp2006.html
- Dell Inc., PowerEdge R610
  - CPU: Intel Xeon X5670, 2.93 GHz
  - Number of chips 2, cores 12, threads/core 2
  - Performance metric 42.5 base, 45.8 peak
- Dell Inc., PowerEdge M905
  - CPU: AMD Opteron 8381 HE, 2.50 GHz
  - Number of chips 4, cores 16, threads/core 1
  - Performance metric 17.4 base, 21.5 peak
8
Other Benchmarks
- LINPACK is a numerically intensive floating point linear system (Ax = b) program used for benchmarking supercomputers.
- SPECpower_ssj2008 measures power and performance of a computer system.
- The initial benchmark addresses the performance of server-side Java; additional workloads are planned.
- http://www.spec.org/benchmarks.html#power
9
Second Quarter 2010 SPECpower_ssj2008 Results
http://www.spec.org/power_ssj2008/results/res2010q2/
- Apr 7, 2010: Hewlett-Packard ProLiant DL385 G7
  - CPU: AMD Opteron 6174, 2.2 GHz
  - Number of chips 2, cores 12, threads/core 2
  - Total memory 16 GB
  - ssj operations @ 100%: 888,819
  - Average power @ 100%: 271 W
  - Average power @ active idle: 101 W
  - Overall ssj operations per watt: 2,355
10
Second Quarter 2010 SPECpower_ssj2008 Results
http://www.spec.org/power_ssj2008/results/res2010q2/
- May 19, 2010: Dell Inc., PowerEdge R610
  - CPU: Intel Xeon X5670, 2.93 GHz
  - Number of chips 2, cores 12, threads 2
  - Total memory 12 GB
  - ssj operations @ 100%: 914,076
  - Average power @ 100%: 244 W
  - Average power @ active idle: 62.3 W
  - Overall ssj operations per watt: 2,938
11
Energy SPEC Benchmarks
Energy efficiency mode: Besides the execution time, energy efficiency of SPEC benchmark programs is also measured. Energy efficiency of a benchmark program is given by:
$$\text{Energy efficiency} = \frac{1/(\text{Execution time})}{\text{Average power}}$$
D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 4th Edition, Morgan Kaufmann Publishers (Elsevier), 2009.
12
Energy Efficiency
Efficiency averaged on n benchmark programs:
$$\text{Efficiency} = \left( \prod_{i=1}^{n} \text{Efficiency}_i \right)^{1/n}$$
where $\text{Efficiency}_i$ is the efficiency for program i.
Relative efficiency:
$$\text{Relative efficiency} = \frac{\text{Efficiency of a computer}}{\text{Efficiency of reference computer}}$$
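As a rough sketch of how these definitions combine, the snippet below computes a per-program efficiency as (1/execution time)/average power, averages with a geometric mean, and divides by the same figure for a reference machine. All function names and the measurement values are assumptions made for illustration only.

```python
import math


def energy_efficiency(execution_time_s, average_power_w):
    """(1 / execution time) / average power; equals 1 / energy in joules."""
    return (1.0 / execution_time_s) / average_power_w


def mean_efficiency(efficiencies):
    """Geometric mean over the n benchmark programs."""
    efficiencies = list(efficiencies)
    return math.exp(sum(math.log(e) for e in efficiencies) / len(efficiencies))


def relative_efficiency(machine, reference):
    """Mean efficiency of a machine divided by that of the reference machine."""
    return mean_efficiency(machine) / mean_efficiency(reference)


# Hypothetical (execution time in s, average power in W) per benchmark program.
machine_runs = [(10.0, 40.0), (20.0, 35.0)]
reference_runs = [(30.0, 60.0), (50.0, 55.0)]
machine_eff = [energy_efficiency(t, p) for t, p in machine_runs]
reference_eff = [energy_efficiency(t, p) for t, p in reference_runs]
print(f"relative efficiency: {relative_efficiency(machine_eff, reference_eff):.2f}")
```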
13
SPEC2000 Relative Energy Efficiency
[Figure: relative energy efficiency for three operating modes: always max. clock; laptop adaptive clock; min. power, min. clock.]
14
Voltage Scaling
- Dynamic: Reduce voltage and frequency during idle or low activity periods.
- Static: Clustered voltage scaling
  - Logic on non-critical paths given lower voltage.
  - 47% power reduction with 10% area increase reported.
  - M. Igarashi et al., “Clustered Voltage Scaling Techniques for Low-Power Design,” Proc. IEEE Symp. Low Power Design, 1997.
15
Processor Utilization
Throughput = Operations / second
[Figure: throughput versus time, with regions labeled compute-intensive processes (maximum throughput), low throughput (background) processes, and system idle.]
16
Examples of Processes
- Compute-intensive: spreadsheet, spelling check, video decoding, scientific computing.
- Low throughput: data entry, screen updates, low bandwidth I/O data transfer.
- Idle: no computation, no expected output.
17
Effects of Voltage Reduction
- Voltage reduction increases delay, decreases throughput:
  - Slow reduction in throughput at first
  - Rapid reduction in throughput for $V_{DD} \le V_{th}$
  - Time per operation (TPO) increases
- Voltage reduction continues to reduce power consumption:
  - Energy per operation (EPO) = Power × TPO
18
Energy per Operation (EPO)
[Figure: power, TPO, and EPO (vertical axis 0.0 to 1.0) versus $V_{DD}/V_{th}$ (1 to 5).]
19
Dynamic Voltage and Clock

| Throughput | Time in fast mode | Time in slow mode | Time in idle mode | Battery life |
|---|---|---|---|---|
| Always full speed | 10% | 0% | 90% | 1 hr |
| Sometimes full speed | 1% | 90% | 9% | 5.3 hrs |
| Rarely full speed | 0.1% | 99% | 0.9% | 9.2 hrs |

T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002, pp. 35-36.
20
Example: Find Minimum Energy Mode
- Processor data (rated operation):
  - 2 GHz clock
  - 1.5 volt supply voltage
  - 0.5 volt threshold voltage
- Power consumption:
  - 50 watts dynamic power
  - 50 watts static power
- Maximum clock frequency for V volt supply (alpha-power law): $f \propto (V - V_{TH})/V$
21
Alpha-Power Law Model
Variation of delay with supply voltage:
$$\text{delay} \propto \frac{V_{DD}}{(V_{DD} - V_{TH})^{\alpha}}$$
- $V_{TH}$ = threshold voltage
- $\alpha$ = 1 for short-channel devices, ≈ 2 for long-channel devices

References:
- T. Sakurai and A. R. Newton, “Delay analysis of series-connected MOSFET circuits,” IEEE Journal of Solid-State Circuits, Vol. 26, pp. 122–131, Feb. 1991.
- T. Sakurai and A. R. Newton, “A simple MOSFET model for circuit analysis,” IEEE Transactions on Electron Devices, Vol. 38, No. 4, pp. 887–894, Apr. 1991.
- T. Sakurai, “High-speed circuit design with scaled-down MOSFETs and low supply voltage (invited),” Proc. IEEE ISCAS, pp. 1487–1490, Chicago, May 1993.
- T. Sakurai, “Alpha-Power Law MOS Model,” IEEE Solid-State Circuits Society Newsletter, Vol. 9, No. 4, pp. 4–5, Oct. 2004.
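To get a feel for the model, the sketch below evaluates delay at a reduced supply relative to delay at a nominal supply. The function name and the sample voltages are illustrative assumptions; only the proportionality itself comes from the slide. With α = 1 the inverse of this delay gives the frequency relation f ∝ (V − V_TH)/V used in the minimum-energy example.

```python
def relative_delay(vdd, vth, alpha, vdd_nominal):
    """Delay at vdd divided by delay at vdd_nominal under the alpha-power law."""
    if vdd <= vth or vdd_nominal <= vth:
        raise ValueError("the model applies only for supply voltages above threshold")

    def delay(v):
        # Proportionality constant omitted; it cancels in the ratio.
        return v / (v - vth) ** alpha

    return delay(vdd) / delay(vdd_nominal)


# Illustrative values: 0.5 V threshold, 1.5 V nominal supply, 0.75 V reduced supply.
for alpha in (1.0, 2.0):
    slowdown = relative_delay(0.75, 0.5, alpha, 1.5)
    print(f"alpha = {alpha}: delay grows by a factor of {slowdown:.1f}")
```

Halving the supply in this sketch doubles the delay for α = 1 but increases it eightfold for α = 2, which shows how strongly the exponent shapes the voltage-speed trade-off.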
22
Example Cont.
Dynamic power:
$$P_d = C V^2 f = C (1.5)^2 \times 2 \times 10^9 = 50\ \text{W}$$
$C = 11.11$ nF, capacitance switching per cycle
$$P_d = 11.11\, V^2 f \ \text{W (f in GHz)}$$
Dynamic energy per cycle:
$$E_d = P_d / f = 11.11\, V^2 \ \text{nJ}$$
23
Example Cont.
Clock frequency:
$$f = k\,\frac{V - V_{TH}}{V} = k\,\frac{1.5 - 0.5}{1.5} = 2\ \text{GHz}$$
$k = 3$ GHz, a proportionality constant
$$f = \frac{3\,(V - 0.5)}{V}\ \text{GHz}$$
24
Example Cont.
Static power:
$$P_s = k' V^2 = k' (1.5)^2 = 50\ \text{W}$$
$k' = 22.22$ mho, total leakage conductance
$$P_s = 22.22\, V^2 \ \text{W}$$
Static energy per cycle:
$$E_s = P_s / f = \frac{22.22\, V^3}{3\,(V - 0.5)} = \frac{7.41\, V^3}{V - 0.5}\ \text{nJ}$$
25
Example Cont.
Total energy per cycle:
$$E = E_d + E_s = 11.11\, V^2 + \frac{7.41\, V^3}{V - 0.5}$$
To minimize E, set $\partial E/\partial V = 0$:
$$22.22\, V + \frac{7.41\, V^2 (2V - 1.5)}{(V - 0.5)^2} = 0$$
Multiplying by $(V - 0.5)^2/(7.41\, V)$ gives
$$5V^2 - 4.5V + 0.75 = 0$$
Solutions of the quadratic equation:
V = 0.679 volt, 0.221 volt
Discard the second solution, which is lower than the threshold voltage of 0.5 volt.
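The result can be cross-checked numerically. The sketch below rebuilds the constants from the rated data and minimizes energy per cycle with a ternary search instead of solving the quadratic; the search routine and all names are choices made for this illustration, not part of the lecture.

```python
V_RATED = 1.5        # volts
F_RATED_GHZ = 2.0    # GHz
V_TH = 0.5           # volts
P_DYNAMIC_W = 50.0   # watts at rated operation
P_STATIC_W = 50.0    # watts at rated operation

# Constants derived from rated operation (W / (V^2 * GHz) gives nF).
C_NF = P_DYNAMIC_W / (V_RATED ** 2 * F_RATED_GHZ)        # about 11.11 nF
K_GHZ = F_RATED_GHZ * V_RATED / (V_RATED - V_TH)         # 3 GHz
G_MHO = P_STATIC_W / V_RATED ** 2                        # about 22.22 mho


def frequency_ghz(v):
    """Maximum clock frequency at supply v (alpha-power law with alpha = 1)."""
    return K_GHZ * (v - V_TH) / v


def energy_per_cycle_nj(v):
    """Dynamic plus static energy per cycle in nJ (watts divided by GHz)."""
    dynamic = C_NF * v ** 2
    static = G_MHO * v ** 2 / frequency_ghz(v)
    return dynamic + static


def minimize(fn, lo, hi, iterations=200):
    """Ternary search for the minimum of a unimodal function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if fn(m1) < fn(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0


v_opt = minimize(energy_per_cycle_nj, V_TH + 1e-6, V_RATED)
print(f"V = {v_opt:.3f} V, f = {frequency_ghz(v_opt) * 1000:.0f} MHz, "
      f"E = {energy_per_cycle_nj(v_opt):.2f} nJ/cycle")
```

The search is restricted to supplies above the threshold voltage, where the frequency model is valid and the energy curve has a single minimum.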
26
Example: Result

| | Rated mode | Low energy mode | Reduction (%) |
|---|---|---|---|
| Voltage | 1.5 V | 0.679 V | 54.7% |
| Clock frequency | 2 GHz | 791 MHz | 60% |
| Dynamic energy/cycle | 25.00 nJ | 5.12 nJ | 79.52% |
| Static energy/cycle | 25.00 nJ | 12.96 nJ | 48.16% |
| Total energy/cycle | 50.0 nJ | 18.08 nJ | 63.84% |
| Dynamic power | 50.0 W | 4.05 W | 91.90% |
| Static power | 50.0 W | 10.25 W | 79.50% |
| Total power | 100.0 W | 14.20 W | 85.80% |
27
Cycle Efficiency
- Cycle efficiency is a rating similar to the maximum clock frequency rating.
- Analogy:
  - Cycle efficiency is similar to miles per gallon (mpg)
  - Maximum clock frequency is similar to miles per hour (mph)
- Reference: A. Shinde and V. D. Agrawal, “Managing Performance and Efficiency of a Processor,” Proc. 45th IEEE Southeastern Symp. System Theory, March 2013.
28
Performance in Time
Performance is measured with respect to a program.
$$\text{Performance} = \frac{1}{\text{Execution time}}$$
D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.
29
Performance in Energy (Efficiency)
Efficiency is measured with respect to a program.
Efficiency = [equation not recovered from the slide]
D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.
30
Two Performances
- Time performance
- Energy performance
D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, Fourth Edition, San Francisco, California: Morgan Kaufmann Publishers, Inc., 2008.
31
Time Performance
- Speed of a processor is measured in cycles per second or clock frequency (f).
- Clock period (1/f) is the time per cycle.
- Execution time of a program using C clock cycles = C/f
- Time performance = 1/(execution time) = f/C
32
Energy Performance
- Energy efficiency of a processor may be measured in cycles per joule or cycle efficiency (η).
- 1/η is energy per cycle (EPC).
- Energy dissipated by a program using C clock cycles = C × EPC = C/η
- Energy performance = η/C
33
Characterizing Device Technology Speed and Efficiency
- Consider 90 nm CMOS technology.
- Use predictive technology model (PTM).
- Example circuit: Eight-bit ripple carry adder.
- Nominal voltage = 1.2 volts.
- Simulation for varying operating conditions (VDD = 100 mV through 1.2 V) using SPICE:
  - With random vectors for energy per cycle (EPC = 1/η).
  - With critical path vectors for clock period (1/f).
- Reference: W. Zhao and Y. Cao, “New Generation of Predictive Technology Model for Sub-45nm Early Design Exploration,” IEEE Trans. Electron Devices, vol. 53, no. 11, pp. 2816–2823, 2006.
34
Energy per Cycle of 8-Bit Adder
[Figure: energy per cycle of the 8-bit adder.]
K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.
35
Cycle Time of 8-Bit Adder
[Figure: cycle time of the 8-bit adder.]
K. Kim, “Ultra Low Power CMOS Design,” PhD Dissertation, Auburn University, Dept. of ECE, Auburn, Alabama, May 2011.
36
Pentium M processor
- Published data: H. Hanson, K. Rajamani, S. Keckler, F. Rawson, S. Ghiasi, J. Rubio, “Thermal Response to DVFS: Analysis with an Intel Pentium M,” Proc. International Symp. Low Power Electronics and Design, 2007, pp. 219-224.
- VDD = 1.2 V
- Maximum clock rate = 1.8 GHz
- Critical path delay, td = 1/(1.8 GHz) = 555.56 ps
- Power consumption = 120 W
- EPC = 120 W/(1.8 GHz) = 66.67 nJ
37
Cycle Efficiency and Frequency
38
Example
For a program that executes in 1.8 billion clock cycles.

| Voltage VDD | Frequency f | Cycle efficiency η | Execution time (seconds) | Total energy consumed | Power f/η |
|---|---|---|---|---|---|
| 1.2 V | 1800 megacycles/s | 15 megacycles/joule | 1.0 | 120 joules | 120 W |
| 0.6 V | 277 megacycles/s | 70 megacycles/joule | 6.5 | 25.7 joules | 3.96 W |
| 200 mV | 54.5 megacycles/s | 660 megacycles/joule | 33 | 2.73 joules | 0.083 W |
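The table entries follow directly from the two ratings: execution time is C/f, energy is C/η, and power is f/η. A short sketch that recomputes them (the function and variable names are illustrative assumptions):

```python
def run_program(cycles, f_hz, eta_cycles_per_joule):
    """Return (execution time in s, energy in J, power in W) for a program."""
    time_s = cycles / f_hz
    energy_j = cycles / eta_cycles_per_joule
    power_w = f_hz / eta_cycles_per_joule
    return time_s, energy_j, power_w


CYCLES = 1.8e9  # program length in clock cycles

# Operating points from the table: supply voltage -> (f in Hz, eta in cycles/J).
operating_points = {
    "1.2 V": (1800e6, 15e6),
    "0.6 V": (277e6, 70e6),
    "200 mV": (54.5e6, 660e6),
}

for vdd, (f_hz, eta) in operating_points.items():
    time_s, energy_j, power_w = run_program(CYCLES, f_hz, eta)
    print(f"{vdd:>6}: {time_s:6.1f} s, {energy_j:7.2f} J, {power_w:8.3f} W")
```

Lowering the supply from 1.2 V to 200 mV in this data trades a 33-fold longer run time for roughly 44 times less energy.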
39
Cycle Efficiency
- New energy performance rating: cycle efficiency η; unit is cycles per joule.
- Clock frequency f in cycles per second is a similar rating for time performance.
- Similarity to other popular ratings:
  - η → mpg
  - f → mph
- Two ratings allow effective time and energy management of an electronic system.
40
Problem of Process Variation in Nanometer Technologies
[Figure: number of chips versus threshold voltage (lower V_th, V_th, higher V_th), with a power specification and a clock specification. Labels: yield loss due to high leakage; yield loss due to slow speed; higher voltage operation; lower voltage operation; nominal voltage.]
From a presentation: Power Reduction using LongRun2 in Transmeta’s Efficeon Processor, by D. Ditzel, May 17, 2006.
41
Clock Distribution H-Tree
[Figure: H-tree clock distribution network.]
- Fanout, λ = 4
- Tree depth, $s = \log_{\lambda} N$
- No. of flip-flops = N
42
Clock Network Power
$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots = C_L V_{DD}^2 f \sum_{n=0}^{\text{stages}-1} \frac{1}{\lambda^n}$$
where
- $C_L$ = total load capacitance of N flip-flops (a flip-flop is assumed similar to a clock buffer)
- $\lambda$ = constant fanout at each stage in the distribution network

Clock consumes about 40% of total processor power, because:
(1) The clock is always active.
(2) It makes two transitions per cycle (α = 2).
Clock gating is useful: inhibit the clock to unused blocks.
43
Upper Bound on Clock Power
$$P_{clk} = C_L V_{DD}^2 f + \frac{C_L V_{DD}^2 f}{\lambda} + \frac{C_L V_{DD}^2 f}{\lambda^2} + \cdots \le C_L V_{DD}^2 f \sum_{n=0}^{\infty} \frac{1}{\lambda^n} = \frac{C_L V_{DD}^2 f}{1 - 1/\lambda} = C_L V_{DD}^2 f\, \frac{\lambda}{\lambda - 1}$$
$$P_{clk} \le 1.333\, C_L V_{DD}^2 f, \text{ because } \lambda = 4$$
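A quick numeric check of the series and its bound, normalized to $C_L V_{DD}^2 f$. This is only a sketch; the function names are made up for the illustration.

```python
def clock_power_factor(stages, fanout=4):
    """Sum of 1/fanout**n for n = 0 .. stages-1, i.e. P_clk / (C_L * VDD^2 * f)."""
    return sum(1.0 / fanout ** n for n in range(stages))


def clock_power_bound(fanout=4):
    """Limit of the geometric series for an unbounded number of stages."""
    return fanout / (fanout - 1.0)


for stages in range(1, 6):
    print(f"{stages} stage(s): {clock_power_factor(stages):.6f} x C_L VDD^2 f")
print(f"upper bound: {clock_power_bound():.4f} x C_L VDD^2 f")
```

With a fanout of 4 the buffers above the flip-flop level add at most one third to the power of the final stage, so the flip-flop load dominates clock power.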
44
Properties of H-Tree
- Balanced clock skew.
- Small delay and power consumption.
- Requires fine-tuning for complex layout.
45
Clock Power and Delay
- Unit size buffer or inverter delay = d
- Total dynamic power supplied to N flip-flops, $P = C_L V_{DD}^2 f$
- Total power consumption of clock network:

| Flip-flops, N | Clock power per flip-flop | Clock delay |
|---|---|---|
| 1 | P | d |
| 4 | P | 4d |
| 16 | 1.25P | 8d |
| 64 | 1.3125P | 12d |
| 256 | 1.328125P | 16d |
46
Clock Network Examples

| | Alpha 21064 | Alpha 21164 | Alpha 21264 |
|---|---|---|---|
| Technology | 0.75μ CMOS | 0.5μ CMOS | 0.35μ CMOS |
| Frequency (MHz) | 200 | 300 | 600 |
| Total capacitance | 12.5 nF | | Clock gating used. |
| Total power | | | 80 - 110 W |
| Clock load | 3.25 nF | 3.75 nF | |
| Clock power | 40% | 40% (20 W) | |
| Max. clock skew | 200 ps (<10%) | 90 ps | |

D. W. Bailey and B. J. Benschneider, “Clocking Design and Analysis for a 600-MHz Alpha Microprocessor,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627-1633, Nov. 1998.
47
Architecture Level: Pipeline Gating
- A pipeline processor uses speculative execution.
- Incorrect branch prediction results in pipeline stalls and wasted energy.
- Idea: Stop fetching instructions if a branch hazard is expected:
  - If the count (M) of incorrect predictions exceeds a pre-specified number (N), then suspend fetching instructions for some k cycles.
- Ref.: S. Manne, A. Klauser and D. Grunwald, “Pipeline Gating: Speculation Control for Energy Reduction,” Proc. 25th Annual International Symp. Computer Architecture, June 1998.
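The gating rule stated on this slide can be modeled as a small counter. The sketch below is a behavioral toy, not the mechanism of the cited paper: the class name, the method names, and the choice to reset the count M when gating starts are all assumptions, since the slide does not say how M is cleared.

```python
class FetchGate:
    """Suspend instruction fetch for k cycles once M exceeds N."""

    def __init__(self, threshold_n, stall_cycles_k):
        self.threshold_n = threshold_n
        self.stall_cycles_k = stall_cycles_k
        self.mispredictions = 0    # M, count of incorrect predictions
        self.stall_remaining = 0   # cycles of gating still to serve

    def record_misprediction(self):
        """Count one incorrect prediction and start gating if M > N."""
        self.mispredictions += 1
        if self.mispredictions > self.threshold_n:
            self.stall_remaining = self.stall_cycles_k
            self.mispredictions = 0  # assumption: counter restarts after gating

    def may_fetch(self):
        """Call once per cycle; returns False while fetch is suspended."""
        if self.stall_remaining > 0:
            self.stall_remaining -= 1
            return False
        return True


gate = FetchGate(threshold_n=2, stall_cycles_k=3)
for _ in range(3):
    gate.record_misprediction()
print([gate.may_fetch() for _ in range(5)])
```

Larger N or smaller k gates less often and costs less performance, but also saves less of the energy spent on wrong-path instructions.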
48
Slack Scheduling
- Application: Superscalar, out-of-order execution:
  - An instruction is executed as soon as the required data and resources become available.
  - A commit unit reorders the results.
- Delay the completion of instructions whose result is not immediately needed.
- Example of RISC instructions:
  - add r0, r1, r2; (A)
  - sub r3, r4, r5; (B)
  - and r9, r1, r9; (C)
  - or r5, r9, r10; (D)
  - xor r2, r5, r11; (E)

J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.
49
Slack Scheduling Example
[Figure: two schedules of instructions A to E. Slack scheduling: A, BC, D, E. Standard scheduling: ABC, D, E.]
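One way to see which of the five instructions can be delayed is to compute earliest and latest start cycles from their data dependences. The sketch below makes several assumptions that the slides do not state: the first operand is the destination register, every instruction takes one cycle, only read-after-write dependences matter (as with register renaming), and issue width is unlimited. The function names are illustrative.

```python
# (label, destination, sources); destination-first operand order is an assumption.
PROGRAM = [
    ("A", "r0", ("r1", "r2")),   # add r0, r1, r2
    ("B", "r3", ("r4", "r5")),   # sub r3, r4, r5
    ("C", "r9", ("r1", "r9")),   # and r9, r1, r9
    ("D", "r5", ("r9", "r10")),  # or  r5, r9, r10
    ("E", "r2", ("r5", "r11")),  # xor r2, r5, r11
]


def raw_dependencies(program):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer = {}
    deps = {}
    for label, dest, sources in program:
        deps[label] = {last_writer[s] for s in sources if s in last_writer}
        last_writer[dest] = label
    return deps


def schedule_slack(program):
    """Return {label: (earliest cycle, latest cycle, slack)} with unit latency."""
    deps = raw_dependencies(program)
    order = [label for label, _, _ in program]
    asap = {}
    for label in order:
        asap[label] = max((asap[d] + 1 for d in deps[label]), default=0)
    last_cycle = max(asap.values())
    alap = {}
    for label in reversed(order):
        consumers = [c for c in order if label in deps[c]]
        alap[label] = min((alap[c] - 1 for c in consumers), default=last_cycle)
    return {label: (asap[label], alap[label], alap[label] - asap[label])
            for label in order}


for label, (early, late, slack) in schedule_slack(PROGRAM).items():
    print(f"{label}: earliest cycle {early}, latest cycle {late}, slack {slack}")
```

Under these assumptions C, D, and E form the critical chain with zero slack, while A and B could finish up to two cycles later without lengthening the schedule, which makes them candidates for slower, lower-power execution.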
50
Slack Scheduling
[Figure: block diagram with scheduling logic, re-order buffer, slack bit, and low-power execution units.]
51
Power Reduction Example
- Alpha 21064: 200 MHz @ 3.45 V, power dissipation = 26 W
- Reduce voltage to 1.5 V, power (×0.189) = 4.9 W
- Eliminate FP unit, power (×0.33) = 1.6 W
- Scale 0.75μ → 0.35μ, power (×0.5) = 0.8 W
- Reduce clock load, power (×0.75) = 0.6 W
- Reduce frequency 200 → 160 MHz, power (×0.8) = 0.48 W

J. Montanaro et al., “A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703-1714, Nov. 1996.
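The chain of reductions is a running product of scale factors, where the voltage factor is the square of the voltage ratio and the frequency factor is the frequency ratio. A small sketch that reproduces the numbers (the step list and function name are illustrative; the 0.33, 0.5, and 0.75 factors are taken from the slide as given):

```python
START_POWER_W = 26.0  # Alpha 21064 at 200 MHz and 3.45 V

STEPS = [
    ("Reduce voltage 3.45 V -> 1.5 V", (1.5 / 3.45) ** 2),  # dynamic power ~ V^2
    ("Eliminate FP unit", 0.33),
    ("Scale 0.75u -> 0.35u", 0.5),
    ("Reduce clock load", 0.75),
    ("Reduce frequency 200 -> 160 MHz", 160.0 / 200.0),     # dynamic power ~ f
]


def apply_steps(start_power_w, steps):
    """Return [(step name, power after the step in W)] for successive factors."""
    power = start_power_w
    trace = []
    for name, factor in steps:
        power *= factor
        trace.append((name, power))
    return trace


for name, power_w in apply_steps(START_POWER_W, STEPS):
    print(f"{name:<34} {power_w:5.2f} W")
```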
52
Why Asynchronous Design?
- Clock consumes about 40% of total power and limits performance.
- Benefits of asynchronous design:
  - Low power: clock power eliminated.
  - Higher performance: clock speed in a pipeline depends on the slowest stage.
  - Modularity: modules in a clock-less system operate autonomously.
- Hurdles: Design, verification, testing, yield.
53
Clock Power
[Figure: clock power / total power (0.0 to 1.0) versus activity α (0.0 to 0.4), for logic to flip-flop ratios of 0, 5, 10, and 20.]
K. Van Berkel, et al., “Asynchronous Does Not Imply Low Power, But...,” Low-Power CMOS Design, A. P. Chandrakasan and R. Brodersen (Eds.), New York: IEEE Press, 1998, pp. 227-232.
54
Asynchronous Systems
- No clock.
- Self-timed systems:
  - Encoded signals
  - Timing signal
- Signaling protocols:
  - Sender sends a request
  - Receiver acknowledges
55
GALS: Globally Asynchronous, Locally Synchronous
[Figure: synchronous modules with locally generated clocks, connected by self-timed or protocol-driven signals.]
56
AMULET2e (1996)
57
AMULET2e (1996)
- Asynchronous ARM8.
- 0.5 micron CMOS, 6.4 mm × 6.4 mm, 3.3 V.
- 454k transistors (cache 328k, processor core 93k, control and I/O 33k).
- 150 mW at 40 MIPS, similar to sync. ARM8.
- 3 μW in idle state.
- http://apt.cs.manchester.ac.uk/projects/processors/amulet/AMULET2_uP.php
58
U. Manchester, CS Dept.
http://apt.cs.manchester.ac.uk/projects/processors/amulet/AMULET3i_seminar.pdf
Asynchronous logic:
- can be competitive with ‘conventional’ designs
- has particular advantages with low-power and low EMI (think portable systems)
- may be the only solution to some tasks on big chips, especially block interconnections
But:
- designing big systems is a lot of work
- it’s hard to catch up with the big companies
59
References on Async. Design
- David A. Huffman, The Synthesis of Sequential Switching Circuits, MIT, 1953.
- Stephen H. Unger, Asynchronous Sequential Switching Circuits, Wiley-Interscience, 1969.
- Chris J. Myers, Asynchronous Circuit Design, John Wiley & Sons, Inc., 2001.
60
References on Async. Processors
- S. B. Furber, “Asynchronous Design,” Chapter 7 in Low Power Design in Deep Submicron Electronics, W. Nebel and J. Mermet (Editors), Springer, 1997.
- A. J. Martin, M. Nyström and C. G. Wong, “Three Generations of Asynchronous Microprocessors,” Caltech, CS Dept., Pasadena, CA, available from http://www.async.caltech.edu/Pubs/PDF/2003_threegen.pdf
- I. E. Sutherland, "Turing Award: Micropipeline," Comm. ACM, vol. 32, no. 6, pp. 720-738, June 1989. http://www.eng.auburn.edu/~vagrawal/COURSE/READING/ARCH/Micropipeline_sutherland.pdf
- I. E. Sutherland, "The Tyranny of the Clock," Comm. ACM, vol. 55, no. 10, pp. 35-36, Oct 2012. http://www.eng.auburn.edu/~vagrawal/COURSE/READING/ARCH/Sutherland_Tyranny_o_Clock.pdf
61
For More on Microprocessors
- T. D. Burd and R. W. Brodersen, Energy Efficient Microprocessor Design, Springer, 2002.
- R. Graybill and R. Melhem, Power Aware Computing, New York: Plenum Publishers, 2002.
62
Class Project
- Assigned April 6, 2015.
- Clear understanding of the problem expected.
- Conduct to-the-point analysis.
- Reliable (reproducible) data.
- Meaningful conclusions usable by others.
- A readable four to six page report (due on 4/27/15) written and formatted like a technical paper (PDF). Include data but do not attach printouts.