Download presentation
Presentation is loading. Please wait.
1
2/8/06D&T Seminar1 Multi-Core Parallelism for Low- Power Design Vishwani D. Agrawal James J. Danaher Professor Department of Electrical and Computer Engineering Auburn University http://www.eng.auburn.edu/~vagrawal vagrawal@eng.auburn.edu
2
2/8/06D&T Seminar2 Power Consumption of VLSI Chips Why is it a concern?
3
2/8/06D&T Seminar3 SIA Roadmap for Processors (1999) Year199920022005200820112014 Feature size (nm)180130100705035 Logic transistors/cm 2 6.2M18M39M84M180M390M Clock (GHz)1.252.13.56.010.016.9 Chip size (mm 2 )340430520620750900 Power supply (V)1.81.51.20.90.60.5 High-perf. Power (W)90130160170175183 Source: http://www.semichips.orghttp://www.semichips.org
4
2/8/06D&T Seminar4 ISSCC, Feb. 2001, Keynote “Ten years from now, microprocessors will run at 10GHz to 30GHz and be capable of processing 1 trillion operations per second -- about the same number of calculations that the world's fastest supercomputer can perform now. “Unfortunately, if nothing changes these chips will produce as much heat, for their proportional size, as a nuclear reactor....” Patrick P. Gelsinger Senior Vice President General Manager Digital Enterprise Group INTEL CORP.
5
2/8/06D&T Seminar5 VLSI Chip Power Density 4004 8008 8080 8085 8086 286 386 486 Pentium® P6 1 10 100 1000 10000 19701980199020002010 Year Power Density (W/cm 2 ) Hot Plate Nuclear Reactor Rocket Nozzle Sun’s Surface Source: Intel
6
2/8/06D&T Seminar6 Power Dissipation in CMOS Logic (0.25µ) %75%5%20 P total (0→1) = C L V DD 2 + t sc V DD I peak + V DD I leakage CLCL V DD
7
2/8/06D&T Seminar7 Low-Power Datapath Architecture Lower supply voltage –This slows down circuit speed –Use parallel computing to gain the speed back Works well when threshold voltage is also lowered. About 60% reduction in power obtainable. Reference: A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Boston: Kluwer Academic Publishers (Now Springer), 1995.
8
2/8/06D&T Seminar8 A Reference Datapath Combinational logic Output Input Register CK Supply voltage= V ref Total capacitance switched per cycle= C ref Clock frequency= f Power consumption:P ref = C ref V ref 2 f C ref
9
2/8/06D&T Seminar9 A Parallel Architecture Comb. Logic Copy 1 Comb. Logic Copy 2 Comb. Logic Copy N Register N to 1 multiplexer Multiphase Clock gen. and mux control Input Output CK f f/N A copy processes every Nth input, operates at reduced voltage Supply voltage: V N ≤ V 1 = V ref N = Deg. of parallelism
10
2/8/06D&T Seminar10 Control Signals, N = 4 CK Phase 1 Phase 2 Phase 3 Phase 4
11
2/8/06D&T Seminar11 Power P N =P proc + P overhead P proc =N(C inreg + C comb )V N 2 f/N + C outreg V N 2 f =(C inreg + C comb +C outreg )V N 2 f =C ref V N 2 f P overhead =C overhead V N 2 f≈ δC ref (N – 1)V N 2 f P N = [1 + δ(N – 1)]C ref V N 2 f P N V N 2 ──= [1 + δ(N – 1)] ─── P 1 V ref 2
12
2/8/06D&T Seminar12 Voltage vs. Speed C L V ref C L V ref Delay of a gate, T ≈ ──── = ────────── Ik(W/L)(V ref – V t ) 2 whereI is saturation current k is a technology parameter W/L is width to length ratio of transistor V t is threshold voltage Supply voltage Normalized gate delay, T 4.0 3.0 2.0 1.0 0.0 VtVt V ref =5VV 2 =2.9V N=1 N=2 V3V3 N=3 1.2μ CMOS Voltage reduction slows down as we get closer to V t
13
2/8/06D&T Seminar13 Increasing Multiprocessing P N /P 1 1 2 3 4 5 6 7 8 9 10 11 12 1.0 0.8 0.6 0.4 0.2 0.0 V t =0V (extreme case) V t =0.4V V t =0.8V N 1.2μ CMOS, V ref = 5V
14
2/8/06D&T Seminar14 Extreme Cases: V t = 0 Delay, T α 1/ V ref For N processing elements, delay = NT → V N = V ref /N P N 1 ──=[1+ δ (N – 1)] ──→1/N P 1 N 2 For negligible overhead, δ→0 P N 1 ──≈── P 1 N 2 For V t > 0, power reduction is less and there will be an optimum value of N.
15
2/8/06D&T Seminar15 Example: Multiplier Core Specification: 200MHz Clock 15W dissipation @ 5V Low voltage operation, V DD ≥ 1.5 volts (V DD – 0.5) 2 Relative clock rate = ─────── 20.25 Problem: Integrate multiplier core on a SOC Power budget for multiplier ~ 5W
16
2/8/06D&T Seminar16 A Multicore Design Multiplier Core 1 Multiplier Core 5 Reg 5 to 1 mux Multiphase Clock gen. and mux control Input Output 200MHz CK 200MHz 40MHz Multiplier Core 2 Core clock frequency = 200/N, N should divide 200.
17
2/8/06D&T Seminar17 How Many Cores? For N cores: clock frequency = 200/N MHz Supply voltage, V DDN = 0.5 + (20.25/N) 1/2 Volts Assuming 10% overhead per core, V DDN Power dissipation =15 [1 + 0.1(N – 1)] ( ─── ) 2 watts 5
18
2/8/06D&T Seminar18 Design Tradeoffs Number of cores N Clock (MHz) Core supply VDDN (Volts) Total Power (Watts) 12005.0015.0 21003.688.94 4502.755.90 5402.515.29 8252.104.50
19
2/8/06D&T Seminar19 Power Reduction in Processors Just about everything is used. Hardware methods: Voltage reduction for dynamic power Dual-threshold devices for leakage reduction Clock gating, frequency reduction Sleep mode Architecture: Instruction set hardware organization Software methods
20
2/8/06D&T Seminar20 Parallel Architecture Processor f f/2 Processor f/2 f Input Output Input Output Capacitance = C Voltage = V Frequency = f Power = CV 2 f Capacitance = 2.2C Voltage = 0.6V Frequency = 0.5f Power = 0.396CV 2 f
21
2/8/06D&T Seminar21 Pipeline Architecture Processor f Input Output Register ½ Proc. f InputOutput Register ½ Proc. Register Capacitance = C Voltage = V Frequency = f Power = CV 2 f Capacitance = 1.2C Voltage = 0.6V Frequency = f Power = 0.432CV 2 f
22
2/8/06D&T Seminar22 Approximate Trend n-parallel proc. n-stage pipeline proc. CapacitancenCC VoltageV/n Frequencyf/nf PowerCV 2 f/n 2 Chip area n times10-20% increase G. K. Yeap, Practical Low Power Digital VLSI Design, Boston: Kluwer Academic Publishers, 1998.
23
2/8/06D&T Seminar23 Multicore Processors 200020042008 Performance based on SPECint2000 and SPECfp2000 benchmarks Multicore Single core Computer, May 2005, p. 12
24
2/8/06D&T Seminar24 Multicore Processors D. Geer, “Chip Makers Turn to Multicore Processors,” Computer, vol. 38, no. 5, pp. 11-13, May 2005. A. Jerraya, H. Tenhunen and W. Wolf, “Multiprocessor Systems-on-Chips,” Computer, vol. 5, no. 7, pp. 36-40, July 2005; this special issue contains three more articles on multicore processors. S. K. Moore, “Winner Multimedia Monster – Cell’s Nine Processors Make It a Supercomputer on a Chip,” IEEE Spectrum, vol. 43. no. 1, pp. 20-23, January 2006.
25
2/8/06D&T Seminar25 Cell - Cell Broadband Engine Architecture L to R Atsushi Kameyama, Toshiba James Kahle, IBM Masakazu Suzoki, Sony © IEEE Spectrum, January 2006 Nine-processor chip: 192 Gflops
26
2/8/06D&T Seminar26 Cell’s Nine-Processor Chip © IEEE Spectrum, January 2006 Eight Identical Processors f = 5.6GHz (max) 44.8 Gflops
27
2/8/06D&T Seminar27 ?
28
2/8/06D&T Seminar28 Amdahl’s Law S P = 1 – S 01time 1 Speedup =───────── S + (1 – S)/ N Where N =number of parallel processors Example:S = 0.6, N = 10, Speedup = 1.56 S = 0.6, N = ∞, Speedup = 1.67 Gene Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities,” AFIPS Conference Proceedings, (30), pp. 483-485, 1967.
29
2/8/06D&T Seminar29 Question Can we find a multi-processing law –for power reduction, or –for performance per watt
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.