Performance, Energy and Thermal Considerations of SMT and CMP architectures Yingmin Li, David Brooks, Zhigang Hu, Kevin Skadron Dept. of Computer Science,


1 Performance, Energy and Thermal Considerations of SMT and CMP Architectures. Yingmin Li, David Brooks, Zhigang Hu, Kevin Skadron. Dept. of Computer Science, University of Virginia; Division of Engineering and Applied Sciences, Harvard University; IBM T.J. Watson Research Center.

2 © 2005, Yingmin Li. Motivation: Future trends call for multi-core and multi-threaded architectures. Which is better: lots of tiny speed demons or fewer brainiacs? Which is more valuable: more L2 cache or additional cores? The performance, power, and thermal properties of multi-core vs. multi-threaded architectures are not well understood. [Figure, not to scale: single-thread performance vs. number of threads per chip (axis ticks 1, 2, 4), placing an in-order processor, an out-of-order processor, a CMP with out-of-order cores, a CMP with out-of-order SMT cores, and Sun Niagara; annotated "Equal performance curve?"]

3 Scope of this Study: Equal-area comparison between SMT and CMP extensions of an Apple G5-like core. Note: 1 MB of L2 cache is roughly equal in area to one G5-like core. [Figure labels: single-threaded, SMT, CMP.]

4 Outline: Modeling and model validation. SMT vs. CMP performance, power, and thermal analysis without dynamic thermal management (DTM). SMT vs. CMP performance, power, and thermal analysis with DTM. Conclusions and future work.

5 Performance Sensitivity to L2 Size: CMP L2 size = SMT L2 size - 1 MB.
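The equal-area rule on this slide, together with the earlier note that 1 MB of L2 occupies roughly the area of one G5-like core, can be sketched as simple bookkeeping. The function name and the example L2 sizes are illustrative assumptions, not values from the study, and the sketch ignores any extra area that SMT support itself adds to a core.

```python
# Rough area bookkeeping, assuming (per the slides) that 1 MB of L2
# occupies about the same area as one G5-like core.
CORE_AREA_IN_MB_OF_L2 = 1.0  # the slides only say "roughly"


def cmp_l2_size_mb(smt_l2_mb: float, extra_cores: int = 1) -> float:
    """L2 size left for a CMP that adds `extra_cores` cores in the same area."""
    remaining = smt_l2_mb - extra_cores * CORE_AREA_IN_MB_OF_L2
    if remaining < 0:
        raise ValueError("not enough L2 area to trade for the extra cores")
    return remaining


if __name__ == "__main__":
    # Hypothetical SMT L2 sizes, chosen only for illustration.
    for smt_l2 in (2.0, 3.0, 4.0):
        print(f"SMT L2 = {smt_l2} MB -> CMP L2 = {cmp_l2_size_mb(smt_l2)} MB")
```

With one extra core, this reproduces the slide's relation: the CMP configuration gets 1 MB less L2 than the SMT configuration it is compared against.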

6 Modeling and Validation: Performance: Turandot with SMT and CMP augmentations, validated against a Power4 pre-RTL model. Power: PowerTimer with SMT and CMP augmentations, validated against CPAM power data extracted from circuits. Temperature: HotSpot from UVA, integrated with Turandot/PowerTimer and validated with test chips at UVA.

7 Turandot/PowerTimer Simulation Framework: Supports SMT and CMP. Runs on AIX/PowerPC and Linux/Intel platforms. PowerTimer is based on CPAM data extracted from circuits. See the MICRO’02 tutorial by Zhigang Hu and David Brooks for details.

8 HotSpot Temperature Model: Models all parts along both the primary and secondary heat transfer paths, at arbitrary granularities. Fast and accurate. Essentially a lumped thermal R-C network.
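To make the "lumped thermal R-C network" idea concrete, here is a minimal single-node sketch: one thermal resistance to ambient and one thermal capacitance, integrated with forward Euler. It is not HotSpot's implementation, which builds a network of many such nodes from the floorplan and package; every numeric value and name below is a hypothetical chosen for illustration.

```python
# Single-node lumped thermal R-C model: C * dT/dt = P - (T - T_amb) / R.
# All numeric values are hypothetical and chosen only for illustration.


def simulate_rc(power_w, r_k_per_w, c_j_per_k, t_amb_c, t0_c, dt_s, steps):
    """Integrate the node temperature with forward Euler; return the trace."""
    temps = [t0_c]
    t = t0_c
    for _ in range(steps):
        # Net heat flow into the node divided by its thermal capacitance.
        dt_dtime = (power_w - (t - t_amb_c) / r_k_per_w) / c_j_per_k
        t += dt_s * dt_dtime
        temps.append(t)
    return temps


def steady_state(power_w, r_k_per_w, t_amb_c):
    """Steady-state temperature, where heat in equals heat out."""
    return t_amb_c + power_w * r_k_per_w


if __name__ == "__main__":
    trace = simulate_rc(power_w=20.0, r_k_per_w=0.5, c_j_per_k=10.0,
                        t_amb_c=45.0, t0_c=45.0, dt_s=0.01, steps=10000)
    print("final simulated temperature:", trace[-1])
    print("analytic steady state:", steady_state(20.0, 0.5, 45.0))
```

The time constant is R times C (5 s with these numbers), so the step size must stay well below it for forward Euler to remain stable.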

9 Peak Temperature of the Hottest Spot for SMT and CMP: Three heat-up mechanisms: unit self-heating, determined by the power density of the unit; global heating through the TIM (thermal interface material) and spreader; lateral thermal coupling between neighboring units.

10 Heat Flow of Global Heat-up

11 Illustration (global heat-up of CMP vs. local heat-up of SMT)

12 Temperature Trend with Technology Evolution: The impact of SMT's increased utilization becomes muted. The L2 cache tends to be much cooler than the core. Leakage depends exponentially on temperature.
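The exponential temperature dependence of leakage can be illustrated with a commonly used empirical form, P_leak(T) = P_ref * exp(beta * (T - T_ref)). The slides do not give the model or constants used in the study, so the functional form, the reference point, and the coefficient `beta_per_k` below are all assumptions for illustration.

```python
import math


def leakage_power(temp_c, p_ref_w=1.0, t_ref_c=85.0, beta_per_k=0.02):
    """Leakage power at temp_c, scaled exponentially from a reference point.

    p_ref_w, t_ref_c and beta_per_k are hypothetical placeholders, not
    values taken from the study.
    """
    return p_ref_w * math.exp(beta_per_k * (temp_c - t_ref_c))


if __name__ == "__main__":
    for temp in (65.0, 85.0, 105.0):
        print(f"{temp:.0f} C -> {leakage_power(temp):.3f} W")
```

Because the dependence is exponential, each fixed temperature increment multiplies leakage by the same factor, which is why hotter designs pay a growing power penalty as technology becomes leakier.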

13 SMT vs. CMP Performance and Power Efficiency Analysis (without DTM): SMT is superior for memory-bound (high L2 cache miss rate) benchmarks, while CMP is superior for non-memory-bound benchmarks. [Figure panels: compute-bound, memory-bound.]

14 The Impact of Changing L2 Size, Examples: MCF+MCF stays memory-bound when the L2 size changes. MCF+VPR changes from memory-bound to non-memory-bound when the L2 size changes.

15 SMT vs. CMP Performance with DTM: Localized DTM methods favor SMT, while global DTM methods favor CMP. Global techniques: global DVS (dynamic voltage scaling), fetch throttling. Local techniques: rename throttling, register file throttling (ideal). [Figure panels: compute-bound, memory-bound.]

16 SMT Energy Efficiency with DTM: In some cases, localized methods yield a better energy-delay product than global methods. [Figure panels: compute-bound, memory-bound.]
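For readers unfamiliar with the metric, the energy-delay product multiplies the energy a run consumes by the time it takes, so lower is better and a slowdown is penalized even if it saves energy. The sketch below computes it, with an optional exponent for the ED^2 variant; the two DTM results it compares are made-up numbers, not data from the study.

```python
def energy_delay_product(energy_j, delay_s, delay_exponent=1):
    """Return energy * delay**delay_exponent (exponent 2 gives ED^2)."""
    if energy_j < 0 or delay_s < 0:
        raise ValueError("energy and delay must be non-negative")
    return energy_j * delay_s ** delay_exponent


if __name__ == "__main__":
    # Hypothetical (energy in J, delay in s) pairs for two DTM schemes.
    results = {"local": (9.0, 2.2), "global": (10.0, 2.0)}
    for name, (energy, delay) in results.items():
        edp = energy_delay_product(energy, delay)
        ed2p = energy_delay_product(energy, delay, delay_exponent=2)
        print(f"{name}: EDP = {edp:.2f} J*s, ED^2P = {ed2p:.2f} J*s^2")
```

Raising the delay exponent weights performance more heavily, so a scheme that wins on the energy-delay product can still lose on ED^2.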

17 CMP Energy Efficiency with DTM: Localized methods are inferior for CMP in terms of the energy and energy-delay product metrics. [Figure panels: compute-bound, memory-bound.]

18 Conclusions: With the same chip area and an Apple G5-like architecture, SMT performs better than CMP for memory-bound benchmarks, while CMP performs better than SMT for non-memory-bound benchmarks. The thermal heating effects are quite different for CMP and SMT. CMP machines are clearly hotter than SMT machines with leaky technology. Different DTM techniques favor different architectures.

19 Future Work: Consider significantly larger amounts of thread-level parallelism and hybrids between CMP and SMT cores. Study the impact of varying core complexity on the performance of SMT and CMP, and explore a wider range of design options, such as SMT fetch policies. Explore server-oriented workloads.

20 ST energy efficiency with DTM

21 Is maximum temperature a good metric for thermal efficiency?

22 Temperature variation of architectural units with DTM applied

23 Motivation: With the rapid increase in chip power density, thermal concerns are increasingly important. Both SMT and CMP can lead to serious thermal problems. Very little research has investigated and compared the thermal properties of SMT and CMP.

