ISPASS 2011
Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science, University of Virginia
Motivation
- The number of cores doubles every 18 months
- Expected: performance scales with the number of cores
- One of the bottlenecks is shared-resource contention
- For multi-threaded workloads, contention is unavoidable
- To reduce contention, it is necessary to understand where and how the contention is created
Shared-Resource Contention in Chip-Multiprocessors
[Figure: Intel Quad Core Q9550 — cores C0–C3 with private L1 caches, shared L2 caches, and memory reached over the front-side bus; Application 1 and Application 2 threads are mapped onto the cores]
Scenario 1: Multi-threaded application with a co-runner
[Figure: Application 1 and Application 2 threads sharing cores C0–C3, the L2 cache, and memory]
Scenario 2: Multi-threaded application without a co-runner
[Figure: only one application's threads run on cores C0–C3]
Shared-Resource Contention
- Intra-application contention: contention among threads from the same application (no co-runners)
- Inter-application contention: contention among threads from the co-running application
Contributions
- A general methodology to evaluate a multi-threaded application's performance under:
  - Intra-application contention
  - Inter-application contention
  - Contention in the memory-hierarchy shared resources
- Characterizing applications facilitates a better understanding of each application's resource sensitivity
- Thorough performance analysis and characterization of the multi-threaded PARSEC benchmarks
Outline
- Motivation
- Contributions
- Methodology
- Measuring intra-application contention
- Measuring inter-application contention
- Related Work
- Summary
Methodology
- Designed to measure both intra- and inter-application contention for a targeted shared resource: L1-cache, L2-cache, or front-side bus (FSB)
- Each application is run in two configurations:
  - Baseline: threads do not share the targeted resource
  - Contention: threads share the targeted resource
- Requires multiple instances of the targeted resource
- Contention is determined by comparing performance across the two configurations (gathered from hardware performance counters)
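The baseline and contention configurations above amount to pinning each application's threads to chosen cores. A minimal sketch of how such placements could be set up on Linux with `taskset`; the binary names `./app1` and `./app2` and the core numbering (C0/C1 behind one L2 cache, C2/C3 behind the other) are illustrative assumptions, not the exact assignments from the talk:

```python
def pinned_cmd(cores, cmd):
    """Build a `taskset` command line that pins `cmd` to the given cores."""
    core_list = ",".join(str(c) for c in sorted(cores))
    return f"taskset -c {core_list} {cmd}"

# Hypothetical inter-application L2-cache experiment, assuming cores 0/1
# share one L2 cache and cores 2/3 share the other:
#   baseline   - each application's threads behind their own L2 cache
#   contention - both applications' threads share each L2 cache
baseline   = [pinned_cmd([0, 1], "./app1"), pinned_cmd([2, 3], "./app2")]
contention = [pinned_cmd([0, 2], "./app1"), pinned_cmd([1, 3], "./app2")]
```

The same helper covers the other targeted resources; only the core lists change to place threads on or off the shared L1 domain, L2 domain, or FSB.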
Outline
- Motivation
- Contributions
- Methodology
- Measuring intra-application contention (see paper)
- Measuring inter-application contention
- Related Work
- Summary
Measuring inter-application contention: L1-cache
[Figures: baseline and contention placements of Application 1 and Application 2 threads on cores C0–C3, with L1, L2, and memory]
Measuring inter-application contention: L2-cache
[Figures: baseline and contention placements of Application 1 and Application 2 threads on cores C0–C3, with L1, L2, and memory]
Measuring inter-application contention: FSB — baseline configuration
[Figure: Application 1 and Application 2 threads placed across cores C0–C7 on the eight-core platform, each group of cores behind its own L2 cache and FSB interface]
Measuring inter-application contention: FSB — contention configuration
[Figure: Application 1 and Application 2 threads placed across cores C0–C7 so that both applications share the front-side bus]
PARSEC Benchmarks

Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)
Experimental platform
Platform 1: Yorkfield (Intel Quad core Q9550)
- 32 KB L1-D and L1-I cache per core
- 6 MB L2 cache
- 2 GB memory
- Common front-side bus (FSB)
[Figure: cores C0–C3, each with a private L1 cache and L1 hardware prefetcher; pairs of cores behind shared L2 caches with L2 hardware prefetchers and FSB interfaces; the FSB connects to the Memory Controller Hub (Northbridge) and memory]
Experimental platform
Platform 2: Harpertown
[Figure: eight cores C0–C7; each core has a private L1 cache and L1 hardware prefetcher; pairs of cores share an L2 cache with its L2 hardware prefetcher and FSB interface; the FSB connects to the Memory Controller Hub (Northbridge) and memory]
Performance Analysis
Inter-application contention. For the i-th co-runner:

PercentPerformanceDifference_i = (PerformanceBase_i − PerformanceContend_i) × 100 / PerformanceBase_i

Absolute performance difference sum:

APDS = Σ_i | PercentPerformanceDifference_i |
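The two formulas above can be computed directly. A small sketch (function names are mine, not from the slides), treating "performance" as any higher-is-better metric derived from the hardware counters, such as IPC:

```python
def percent_performance_difference(base, contend):
    """Per-co-runner performance change relative to its baseline run, in percent."""
    return (base - contend) * 100.0 / base

def apds(bases, contends):
    """Absolute performance difference sum (APDS) over all co-runners."""
    return sum(abs(percent_performance_difference(b, c))
               for b, c in zip(bases, contends))

# Two co-runners: one loses 10% under contention, one gains 5%.
print(apds([100.0, 200.0], [90.0, 210.0]))  # 15.0
```

Taking absolute values before summing means slowdowns and speedups both count as sensitivity to the shared resource, rather than cancelling out.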
Inter-application contention: L1-cache — for Streamcluster
[Results chart]
Inter-application L1-cache contention: Streamcluster
[Results chart]
Inter-application contention: L1-cache
[Results chart]
Inter-application contention: L2-cache
[Results chart]
Inter-application contention: FSB
[Results chart]
Characterization

Benchmark       L1-cache   L2-cache       FSB
Blackscholes    none
Bodytrack       inter                     intra
Canneal         intra      inter          intra
Dedup           inter      intra, inter
Facesim         inter                     intra
Ferret          intra      intra, inter   intra
Fluidanimate    inter                     intra
Raytrace        none                      intra
Streamcluster   inter                     intra
Swaptions       none
Vips            intra      inter
X264            inter      intra, inter   intra
Summary
- The methodology generalizes contention analysis of multi-threaded applications
- New approach to characterize applications:
  - Useful for performance analysis of existing and future architectures or benchmarks
  - Helpful for creating new workloads with diverse properties
- Provides insights for designing improved contention-aware scheduling methods
Related Work
- Cache contention:
  - Knauerhase et al., IEEE Micro 2008
  - Zhuravlev et al., ASPLOS 2010
  - Xie et al., CMP-MSI 2008
  - Mars et al., HiPEAC 2011
- Characterizing parallel workloads:
  - Jin et al., NASA Technical Report 2009
- PARSEC benchmark suite:
  - Bienia et al., PACT 2008
  - Bhadauria et al., IISWC 2009
Thank you!