1
Threads vs. Caches: Modeling the Behavior of Parallel Workloads
Zvika Guz¹, Oved Itzhak¹, Idit Keidar¹, Avinoam Kolodny¹, Avi Mendelson², and Uri C. Weiser¹
¹Technion – Israel Institute of Technology, ²Microsoft Corporation
2
Chip-Multiprocessor Era
Challenges:
- Single-core performance trend is gloomy – exploit chip-multiprocessors with multithreaded applications
- The memory gap is paramount: latency, bandwidth, power
[Figure: Hennessy and Patterson, Computer Architecture: A Quantitative Approach]
Two basic remedies:
- Cache – reduce the number of off-chip memory accesses
- Multi-threading – hide memory accesses behind the execution of other threads
How do they play together? How do we make the most of them?
3
Outline
- The many-core span: Cache-Machines ↔ MT-Machines
- A high-level analytical model
- Performance curves study
- A few examples
- Summary
4
Outline
- The many-core span: Cache-Machines ↔ MT-Machines
- A high-level analytical model
- Performance curves study
- A few examples
- Summary
5
Cache-Machines vs. MT-Machines
Many-Core – CMP with many, simple cores; tens to hundreds of Processing Elements (PEs)
[Figure: the design space of cache per thread vs. # of threads, spanning the Uni-Processor and Multi-Core regions through the Cache Architecture region to the MT Architecture region; example machines marked include Intel's Larrabee, Nvidia's GT200, and Nvidia's Fermi]
- What are the basic tradeoffs?
- How will workloads behave across the range?
- Predicting performance
6
Outline
- The many-core span: Cache-Machines ↔ MT-Machines
- A high-level analytical model
- Performance curves study
- A few examples
- Summary
7
A Unified Machine Model
- Use both cache and many threads to shield memory accesses
- The uniform framework renders the comparison meaningful
- We derive simple, parameterized equations for performance, power, BW, ...
[Figure: the unified machine – an array of processing elements, a pool of thread architectural states, and a shared cache connected to memory]
8
Cache Machines
Many cores (each may have its private L1) behind a shared cache
[Figure: PEs behind a shared cache connected to memory; performance vs. # of threads curve, marking the Cache Non-Effective (CNE) point beyond which the cache no longer holds the threads' working sets]
9
Multi-Thread Machines
Memory latency shielded by multi-threaded execution
[Figure: PEs with many thread architectural states connected directly to memory; performance vs. # of threads climbs toward max performance, subject to bandwidth limitations; timeline shows execution interleaved with memory accesses]
10
Analysis (1/3)
- Given a ratio of memory-access instructions r_m (0 ≤ r_m ≤ 1), every 1/r_m-th instruction accesses memory
- A thread executes 1/r_m instructions, then stalls for t_avg cycles
- t_avg = Average Memory Access Time (AMAT) [cycles]
[Figure: execution timeline – a thread runs until a load (ld), then stalls]
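The AMAT expression itself appears only as an image in the deck; a sketch of it from the parameters defined in the backup slides (cache latency t_$, memory latency t_m, hit rate P_hit(S_$, n)):

$$ t_{avg} = t_{\$} + \left(1 - P_{hit}(S_{\$}, n)\right) \cdot t_m $$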
11
Analysis (2/3)
- A PE stays idle unless filled with instructions from other threads
- Each thread occupies the PE for additional cycles, so multiple threads are needed to fully utilize each PE
[Figure: execution timeline – other threads fill the stall cycles after a load (ld)]
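The accompanying equation is likewise an image; a sketch from the surrounding definitions: a thread occupies its PE for CPI_exe/r_m cycles between memory accesses and then stalls for t_avg, so the number of threads needed to fully utilize one PE is

$$ n_{needed} = 1 + \frac{t_{avg} \cdot r_m}{CPI_{exe}} $$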
12
Analysis (3/3)
- Machine utilization: a function of the number of available threads and the # of threads needed to utilize a single PE
- Performance in Operations Per Second [OPS]: utilization × peak performance
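Putting the previous quantities together (again a sketch of the image-only formulas, with n available threads, N_PE PEs, and frequency f):

$$ \eta(n) = \min\left(1,\ \frac{n / N_{PE}}{1 + t_{avg} \cdot r_m / CPI_{exe}}\right), \qquad Perf(n) = \eta(n) \cdot \frac{N_{PE} \cdot f}{CPI_{exe}}\ \ [\text{OPS}] $$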
13
Performance Model
[Equations shown as images in the deck, labeled: PE Utilization, Off-Chip BW, Power]
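To make the ingredients concrete, here is a minimal Python sketch assembling the model as the preceding slides describe it. The hit rate is taken as an input rather than modeled, and the t_$, BW_max, and b_reg values in the example call are assumptions, since they do not appear on the assumptions slide:

def amat(p_hit, t_cache, t_mem):
    """Average memory access time [cycles]: t_avg = t_$ + (1 - P_hit) * t_m."""
    return t_cache + (1.0 - p_hit) * t_mem

def performance_ops(n, n_pe, cpi_exe, f, r_m, p_hit, t_cache, t_mem,
                    bw_max, b_reg):
    """Performance [operations/sec] for n concurrent threads (model sketch).

    f is in Hz; bw_max is in bytes/sec; b_reg is the operand size in bytes.
    """
    t_avg = amat(p_hit, t_cache, t_mem)
    # Threads needed to keep one PE busy through a t_avg-cycle stall.
    threads_per_pe = 1.0 + t_avg * r_m / cpi_exe
    utilization = min(1.0, (n / n_pe) / threads_per_pe)
    perf = utilization * n_pe * f / cpi_exe
    # Off-chip bandwidth cap: on average each operation moves
    # r_m * (1 - P_hit) * b_reg bytes across the chip boundary.
    bytes_per_op = r_m * (1.0 - p_hit) * b_reg
    if bytes_per_op > 0:
        perf = min(perf, bw_max / bytes_per_op)
    return perf

# Example with the slide-16 values (t_$ = 1 cycle, unlimited BW,
# b_reg = 4 bytes are assumed for illustration):
gops = performance_ops(n=8192, n_pe=1024, cpi_exe=1.0, f=1e9, r_m=0.2,
                       p_hit=0.8, t_cache=1, t_mem=200,
                       bw_max=float("inf"), b_reg=4) / 1e9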
14
Outline
- The many-core span: Cache-Machines ↔ MT-Machines
- A high-level analytical model
- Performance curves study
- A few examples
- Summary
15
Unified Machine Performance
3 regions: the cache-efficiency region, the Valley, and the MT-efficiency region
[Figure: performance vs. # of threads – the curve peaks in the cache region, dips into the Valley, and climbs again in the MT region]
16
HW/SW Assumptions
Workloads:
- Can be parallelized into a large number of threads (no serial part)
- Threads are independent of each other (no wait time/synchronization)
- No data sharing: cache capacity is divided among all running threads
- Cache hit rate function: [shown as a curve in the deck]
Hardware:
- N_PE = 1024
- S_$ = 16 MByte
- CPI_exe = 1
- f = 1 GHz
- t_m = 200 cycles
- r_m = 0.2
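A quick worked example with the table's values (illustrative, taking the worst case P_hit = 0 and neglecting cache latency, so t_avg ≈ t_m = 200 cycles): each PE needs about 1 + 0.2 · 200 / 1 = 41 threads to stay busy, so saturating all 1024 PEs takes roughly 42K concurrent threads, at which point performance reaches the peak of N_PE · f / CPI_exe = 1024 GOPS.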
17
Cache Size Impact
Increase in cache size → the cache suffices for more in-flight threads → extends the $ region
... AND is also valuable in the MT region: caches reduce off-chip bandwidth, delaying the BW saturation point
[Figure: performance curves for increasing cache sizes]
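The bandwidth claim can be read off the off-chip traffic term (a sketch using the model's parameters): performance is capped at

$$ Perf_{BW} = \frac{BW_{max}}{r_m \cdot \left(1 - P_{hit}(S_{\$}, n)\right) \cdot b_{reg}} $$

so a larger S_$ raises P_hit and pushes the saturation point to higher thread counts.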
18
Memory Latency Impact
Increase in memory latency:
- Hinders the MT region
- Emphasizes the importance of caches
[Figure: performance curves for increasing memory latency, assuming unlimited BW to memory]
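In model terms, a larger t_m raises t_avg almost one-for-one, so the thread count needed per PE grows linearly: with the slide-16 parameters, doubling t_m from 200 to 400 cycles raises the requirement from about 1 + 0.2 · 200 = 41 to 1 + 0.2 · 400 = 81 threads per PE (a back-of-the-envelope reading, assuming P_hit ≈ 0).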
19
Hit Rate Function Impact
Simulation results from the PARSEC benchmark suite
Swaptions: a perfect Valley
[Figure: measured performance vs. # of threads for swaptions]
20
Hit Rate Function Impact
Simulation results from the PARSEC benchmark suite
Raytrace: monotonically increasing performance
[Figure: measured performance vs. # of threads for raytrace]
21
Hit Rate Dependency – 3 Classes
Three application families, based on how the cache miss rate depends on the number of threads:
- A "strong" function of the number of threads – f(N^q) with q > 1
- A "weak" function of the number of threads – f(N^q) with q ≤ 1
- Not a function of the number of threads
[Figure: representative performance vs. # of threads curves for the three classes]
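One illustrative way to instantiate the classes (an assumed form, not from the deck) is a miss rate $1 - P_{hit} = \min(1, c \cdot n^{q})$: for q > 1 the miss rate, and with it t_avg, grows faster than added threads can compensate, carving a deep Valley; for q ≤ 1 the dip is shallow; and a thread-independent miss rate yields monotonically increasing performance.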
22
Workload Parallelism Impact
Simulation results from the PARSEC benchmark suite
Canneal: not enough parallelism available
[Figure: measured performance vs. # of threads for canneal]
23
Outline
- The many-core span: Cache-Machines ↔ MT-Machines
- A high-level analytical model
- Performance curves study
- A few examples
- Summary
24
Summary
- A high-level model for many-core engines: a unified framework for machines and workloads from across the range
- A vehicle to derive intuition: a qualitative study of the tradeoffs and a tool to understand parameter impact
- Identifies new behaviors and the applications that exhibit them; enables reasoning about complex phenomena
- A first step towards escaping the Valley
Thank You! zguz@tx.technion.ac.il
25
Backup
26
Model Parameters
27
Model Parameters
Machine parameters:
- N_PE – Number of PEs (in-order processing elements)
- S_$ – Cache size [Bytes]
- N_max – Maximal number of thread contexts in the register file
- CPI_exe – Average number of cycles required to execute an instruction assuming a perfect (zero-latency) memory system [cycles]
- f – Processor frequency [Hz]
- t_$ – Cache latency [cycles]
- t_m – Memory latency [cycles]
- BW_max – Maximal off-chip bandwidth [GB/sec]
- b_reg – Operand size [Bytes]
28
Model Parameters
Workload parameters:
- n – Number of threads that execute or are in ready state (not blocked) concurrently
- r_m – Fraction of instructions accessing memory out of the total number of instructions [0 ≤ r_m ≤ 1]
- P_hit(s, n) – Cache hit rate for each thread, when n threads are using a cache of size s
29
Model Parameters
Power parameters:
- e_ex – Energy per operation [J]
- e_$ – Energy per cache access [J]
- e_mem – Energy per memory access [J]
- Power_leakage – Leakage power [W]
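The power equation these parameters feed is not spelled out in the extracted deck; a sketch consistent with them, as dynamic energy per executed operation plus leakage:

$$ Power = Perf \cdot \left( e_{ex} + r_m \cdot e_{\$} + r_m \cdot \left(1 - P_{hit}\right) \cdot e_{mem} \right) + Power_{leakage} $$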
30
PARSEC Workloads
31
Model Validation, PARSEC Workloads
32
Related Work
33
Related Work
Similar approach of using high-level models:
- Morad et al., CA-Letters 2005
- Hill and Marty, IEEE Computer 2008
- Eyerman and Eeckhout, ISCA-2010
- Agrawal, TPDS-1992
- Saavedra-Barrera and Culler, Berkeley 1991
- Sorin et al., ISCA-1998
- Hong and Kim, ISCA-2009
- Baghsorkhi et al., PPoPP-2010
[Figure: the related work positioned on the cache-per-thread vs. # of threads design-space chart from slide 5]