1 Zvika Guz 1, Oved Itzhak 1, Idit Keidar 1, Avinoam Kolodny 1, Avi Mendelson 2, and Uri C. Weiser 1 Threads vs. Caches: Modeling the Behavior of Parallel Workloads 1 Technion – Israel Institute of Technology, 2 Microsoft Corporation

2 Chip-Multiprocessor Era
Challenges:
 Single-core performance trend is gloomy
 Exploit chip-multiprocessors with multithreaded applications
 The memory gap is paramount: latency, bandwidth, power
[Figure: Hennessy and Patterson, Computer Architecture: A Quantitative Approach]
Two basic remedies:
 Cache – reduce the number of out-of-die memory accesses
 Multi-threading – hide memory accesses behind thread execution
How do they play together? How do we make the most of them?

3 Outline
 The many-core span: Cache-Machines ↔ MT-Machines
 A high-level analytical model
 Performance curves study: a few examples
 Summary

4 Outline
 The many-core span: Cache-Machines ↔ MT-Machines
 A high-level analytical model
 Performance curves study: a few examples
 Summary

5 Cache-Machines vs. MT-Machines
[Figure: the machine design space, cache size per thread vs. number of threads – the Uni-Processor and Multi-Core regions on the cache-rich side, the Cache Architecture region in between, and the MT Architecture region on the many-thread side; examples: Intel's Larrabee, Nvidia's GT200, Nvidia's Fermi]
Many-Core – CMP with many, simple cores
 Tens to hundreds of Processing Elements (PEs)
What are the basic tradeoffs? How will workloads behave across the range?
 Predicting performance

6 Outline
 The many-core span: Cache-Machines ↔ MT-Machines
 A high-level analytical model
 Performance curves study: a few examples
 Summary

7 A Unified Machine Model
Use both cache and many threads to shield memory accesses
 The uniform framework renders the comparison meaningful
 We derive simple, parameterized equations for performance, power, bandwidth, ...
[Figure: a sea of processing elements with per-thread architectural states, behind a shared cache connected to memory]

8 Cache Machines
Many cores (each may have its private L1) behind a shared cache
[Figure: performance vs. number of threads – performance climbs while the cache holds the threads' working sets, then falls off past the Cache Non-Effective (CNE) point]

9 Multi-Thread Machines
Memory latency shielded by multi-threaded execution
[Figure: per-thread architectural states feeding the PEs; performance vs. number of threads grows as more threads hide memory accesses, up to maximal performance, and is capped by bandwidth limitations]

10 Analysis (1/3)
Given a fraction of memory-access instructions r_m (0 ≤ r_m ≤ 1):
 On average, one of every 1/r_m instructions accesses memory
 A thread executes 1/r_m instructions, then stalls for t_avg cycles
 t_avg = Average Memory Access Time (AMAT) [cycles]
[Figure: execution timeline – a load (ld) every 1/r_m instructions, each followed by a t_avg-cycle stall]

11 Analysis (2/3)
A PE stays idle unless filled with instructions from other threads
 Each thread occupies the PE for CPI_exe/r_m cycles between stalls
 1 + r_m·t_avg/CPI_exe threads are needed to fully utilize each PE
[Figure: interleaved threads filling the t_avg-cycle stalls between loads]

12 Analysis (3/3)
Machine utilization, given n available threads:
 η = min(1, (n/N_PE) / (1 + r_m·t_avg/CPI_exe))
 where 1 + r_m·t_avg/CPI_exe is the number of threads needed to utilize a single PE
Performance in Operations Per Second [OPS]:
 Performance = η · N_PE · f / CPI_exe  (peak performance: N_PE · f / CPI_exe)

13 Performance Model
[Equations: performance as a function of PE utilization, off-chip BW, and power – not reproduced in the transcript]

14 Outline
 The many-core span: Cache-Machines ↔ MT-Machines
 A high-level analytical model
 Performance curves study: a few examples
 Summary

15 Unified Machine Performance
Three regions: the cache-efficiency region, the Valley, and the MT-efficiency region
[Figure: performance vs. number of threads – performance peaks in the cache region, dips into the Valley, then rises again in the MT region]

16 HW/SW Assumptions
Workloads:
 Can be parallelized into a large number of threads – no serial part
 Threads are independent of each other  no wait time/synchronization
 No data sharing  cache capacity divided among all running threads
 Cache hit rate function: [Figure]
Hardware:
 N_PE = 1024
 S_$ = 16 MByte
 CPI_exe = 1
 f = 1 GHz
 t_m = 200 cycles
 r_m = 0.2
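Putting the pieces together with the hardware values above reproduces the valley shape. The hit-rate function here is purely illustrative (a square-root capacity model with an invented 256 KB per-thread working set, and an assumed 10-cycle cache latency); the talk's real curves come from measured hit-rate functions:

```python
import math

# Machine parameters from the slide; T_CACHE = 10 cycles is an assumption.
N_PE, S_CACHE, CPI_EXE, F = 1024, 16 * 2**20, 1.0, 1e9
T_CACHE, T_MEM, R_M = 10, 200, 0.2
WORKING_SET = 256 * 2**10  # hypothetical per-thread working set [bytes]

def p_hit(n):
    # Cache capacity is divided evenly among the n running threads.
    return min(1.0, math.sqrt((S_CACHE / n) / WORKING_SET))

def perf(n):
    t_avg = T_CACHE + (1.0 - p_hit(n)) * T_MEM
    threads_needed = 1.0 + R_M * t_avg / CPI_EXE
    eta = min(1.0, (n / N_PE) / threads_needed)
    return eta * N_PE * F / CPI_EXE

for n in (64, 128, 512, 4096, 32768):
    print(f"{n:6d} threads: {perf(n):.2e} OPS")
```

With these made-up workload numbers the printed curve dips around a hundred threads and recovers only once thousands of threads hide the misses – the cache region, the Valley, and the MT region of the previous slide.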

17 Cache Size Impact
Increase in cache size  cache suffices for more in-flight threads  extends the cache region
...AND is also valuable in the MT region:
 Caches reduce off-chip bandwidth  delay the BW-saturation point

18 Memory Latency Impact
(Unlimited BW to memory)
Increase in memory latency  hinders the MT region
 Emphasizes the importance of caches

19 Hit Rate Function Impact
Simulation results from the PARSEC benchmark suite
Swaptions:  a perfect Valley

20 Hit Rate Function Impact
Simulation results from the PARSEC benchmark suite
Raytrace:  monotonically-increasing performance

21 Hit Rate Dependency – 3 Classes
Three application families, based on how the cache miss rate depends on the number of threads:
 A “strong” function of the number of threads – f(N^q) with q > 1
 A “weak” function of the number of threads – f(N^q) with q ≤ 1
 Not a function of the number of threads
[Figure: performance vs. number of threads for the three classes]
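The three classes can be caricatured with a one-parameter miss-rate family. The exponents and constants below are invented for illustration; in the talk, each application's f(N^q) comes from simulation fits:

```python
def miss_rate(n, base, q):
    """Miss rate modeled as base * n**q, clipped to 1.
    q > 1: 'strong' dependence on thread count; q <= 1: 'weak';
    q = 0: independent of thread count."""
    return min(1.0, base * n ** q)

for label, base, q in [("strong", 1e-4, 1.5),
                       ("weak",   1e-3, 0.5),
                       ("flat",   0.05, 0.0)]:
    print(label, [round(miss_rate(n, base, q), 4) for n in (16, 256, 4096)])
```

The "strong" family saturates quickly as threads are added (performance collapses into the Valley), the "weak" family degrades gently, and the flat family leaves performance monotonically increasing with thread count.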

22 Workload Parallelism Impact
Simulation results from the PARSEC benchmark suite
Canneal:  not enough parallelism available

23 Outline
 The many-core span: Cache-Machines ↔ MT-Machines
 A high-level analytical model
 Performance curves study: a few examples
 Summary

24 Summary
A high-level model for many-core engines
 A unified framework for machines and workloads from across the range
A vehicle to derive intuition:
 Qualitative study of the tradeoffs
 A tool to understand each parameter's impact
 Identifies new behaviors and the applications that exhibit them
 Enables reasoning about complex phenomena
A first step towards escaping the Valley
Thank You! zguz@tx.technion.ac.il

25 Backup

26 Model Parameters

27 Model Parameters
Machine parameters:
 N_PE – Number of PEs (in-order processing elements)
 S_$ – Cache size [Bytes]
 N_max – Maximal number of thread contexts in the register file
 CPI_exe – Average number of cycles required to execute an instruction assuming a perfect (zero-latency) memory system [cycles]
 f – Processor frequency [Hz]
 t_$ – Cache latency [cycles]
 t_m – Memory latency [cycles]
 BW_max – Maximal off-chip bandwidth [GB/sec]
 b_reg – Operand size [Bytes]

28 Model Parameters
Workload parameters:
 n – Number of threads that execute or are in ready state (not blocked) concurrently
 r_m – Fraction of memory-accessing instructions out of the total number of instructions [0 ≤ r_m ≤ 1]
 P_hit(s, n) – Cache hit rate for each thread, when n threads are using a cache of size s

29 Model Parameters
Power parameters:
 e_ex – Energy per operation [J]
 e_$ – Energy per cache access [J]
 e_mem – Energy per memory access [J]
 Power_leakage – Leakage power [W]

30 PARSEC Workloads

31 Model Validation, PARSEC Workloads

32 Related Work

33 Related Work
Similar approach of using high-level models:
 Morad et al., CA Letters, 2005
 Hill and Marty, IEEE Computer, 2008
 Eyerman and Eeckhout, ISCA 2010
 Agarwal, TPDS 1992
 Saavedra-Barrera and Culler, Berkeley 1991
 Sorin et al., ISCA 1998
 Hong and Kim, ISCA 2009
 Baghsorkhi et al., PPoPP 2010
[Figure: the machine design-space diagram from slide 5, annotated with where each related model applies]

