Published by Daniela Boyd. Modified over 9 years ago.
1
Concurrency in Programming Languages
Matthew J. Sottile, Timothy G. Mattson, Craig E Rasmussen
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
2
Chapter 8 Objectives
Understand the performance implications of the memory hierarchy.
Look at Amdahl’s law and the concept of parallel speedup.
Understand the sources of performance overhead that limit parallel speedup.
3
Outline
CPU vs. Memory Performance
–Impact on concurrent programs
Amdahl’s law and parallel speedup
Overhead associated with lock-based algorithms
Thread overhead considerations
4
CPU vs. Memory Performance Gap
A CPU cycle is much shorter than the latency of a memory access.
This means that a single memory access wastes many cycles if the CPU has to wait for it.
–Utilization of the machine decreases.
John Backus coined the term for this gap between CPU and memory performance in 1977:
–The von Neumann bottleneck
5
von Neumann Bottleneck
Named after John von Neumann, the inventor of the model of computation used in modern computers.
Stored-program computers connect a memory to a CPU; instructions and data are fetched, decoded, executed, and stored.
Fetching and storing require frequent movement across the CPU/memory boundary.
–This implies that the latency between CPU and memory is a primary performance factor.
6
Hardware solution
The solution: hardware assistance to hide the latency to memory.
[Figure labels: Memory, CPU, Slow]
7
Hardware solution
The solution: hardware assistance to hide the latency to memory.
A small, low-latency cache memory holds copies of the memory locations near recent accesses.
[Figure labels: Memory, CPU, Slow, Fast, Cache]
8
Common SMP activity overheads
The gap between cache latency (an L1 cache hit) and main memory access (a miss in all caches) is big!
9
Locality
The key is to exploit the locality properties of programs.
–Given an access to location X, it is likely that subsequent accesses will be very close to X (spatial locality).
–Furthermore, these accesses will happen soon after location X was accessed (temporal locality).
These properties occur commonly in programs, which is why caches are so successful and widespread.
10
Violating locality
Locality implies that the data layout in memory is related to the access pattern.
–E.g.: Accessing a matrix one row at a time means we want rows to be contiguous in memory.
If the access pattern doesn’t match the data layout, we lose locality.
–The performance enhancement due to the cache goes away.
–E.g.: Laying out a matrix with columns contiguous, but accessing it one row at a time.
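A minimal sketch of the matrix example above, counting how many distinct cache lines one row traversal touches under each layout. The line size of 8 elements, the function names, and the assumption that the array starts on a cache-line boundary are all illustrative choices, not taken from the slides.

```python
# Assumption: a 64-byte cache line holding 8 elements of 8 bytes each.
CACHE_LINE_ELEMS = 8

def row_major_offset(i, j, n_rows, n_cols):
    """Linear offset of element (i, j) when rows are contiguous."""
    return i * n_cols + j

def col_major_offset(i, j, n_rows, n_cols):
    """Linear offset of element (i, j) when columns are contiguous."""
    return j * n_rows + i

def lines_touched_by_row(offset_fn, row, n_rows, n_cols):
    """Count the distinct cache lines touched while reading one full row."""
    lines = {
        offset_fn(row, j, n_rows, n_cols) // CACHE_LINE_ELEMS
        for j in range(n_cols)
    }
    return len(lines)

if __name__ == "__main__":
    n = 64
    print("row-major:", lines_touched_by_row(row_major_offset, 0, n, n))
    print("col-major:", lines_touched_by_row(col_major_offset, 0, n, n))
```

Under these assumptions a row of a 64 x 64 matrix fits in 8 cache lines when rows are contiguous, but touches 64 different lines when columns are contiguous, so each access brings in a line of which only one element is used.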
11
Parallelism and memory
The cache solution to the memory latency problem becomes complicated when multiple CPUs or cores share a single main memory.
This is the case in a common multicore system.
How do we:
–Maintain the cache structure to hide latency,
–While ensuring that all cores see a consistent view of memory?
12
The Problem
Say core 1 reads a memory location X and pulls X and its neighbors on the same cache line into its L1 cache.
Core 2 then writes to memory location X, changing the value in its own cache and in main memory.
How do we ensure that core 1 does not use the out-of-date copy of location X?
…Cache coherence protocols.
13
Cache coherence protocols
Cache coherence protocols maintain this coherent view of memory for each core.
When memory is accessed (either read or written), caches observe activity on the bus and update the state of the data they contain:
–Invalidating it if it is no longer up to date
–Marking values that are replicated between caches as shared
–Marking values as modified
–Establishing exclusive ownership of a memory location
14
Cache coherence protocols
The protocol defines a state machine that tells the cache how to treat the data it contains.
–The cache reads a value from memory that no other core has read and marks it as exclusively owned.
–The cache observes another core reading the same memory and transitions the state from exclusive to shared.
–The cache observes another core writing to that memory and transitions the state to invalid, forcing an update if the data is accessed again.
Many protocols exist, and different ones are used by different CPU manufacturers.
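The state machine described above can be sketched as a transition table for a single cache line. This is a simplified illustration loosely modeled on MESI-style protocols, not the protocol of any particular CPU; the event names are hypothetical and unknown state/event pairs are assumed to leave the state unchanged.

```python
# (current state, observed event) -> next state, for one cache line in one cache.
# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    ("invalid", "local_read_unshared"): "exclusive",  # no other core holds it
    ("invalid", "local_read_shared"): "shared",       # another core holds it
    ("invalid", "local_write"): "modified",
    ("exclusive", "remote_read"): "shared",
    ("exclusive", "local_write"): "modified",
    ("exclusive", "remote_write"): "invalid",
    ("shared", "local_write"): "modified",
    ("shared", "remote_write"): "invalid",
    ("modified", "remote_read"): "shared",            # after writing the data back
    ("modified", "remote_write"): "invalid",
}

def next_state(state, event):
    """Return the next state; pairs not in the table keep the current state."""
    return TRANSITIONS.get((state, event), state)

def run(events, state="invalid"):
    """Apply a sequence of events to a line that starts in the given state."""
    for event in events:
        state = next_state(state, event)
    return state

if __name__ == "__main__":
    # The three steps listed on the slide: exclusive, then shared, then invalid.
    print(run(["local_read_unshared"]))
    print(run(["local_read_unshared", "remote_read"]))
    print(run(["local_read_unshared", "remote_read", "remote_write"]))
```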
15
Implications for performance
A parallel algorithm should avoid causing the cache coherence (CC) protocol to frequently invalidate cached data, which would result in frequent accesses to slow main memory.
In other words, cores should be used in a way that reduces the overlap of their memory accesses.
16
Example
Consider three threads across three cores.
In the bad case, the cores access memory locations near each other, resulting in frequent invalidations of cached data.
In the good case, they reduce the overlap to a minimal set of locations at the boundaries of the regions that they are using.
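One way to picture the bad and good cases is to compare an interleaved assignment of array elements to threads with a contiguous block assignment, and count the cache lines that more than one thread touches. The sketch below assumes 8 elements per cache line and an array aligned to a line boundary; the partitioning functions are illustrative, not from the slides.

```python
# Assumption: 8 array elements fit in one cache line.
CACHE_LINE_ELEMS = 8

def cyclic_partition(n, threads):
    """Bad case: thread t gets elements t, t + threads, t + 2*threads, ..."""
    return [list(range(t, n, threads)) for t in range(threads)]

def block_partition(n, threads):
    """Good case: each thread gets one contiguous region."""
    size = (n + threads - 1) // threads
    return [list(range(t * size, min((t + 1) * size, n))) for t in range(threads)]

def shared_lines(partition):
    """Count the cache lines touched by more than one thread."""
    owners = {}
    for thread_id, indices in enumerate(partition):
        for i in indices:
            owners.setdefault(i // CACHE_LINE_ELEMS, set()).add(thread_id)
    return sum(1 for threads in owners.values() if len(threads) > 1)

if __name__ == "__main__":
    print("interleaved:", shared_lines(cyclic_partition(100, 3)))
    print("blocked:    ", shared_lines(block_partition(100, 3)))
```

With 100 elements and three threads, every cache line is shared in the interleaved case, while the blocked case shares only the lines that straddle the two region boundaries.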
17
Data layout
The key is to understand how the data layout of a parallel program relates to the access pattern of the concurrently executing threads.
For languages, this requires understanding the layout patterns that are assumed for data structures like arrays.
Example: row-major versus column-major layout of multidimensional arrays.
–Fortran vs. C style
18
Measurement
How do we measure whether our programs suffer from performance problems related to the cache?
Performance counters.
Most modern CPUs provide registers that count events such as cache misses and coherence protocol activity.
Profiling code and looking at those counters in regions of the parallel program can give insight into whether the cache is a cause of performance problems.
19
Outline
CPU vs. Memory Performance
–Impact on concurrent programs
Amdahl’s law and parallel speedup
Overhead associated with lock-based algorithms
Thread overhead considerations
20
Speedup
Why do we go parallel?
Often we have a problem of fixed size that we want to solve faster by having multiple cores work on parts of it at once.
How do we measure the benefit of parallelism?
Speedup: how much faster the problem is solved in parallel compared to a sequential version.
21
Perfect speedup
The sequential version of the program takes time t_s.
Given p processors, the parallel version would ideally take time t_p = t_s / p.
A class of problems known as embarrassingly parallel problems has speedups that approach this.
–No dependencies between parallel threads: all of them execute without ever interacting in any way with the others.
22
Realistic speedup
Even embarrassingly parallel problems don’t have perfect speedup, because there is typically time required to generate the work to execute in parallel and to gather the results of the parallel threads to form the final solution.
–These activities are typically sequential.
We therefore break the time to execute a parallel program into the parts that are sequential and the parts that are parallel.
–T = t_s + t_p, where t_s is now the time spent in the sequential parts and t_p the time spent in the parallel parts.
23
Realistic speedup
T = t_s + t_p
As the number of parallel threads increases, t_p approaches a very small constant.
Therefore, in the limit, the performance of the parallel program is bounded by the parts that are intrinsically sequential.
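The limit argument above can be written out as follows. This is a sketch under the idealized assumption that the parallel part divides evenly over p processors with no overhead, so it shrinks toward zero rather than toward a small constant; the symbols t_p(1) (the parallel part's time on one processor) and S(p) (the speedup on p processors) are notation introduced here.

```latex
T(p) = t_s + \frac{t_p(1)}{p},
\qquad
S(p) = \frac{T(1)}{T(p)} = \frac{t_s + t_p(1)}{t_s + t_p(1)/p},
\qquad
\lim_{p \to \infty} S(p) = \frac{t_s + t_p(1)}{t_s} = 1 + \frac{t_p(1)}{t_s}
```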
24
Amdahl’s law
This is Amdahl’s law.
The speedup of a program that has any components that are not concurrent can never be perfect, as it is ultimately bounded by the sequential parts.
In some cases, this means the code will never run faster than the time needed to set up and finalize the problem.
In more realistic cases, sequentialization occurs due to synchronization between cores during parallel execution.
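Amdahl’s law is commonly stated in terms of the fraction f of the run time that is sequential: with p processors the speedup is 1 / (f + (1 - f) / p), which can never exceed 1 / f. A small sketch of that formula, with function names chosen for this example:

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's law for a given sequential fraction."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

def speedup_limit(serial_fraction):
    """Upper bound on speedup as the number of processors grows without limit."""
    if serial_fraction == 0:
        return float("inf")
    return 1.0 / serial_fraction

if __name__ == "__main__":
    for p in (1, 2, 8, 64, 1024):
        print(p, round(amdahl_speedup(0.1, p), 2))
    print("limit:", speedup_limit(0.1))
```

With 10% of the run time sequential, adding processors beyond a few dozen yields little further benefit, since the speedup can only approach 10.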
25
Outline
CPU vs. Memory Performance
–Impact on concurrent programs
Amdahl’s law and parallel speedup
Overhead associated with lock-based algorithms
Thread overhead considerations
26
Locks and mutual exclusion
Recall that mutual exclusion means that, for a critical section of code, only one thread is ever allowed to execute inside it at a given time.
Concurrent threads of execution are therefore serialized as they pass through critical sections.
Locking is often based on the use of critical sections.
–Not always: hardware assistance may exist.
–This discussion assumes the case where it does not.
27
Lock overhead
Locking induces overhead.
Locking requires entry into mutually exclusive regions of code.
–This leads to potential serialization if there is contention for these regions.
–Some cores sit idle waiting their turn.
The use of locks can cause performance degradation as a result.
28
Serialization due to locks
If many threads simultaneously attempt to acquire the same lock, they will be serialized.
–They acquire the lock one at a time, and each sits waiting until its turn comes.
A frequently used lock may cause frequent contention.
–This corresponds to frequent serialization and a large amount of time spent in an unproductive wait state.
29
Conservative locking
A common problem in concurrent programs is excessive, conservative locking: acquiring a lock even when, ultimately, it was not necessary.
–E.g.: protecting a critical section that exists in a rarely executed branch of a conditional, but acquiring the lock before testing the condition.
Excessive use of the Java “synchronized” keyword can have this result.
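A sketch of the conditional example above, comparing the two orderings. It assumes the condition depends only on data private to the thread (here, the item being processed); if the condition read shared state, it would have to be re-checked after acquiring the lock. The `CountingLock` wrapper and both function names are hypothetical helpers for illustration.

```python
import threading

class CountingLock:
    """Wrapper around threading.Lock that counts acquisitions (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1  # safe: updated while holding the lock
        return self

    def __exit__(self, *exc_info):
        self._lock.release()
        return False

def conservative(items, lock, rare_log):
    """Acquire the lock for every item, even though it is rarely needed."""
    for x in items:
        with lock:
            if x < 0:
                rare_log.append(x)

def check_first(items, lock, rare_log):
    """Test the thread-private condition first; lock only in the rare branch."""
    for x in items:
        if x < 0:
            with lock:
                rare_log.append(x)

if __name__ == "__main__":
    data = [1, 2, -3, 4, 5, -6, 7, 8]
    a, b = CountingLock(), CountingLock()
    conservative(data, a, [])
    check_first(data, b, [])
    print(a.acquisitions, b.acquisitions)
```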
30
Lock overhead
Even in the case where serialization does not occur, locks induce overhead.
–Time to acquire the lock
–Time to release the lock
If a critical section protected by a lock is executed frequently, we pay this price every time it is executed.
–This can add up and reduce speedup, since lock overhead is additional work not present in the sequential algorithm.
31
Optimistic schemes
Optimistic schemes avoid locking unless it is absolutely necessary.
Consider a program in which it is very rare for multiple threads to attempt a lock-protected activity at the same time.
–It might be cheaper not to lock, but instead to check whether a conflict occurred after doing the work, and pay the penalty of cleaning up the mess in this rare case.
This is the basis of the transaction concept.
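A minimal sketch of the check-after-working idea, using a version number to detect conflicts. The work itself runs without holding any lock; only the final validate-and-commit step is atomic. Real implementations typically use a hardware compare-and-swap for that step; since the Python standard library does not expose one, a very short lock stands in for it here. `VersionedCell` and `optimistic_update` are hypothetical names for this example.

```python
import threading

class VersionedCell:
    """A shared value paired with a version number that changes on every commit."""

    def __init__(self, value):
        # (value, version) is replaced as a single tuple so readers see a
        # consistent pair without locking.
        self.state = (value, 0)
        self._cas_lock = threading.Lock()  # stands in for an atomic compare-and-swap

    def try_commit(self, seen_version, new_value):
        """Commit only if nobody else committed since seen_version was read."""
        with self._cas_lock:
            if self.state[1] != seen_version:
                return False  # conflict detected: caller must redo the work
            self.state = (new_value, seen_version + 1)
            return True

def optimistic_update(cell, fn):
    """Apply fn to the cell's value, retrying on conflict; return the retry count."""
    retries = 0
    while True:
        value, version = cell.state   # read without locking
        new_value = fn(value)         # do the work speculatively
        if cell.try_commit(version, new_value):
            return retries
        retries += 1                  # the rare case: discard the work and retry

if __name__ == "__main__":
    cell = VersionedCell(0)
    workers = [
        threading.Thread(
            target=lambda: [optimistic_update(cell, lambda v: v + 1) for _ in range(1000)]
        )
        for _ in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(cell.state[0])
```

The scheme pays off when `fn` is expensive relative to the commit and conflicts are rare, which is the situation the slide describes.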
32
Outline
CPU vs. Memory Performance
–Impact on concurrent programs
Amdahl’s law and parallel speedup
Overhead associated with lock-based algorithms
Thread overhead considerations
33
Thread overhead
Threads themselves are usually not free.
Some overhead exists to schedule threads onto cores, and to create and destroy them.
This overhead varies drastically between systems.
–E.g.: Kernel thread vs. process vs. user thread based runtimes.
34
Thread overhead
Understanding the thread overhead for the system you choose to use is important.
How many threads should be used?
What granularity of computation should individual threads handle?
Should existing threads be reused for new work, or should fresh threads be created?
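One common answer to the reuse question is a thread pool: a fixed set of threads is created once and handed chunks of work, so creation and destruction costs are not paid per task. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the chunk size and worker count are arbitrary illustrative values, and the function names are chosen for this example. In CPython this shows the structure only, since the global interpreter lock keeps CPU-bound threads from running in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(data, chunk_size):
    """Split data into consecutive chunks; chunk_size sets the task granularity."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def parallel_sum_of_squares(data, workers=4, chunk_size=1000):
    """Sum x*x over data, reusing a fixed pool of threads for all chunks."""
    def work(chunk):
        return sum(x * x for x in chunk)

    # The pool creates at most `workers` threads and reuses them for every chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(work, chunked(data, chunk_size)))

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10_000))))
```

A larger `chunk_size` means fewer, longer tasks and less per-task overhead; a smaller one gives finer load balancing at a higher overhead.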
35
Thread overhead
Consider these three performance metrics:
–Computation requires t_c time.
–Thread creation and scheduling takes t_s time.
–Thread destruction and cleanup takes t_f time.
The overall time for the computation is therefore:
–t_c + t_s + t_f
36
Implication
The implication is that we would ideally choose computations to execute in a thread that take long enough that t_s and t_f are negligible by comparison.
If we choose too small a computation, the time to execute the thread may be dominated by the time to start and stop the thread itself.
–This reduces the effectiveness of the parallel algorithm!
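To put numbers on this, the fraction of a thread's lifetime spent on useful work is t_c / (t_c + t_s + t_f). The sketch below computes that fraction and, by rearranging the same expression, the smallest t_c that reaches a target fraction. The function names and the example timings are illustrative, not measurements.

```python
def useful_fraction(t_c, t_s, t_f):
    """Fraction of total thread time spent on the computation itself."""
    return t_c / (t_c + t_s + t_f)

def min_compute_time(t_s, t_f, target):
    """Smallest t_c for which useful_fraction(t_c, t_s, t_f) reaches target.

    Derived from t_c / (t_c + t_s + t_f) >= target, with 0 < target < 1.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return target * (t_s + t_f) / (1.0 - target)

if __name__ == "__main__":
    # Hypothetical costs in arbitrary time units.
    print(useful_fraction(1, 5, 5))      # tiny task: overhead dominates
    print(useful_fraction(1000, 5, 5))   # large task: overhead is negligible
    print(min_compute_time(5, 5, 0.9))
```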