CSC Multiprocessor Programming, Spring, 2012 Chapter 11 – Performance and Scalability Dr. Dale E. Parson, week 12
Multithreading Overhead Multithreading always introduces overhead. Costly synchronization code even without contention. Synchronization delays when contention occurs. Hardware synchronization (flushes at barriers). Increased context switching. Thread scheduling. Thread creation and destruction. How do we keep the hardware busy? What are the bottlenecks?
Scalability Ability to improve throughput or capacity when adding computing resources -- CPUs, memory, long-term storage, or I/O bandwidth. The bottleneck moves somewhere else. Throughput measures total data processed. Latency measures delays in event responses. Sometimes they trade one against the other. Use a better algorithm.
Avoid premature tradeoffs Make it correct, the fast. An initial correct solution, even if slower, helps you think through the problem, gives you output reference data for when you perform code improvements, and gives you some reusable code. It is like learning how to drive! Know your code libraries.
Amdahl’s Law Speedup <= 1.0 /(F + ((1.0 - F) / N)) F is fraction of computation that is serial. N is number of processors. Sources of serialization – Intrinsic to algorithm (e.g., partitioning in quicksort). Synchronization (e.g., CyclicBarrier in penny – dime). Access to shared data structures such as work queues. Task submission, result processing, join. Hardware serialization (e.g., memory, I/O bandwidth).
Context switching Operating system calls and complex JVM code. Cache and page faults to load and flush a thread’s working set (of data) on a switch. I/O bound threads block more often that CPU bound, resulting in more context switches. High kernel usage (> 10%) indicates heavy scheduling activity, may be caused by frequent blocking due to I/O or lock contention.
Memory synchronization Memory barriers Flush and invalidate registers, data caches, write buffers, stall execution pipelines. Uncontended synchronization overhead is low. JVM run-time compiler can use escape analysis to identify when a “locked” object does not escape its thread, removing the lock. Compiler lock coarsening can merge adjacent requests for the same lock.
Memory sync and blocking Lock elision on locks to thread-confined objects is in IBM JVM and slated for HotSpot 7. Synchronization creates traffic on memory buses. Clear, explicit Java Memory Model creates opportunities for compiler optimizations. Blocking overhead – Spin waiting. O.S. system calls and context switching overhead.
Reducing lock contention Reduce duration during which locks are held. Reduce the frequency of lock requests. Replace exclusive locks with coordination mechanisms that permit greater concurrency. Prefer a dataflow architecture when possible. Lock splitting for independent sets of shared variables guarded by a single lock. Lock striping reduces overhead of excess locks.
Hot fields, exclusive locks A hot field is a single field such as a counter that is accessed by many threads. Parallelize data structures when possible. Merge them in a subset of threads as needed. Cache merged results if invalidating cache is fast. Use alternatives to exclusive locks. ReadWriteLock Atomic variables and volatiles Queues and library classes with reduced or split locks.
Monitoring CPU utilization top, mpstat, vmstat, iostat, sar, cpustat Process and resource profilers and low-level activity register data analyzers. See Summer of 2010 report, final page. Also commercial code profilers for multithreading. Causes of CPU underutilization – Insufficient load I/O bound processing External bottlenecks Lock contention
Miscellaneous Thread pools? Yes Object pooling? No Dynamic allocation and garbage collection is cheaper and easier than application-driven object pools and application-level synchronization of object pools. Reducing context switching overhead. Make some portion of the application architecture CPU intensive, partition it so it can be multithreaded. Let some partitioned set of threads handle the buffered I/O. Only they are candidates for I/O blocking. May introduce possibility of a serial bottleneck in the I/O threads. Use appropriate queuing.