Executing Parallel Programs with Potential Bottlenecks Efficiently
Yoshihiro Oyama, Kenjiro Taura, Akinori Yonezawa
{oyama, tau,
University of Tokyo
Bottlenecks
Research context: implementing a concurrent OO language on SMP or DSM machines.
[Figure: concurrent invocations of exclusive methods (e.g., synchronized methods in Java) update a bottleneck object (e.g., a shared counter). The methods are serialized. The execution time here is very large.]
Speedup Curves for Programs with Bottlenecks
[Figure: execution time versus number of processors, with one curve labeled "ideal" and one labeled "reality"; annotation: "Good compilers should give this curve!"]
We may execute a program on too many processors (because it is not always easy to predict dynamic behavior).
Goal
Making the whole execution time on multiprocessors close to the time to sequentially execute bottlenecks only.
[Figure: time bars split into "bottleneck parts" and "other parts" for 1 PE, for a naïve implementation on 50 PEs, and for an ideal implementation on 50 PEs.]
Experiment using Counter Program
• Each processor increments a shared counter in parallel
• In C
• Solaris threads & Ultra Enterprise
Implementation with Spinlocks
Each processor executes methods by itself.
Advantage: no need to move “computation” among processors.
Disadvantage: frequent cache misses in reading a bottleneck object (because of cache invalidation by other processors).
[Figure: each processor runs its method directly on the object data of the bottleneck object.]
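The spinlock approach can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes C11 atomics and POSIX threads rather than Solaris threads, and the names `spin_counter_t`, `spin_lock`, and `run_spin_counter` are hypothetical. Every thread runs the increment itself, so the counter's cache line moves between processors on each update.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical bottleneck object: a counter guarded by a test-and-set spinlock. */
typedef struct {
    atomic_flag locked;
    long value;
} spin_counter_t;

typedef struct {
    spin_counter_t *counter;
    long iterations;
} spin_arg_t;

static void spin_lock(atomic_flag *flag) {
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(flag, memory_order_acquire)) {
    }
}

static void spin_unlock(atomic_flag *flag) {
    atomic_flag_clear_explicit(flag, memory_order_release);
}

/* Each thread executes the "method" (the increment) by itself. */
static void *spin_worker(void *p) {
    spin_arg_t *arg = p;
    for (long i = 0; i < arg->iterations; i++) {
        spin_lock(&arg->counter->locked);
        arg->counter->value++;
        spin_unlock(&arg->counter->locked);
    }
    return NULL;
}

enum { SPIN_MAX_THREADS = 64 };

/* Runs nthreads workers and returns the final counter value. */
long run_spin_counter(int nthreads, long iterations) {
    assert(nthreads > 0 && nthreads <= SPIN_MAX_THREADS);
    spin_counter_t counter = { ATOMIC_FLAG_INIT, 0 };
    spin_arg_t arg = { &counter, iterations };
    pthread_t threads[SPIN_MAX_THREADS];
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&threads[i], NULL, spin_worker, &arg);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(threads[i], NULL);
    }
    return counter.value;
}
```

Calling `run_spin_counter(4, 10000)` starts four threads that each perform 10000 increments.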
Implementation with Simple Blocking Locks
The bottleneck object has a queue of “contexts”. The owner dequeues contexts one by one with mutex operations.
Advantage: few cache misses in reading a bottleneck object.
Disadvantage: overheads to move “computation”.
[Figure: non-owners enqueue contexts on the bottleneck object; the owner dequeues them and works on the object data.]
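A simple blocking lock of this kind might look like the sketch below. It is a simplified assumption, not the authors' runtime: a context here carries only an integer argument for a fire-and-forget "add" method (no caller continuation), and the names `bl_object_t`, `bl_invoke_add`, and `bl_run` are hypothetical. Note that the owner takes the mutex once per dequeued context.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical context: one pending invocation of the "add" method. */
typedef struct bl_context {
    struct bl_context *next;
    long arg;
} bl_context_t;

typedef struct {
    pthread_mutex_t mutex;      /* protects owned, head, tail */
    int owned;                  /* nonzero while some thread is the owner */
    bl_context_t *head, *tail;  /* FIFO queue of contexts */
    long value;                 /* object data, touched only by the owner */
} bl_object_t;

static void bl_add(bl_object_t *obj, long arg) { obj->value += arg; }

void bl_invoke_add(bl_object_t *obj, long arg) {
    pthread_mutex_lock(&obj->mutex);
    if (obj->owned) {
        /* Non-owner: enqueue a context and leave the work to the owner. */
        bl_context_t *ctx = malloc(sizeof *ctx);
        assert(ctx != NULL);
        ctx->next = NULL;
        ctx->arg = arg;
        if (obj->tail != NULL) obj->tail->next = ctx; else obj->head = ctx;
        obj->tail = ctx;
        pthread_mutex_unlock(&obj->mutex);
        return;
    }
    obj->owned = 1;  /* become the owner */
    pthread_mutex_unlock(&obj->mutex);
    bl_add(obj, arg);
    for (;;) {
        /* Owner: dequeue one context per mutex acquisition. */
        pthread_mutex_lock(&obj->mutex);
        bl_context_t *ctx = obj->head;
        if (ctx == NULL) {
            obj->owned = 0;  /* queue empty: give up ownership */
            pthread_mutex_unlock(&obj->mutex);
            return;
        }
        obj->head = ctx->next;
        if (obj->head == NULL) obj->tail = NULL;
        pthread_mutex_unlock(&obj->mutex);
        bl_add(obj, ctx->arg);
        free(ctx);
    }
}

typedef struct { bl_object_t *obj; long iterations; } bl_arg_t;

static void *bl_worker(void *p) {
    bl_arg_t *arg = p;
    for (long i = 0; i < arg->iterations; i++) bl_invoke_add(arg->obj, 1);
    return NULL;
}

enum { BL_MAX_THREADS = 64 };

/* Runs nthreads workers and returns the final counter value. */
long bl_run(int nthreads, long iterations) {
    assert(nthreads > 0 && nthreads <= BL_MAX_THREADS);
    bl_object_t obj = { .owned = 0, .head = NULL, .tail = NULL, .value = 0 };
    pthread_mutex_init(&obj.mutex, NULL);
    bl_arg_t arg = { &obj, iterations };
    pthread_t threads[BL_MAX_THREADS];
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&threads[i], NULL, bl_worker, &arg);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) pthread_join(threads[i], NULL);
    pthread_mutex_destroy(&obj.mutex);
    return obj.value;
}
```

Because ownership is only released under the mutex when the queue is empty, every enqueued context is executed before the last owner returns.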
Overview of Our Scheme
• Improvement of simple blocking locks
  – Overheads in simple blocking locks
    - Mutex operations for a queue of contexts
    - Waiting time imposed on an owner for mutex
    - Cache misses in reading contexts
  – Solution
    - Detaching a whole list of contexts from an object
    - Giving higher priority to an owner
    - Prefetching context data
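Our Scheme (Inserting a Context)
When a non-owner invokes a method, a context is inserted into the bottleneck object's list of contexts.
[Figure: a bottleneck object with its owner, non-owners, and a list of contexts (labels A, B, C, D, X, Y, Z), shown before and after a context is inserted.]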
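Our Scheme (Detaching Contexts)
When an owner executes methods, the whole list of contexts is detached from the bottleneck object. The detached contexts are executed in turn without mutex operations for the list.
Many mutex operations by the owner are eliminated.
[Figure: the bottleneck object before and after detachment (labels A, B, C, D, X, Y, Z), with annotations "list detached" and "contexts inserted".]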
Our Scheme (Low-Level Implementation)
The bottleneck object has a one-word area that both the owner and the non-owners update.
Owner (with high priority): updates the area with swap. Detachment always succeeds in constant time.
Non-owners (with low priority): update the area with compare-and-swap. Insertion may fail many times.
The owner no longer has the overhead of waiting time for a mutex.
Why one word? Why a list, not a queue? To make our algorithm lock-free and non-blocking.
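The insert and detach operations can be sketched with C11 atomics. This is an assumption-laden illustration, not the authors' implementation: the one-word area is modeled as an atomic pointer to the head of a singly linked list, ownership acquisition and release are omitted, and the names `context_t`, `insert_context`, `detach_contexts`, and `run_detached` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical context: one pending method invocation. */
typedef struct context {
    struct context *next;
    void (*method)(void *object, long arg);
    long arg;
} context_t;

typedef struct {
    _Atomic(context_t *) pending;  /* the one-word area: head of the list */
    long value;                    /* object data, touched only by the owner */
} object_t;

/* Non-owner: insert with compare-and-swap. The loop retries whenever another
 * thread changed the word in between, so insertion may fail many times. */
void insert_context(object_t *obj, context_t *ctx) {
    context_t *head = atomic_load(&obj->pending);
    do {
        ctx->next = head;  /* head is refreshed by a failed compare-and-swap */
    } while (!atomic_compare_exchange_weak(&obj->pending, &head, ctx));
}

/* Owner: detach the whole list with a single swap. It cannot fail. */
context_t *detach_contexts(object_t *obj) {
    return atomic_exchange(&obj->pending, NULL);
}

/* Owner: execute detached contexts in turn with no atomic operations on the
 * list. Returns the number of contexts executed. */
int run_detached(object_t *obj, context_t *list) {
    int executed = 0;
    while (list != NULL) {
        context_t *next = list->next;
        list->method(obj, list->arg);
        list = next;
        executed++;
    }
    return executed;
}

/* Example method for a counter object. */
void add_method(void *object, long arg) {
    ((object_t *)object)->value += arg;
}
```

Pushing at the head needs only one word to be updated atomically, which is one way to read the slide's "why a list, not a queue": a FIFO queue would need both a head and a tail to stay consistent.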
Compile-time Optimizations
• Prefetching context data
  – While one detached context is executed, another context is prefetched
  – The number of cache misses in reading contexts is reduced
• Assigning object data to registers
  – Object data is passed on registers
This processing is realized implicitly by the compiler and runtime of a concurrent OO language.
[Figure: schematic of the detached contexts.]
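A hand-written approximation of both optimizations is sketched below. It assumes the GCC/Clang builtin `__builtin_prefetch` (the slides do not say which instruction the authors' compiler emits), and the names `pf_context_t` and `run_with_prefetch` are hypothetical. The object data is held in a local variable for the whole loop so that the compiler can keep it in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context for an "add" method on a counter object. */
typedef struct pf_context {
    struct pf_context *next;
    long arg;
} pf_context_t;

/* Executes a detached list of contexts. `value` is the object data, passed in
 * and returned by value so it can stay in a register during the loop. */
long run_with_prefetch(long value, pf_context_t *list) {
    while (list != NULL) {
        pf_context_t *next = list->next;
#if defined(__GNUC__) || defined(__clang__)
        if (next != NULL) {
            /* Start loading the next context while this one is executed. */
            __builtin_prefetch(next, 0, 3);
        }
#endif
        value += list->arg;  /* the method body */
        list = next;
    }
    return value;
}
```

The caller would write the returned value back to the object once, after the whole list has been processed.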
Experimental Results (1)
Experimental Results (2)
The main explanation ends here.
Other Interesting Facts
• Waiting time for mutex is very large
  – 70% of owner’s execution time
• Our scheme gives good performance also on uniprocessor
  – spinlock: 641 msec
  – simple blocking lock: 1025 msec
  – our scheme: 810 msec
  (the execution time of a simple counter program)
Examples of Bottlenecks
• MT-unsafe libraries
  – Many libraries assume single-threaded use
• I/O calls
  – printf, etc.
• Stub objects in distributed systems
  – One representative object is responsible for all communication in a site
• Shared global variables
  – e.g., counters to collect statistics information
Limitations
• Our scheme may use large memory
  – Non-owners create many contexts
• Our scheme does not guarantee FIFO scheduling of methods in an object
  – Simple solution is reversing a detached list
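Reversing a detached list is a standard in-place operation on a singly linked list. The sketch below assumes that contexts are pushed at the head, so a detached list starts with the most recently inserted context; the names `rv_context_t` and `reverse_contexts` are hypothetical. It restores insertion order within one detached batch only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context with an id recording insertion order. */
typedef struct rv_context {
    struct rv_context *next;
    int id;
} rv_context_t;

/* Reverses a detached list in place and returns the new head.
 * Runs in time linear in the list length and allocates nothing. */
rv_context_t *reverse_contexts(rv_context_t *list) {
    rv_context_t *reversed = NULL;
    while (list != NULL) {
        rv_context_t *next = list->next;
        list->next = reversed;
        reversed = list;
        list = next;
    }
    return reversed;
}
```

The owner would call this once per detachment, before executing the contexts in turn.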
Future Work
• Solving a potential problem in memory use
  – Problem: huge memory may be required for contexts
  – Simple solution: switch to local-based execution when memory for contexts exceeds some threshold
Owner-based execution: more efficient in bottlenecks, using more memory.
Local-based execution: less efficient in bottlenecks, using less memory.
The two are switched dynamically.
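One possible shape for the threshold check is sketched below. The slides describe this only as future work, so everything here is an assumption: a shared byte budget for contexts, with the hypothetical names `context_budget_t`, `budget_try_reserve`, and `budget_release`. A caller that fails to reserve memory would fall back to local-based execution for that invocation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical budget for memory used by contexts. */
typedef struct {
    atomic_size_t in_use;  /* bytes currently reserved for contexts */
    size_t threshold;      /* upper limit before switching modes */
} context_budget_t;

/* Tries to reserve `bytes` for a new context.
 * Returns true: stay with owner-based execution and create the context.
 * Returns false: the threshold would be exceeded, use local-based execution. */
bool budget_try_reserve(context_budget_t *budget, size_t bytes) {
    size_t current = atomic_load(&budget->in_use);
    do {
        if (bytes > budget->threshold || current > budget->threshold - bytes) {
            return false;
        }
    } while (!atomic_compare_exchange_weak(&budget->in_use, &current,
                                           current + bytes));
    return true;
}

/* Returns memory to the budget after a context has been executed and freed. */
void budget_release(context_budget_t *budget, size_t bytes) {
    atomic_fetch_sub(&budget->in_use, bytes);
}
```

The check is made per invocation, so execution switches back to owner-based as soon as enough contexts have been released.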
Achieving the Same Effect in Low-level Languages (e.g., in C)
• Typical behavior of programmers
  – Local-based execution in non-bottlenecks
  – Owner-based execution in bottlenecks
Disadvantages
  – Some bottlenecks emerge dynamically (under the effect of the number of processors and runtime parameters)
  – It is tedious to implement owner-based execution (because the context data structure varies according to objects and methods)