Jeremy Denham April 7, 2008
Outline: Motivation; Background / Previous work; Experimentation; Results; Questions
Motivation: modern processor design trends are primarily concerned with the multi-core design paradigm. We are still figuring out what to do with multi-core chips. They call for a different way of thinking about “shared-memory multiprocessors.” Distributed apps? Either way, synchronization will be important.
Background: “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors,” Mellor-Crummey & Scott. Scalable, busy-wait synchronization algorithms. No memory or interconnect contention. O(1) remote references per use of each mechanism. Covers spin locks and barriers.
Spin locks: “spin” on the lock by busy-waiting until it is available. Typically involves “fetch-and-Φ” operations, which must be atomic!
Simple test-and-set lock: “test-and-set” needs processor support to make it atomic. “Fetch-and-store” also works; xchg does this on x86. Loop until the lock is acquired. Expensive! The lock word is accessed frequently, too, which loads the interconnection network.
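The test-and-set loop on this slide can be sketched with C11 atomics. This is a minimal illustration, not the code used in the experiments: the type and function names (`tas_lock`, `tas_acquire`, and so on) are hypothetical, and it assumes a compiler with `<stdatomic.h>`. On x86, compilers typically lower the atomic exchange to `xchg`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* All names here are hypothetical, chosen for illustration. */
typedef struct {
    atomic_bool held;   /* true while some thread owns the lock */
} tas_lock;

void tas_init(tas_lock *l) {
    atomic_init(&l->held, false);
}

/* One fetch-and-store: write true and get the previous value back
 * atomically. The lock is ours only if the previous value was false. */
bool tas_try_acquire(tas_lock *l) {
    return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

/* Busy-wait: every failed attempt is another atomic write to the shared
 * word, which is what makes this lock expensive under contention. */
void tas_acquire(tas_lock *l) {
    while (!tas_try_acquire(l)) {
        /* spin */
    }
}

void tas_release(tas_lock *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

A thread brackets its critical section with `tas_acquire(&lock)` and `tas_release(&lock)`; nothing in this lock orders the waiters.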
Ticket lock: can reduce fetch-and-Φ ops to one per lock acquisition, with a FIFO service guarantee. Two counters: requests and releases. fetch_and_increment the request counter, then wait until the release counter reflects your turn. Still problematic…
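The two-counter scheme above might look like the following in C11. It is a sketch under assumptions: the names (`ticket_lock`, `next_ticket`, `now_serving`) are made up for this example, and the counters are plain unsigned integers that are allowed to wrap around.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; the slide only calls these "requests" and "releases". */
typedef struct {
    atomic_uint next_ticket;   /* request counter */
    atomic_uint now_serving;   /* release counter */
} ticket_lock;

void ticket_init(ticket_lock *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* The single fetch-and-increment per acquisition: take a ticket. */
unsigned ticket_take(ticket_lock *l) {
    return atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);
}

/* A read-only check, so spinning does not issue further atomic updates. */
bool ticket_is_turn(ticket_lock *l, unsigned ticket) {
    return atomic_load_explicit(&l->now_serving, memory_order_acquire) == ticket;
}

void ticket_acquire(ticket_lock *l) {
    unsigned ticket = ticket_take(l);
    while (!ticket_is_turn(l, ticket)) {
        /* spin: every waiter polls the same now_serving word */
    }
}

/* Advance the release counter, admitting the next ticket in FIFO order. */
void ticket_release(ticket_lock *l) {
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```

All waiters still poll one shared location, which is the remaining problem the slide alludes to.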
T.E. Anderson's queue lock: incoming processes put themselves in the queue. The lock holder hands off the lock to the next in the queue. Faster than the ticket lock, but uses more space.
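One common way to realize this hand-off is an array of per-waiter flags indexed by a fetch-and-increment counter. The sketch below assumes that array-based form; the names and the `MAX_WAITERS` bound are hypothetical, and a real implementation would pad each flag to its own cache line, which is omitted here for brevity.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed upper bound on simultaneous contenders. A power of two keeps the
 * slot arithmetic correct when the unsigned counter wraps around. */
#define MAX_WAITERS 8

typedef struct {
    atomic_bool can_enter[MAX_WAITERS];  /* one flag per queue slot */
    atomic_uint next_slot;               /* where the next arrival queues */
} anderson_lock;

void anderson_init(anderson_lock *l) {
    for (unsigned i = 0; i < MAX_WAITERS; i++)
        atomic_init(&l->can_enter[i], i == 0);  /* slot 0 starts with the lock */
    atomic_init(&l->next_slot, 0);
}

/* Join the queue and return the slot this caller will spin on. */
unsigned anderson_enqueue(anderson_lock *l) {
    return atomic_fetch_add_explicit(&l->next_slot, 1, memory_order_relaxed)
           % MAX_WAITERS;
}

bool anderson_ready(anderson_lock *l, unsigned slot) {
    return atomic_load_explicit(&l->can_enter[slot], memory_order_acquire);
}

/* Returns the slot, which the caller must pass back to anderson_release. */
unsigned anderson_acquire(anderson_lock *l) {
    unsigned slot = anderson_enqueue(l);
    while (!anderson_ready(l, slot)) {
        /* spin on this caller's own slot only */
    }
    return slot;
}

/* Reset our slot for reuse, then hand the lock to the next slot in line. */
void anderson_release(anderson_lock *l, unsigned slot) {
    atomic_store_explicit(&l->can_enter[slot], false, memory_order_relaxed);
    atomic_store_explicit(&l->can_enter[(slot + 1) % MAX_WAITERS], true,
                          memory_order_release);
}
```

The extra space the slide mentions is visible here: each lock carries one flag per potential waiter.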
MCS lock: FIFO guarantee. Local spinning! Small constant amount of space. Cache coherence a non-issue.
Each processor allocates a record containing a next link and a boolean flag, adds it to the queue, and spins locally on its own flag. The owner passes the lock to the next user in the queue as necessary.
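The record-and-queue structure just described can be sketched in C11 as follows. This is an illustrative version written for these notes, not the ported code: the names `mcs_node` and `mcs_lock` are hypothetical, and it assumes fetch-and-store and compare-and-swap are both available through `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-processor queue record: a next link and a boolean flag. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;            /* true while this waiter must spin */
} mcs_node;

/* The lock itself is a single pointer to the tail of the queue. */
typedef struct {
    _Atomic(mcs_node *) tail;
} mcs_lock;

void mcs_init(mcs_lock *l) {
    atomic_init(&l->tail, NULL);
}

void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Fetch-and-store: append our record and learn who was ahead of us. */
    mcs_node *pred = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (pred != NULL) {
        atomic_store_explicit(&pred->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire)) {
            /* spin locally on our own record */
        }
    }
}

void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* Someone is mid-enqueue; wait for them to link in behind us. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) {
            /* spin */
        }
    }
    /* Pass the lock to the next user in the queue. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Each thread supplies its own `mcs_node` (for example a stack variable) and passes the same node to `mcs_acquire` and `mcs_release`. The extra bookkeeping compared with a plain test-and-set lock is the kind of procedural overhead that shows up at small core counts.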
Barriers: a mechanism for “phase separation.” Block processes from proceeding until all others have reached a checkpoint. Designed for repetitive use.
Centralized barrier: “local” and “global” sense. As each processor arrives, it reverses its local sense and signals its arrival. If it is the last to arrive, it reverses the global sense; else it spins. Lots of spinning…
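The arrival steps above can be sketched as a sense-reversing barrier in C11. The names are hypothetical and the arrival signal is assumed to be a shared countdown; each thread keeps its own `local_sense`, initially `false`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; a sketch of a centralized sense-reversing barrier. */
typedef struct {
    atomic_uint remaining;      /* processors still to arrive this episode */
    atomic_bool global_sense;   /* flipped by the last arrival */
    unsigned    nprocs;
} central_barrier;

void central_init(central_barrier *b, unsigned nprocs) {
    atomic_init(&b->remaining, nprocs);
    atomic_init(&b->global_sense, false);
    b->nprocs = nprocs;
}

/* Reverse the local sense and signal arrival. Returns true if the caller
 * was the last to arrive, in which case it has released everyone. */
bool central_arrive(central_barrier *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub_explicit(&b->remaining, 1, memory_order_acq_rel) == 1) {
        /* Last arrival: reset the count for reuse, then flip global sense. */
        atomic_store_explicit(&b->remaining, b->nprocs, memory_order_relaxed);
        atomic_store_explicit(&b->global_sense, *local_sense,
                              memory_order_release);
        return true;
    }
    return false;
}

bool central_released(central_barrier *b, bool local_sense) {
    return atomic_load_explicit(&b->global_sense, memory_order_acquire)
           == local_sense;
}

void central_wait(central_barrier *b, bool *local_sense) {
    if (!central_arrive(b, local_sense)) {
        while (!central_released(b, *local_sense)) {
            /* every waiter spins on the same global flag */
        }
    }
}
```

Because the sense alternates between episodes, the same barrier object can be reused immediately without reinitialization.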
Dissemination barrier: barrier information is “disseminated” algorithmically. At each synchronization stage k, processor i signals processor (i + 2^k) mod P, where P is the number of processors. Similarly, processor i continues when it is signaled by processor (i - 2^k) mod P. ⌈log2 P⌉ operations on the critical path, P⌈log2 P⌉ remote operations in total.
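To make the partner arithmetic concrete, the helpers below compute who signals whom at each stage and check, by simulation, that after ⌈log2 P⌉ stages every processor has heard (directly or transitively) from every other. The function names are hypothetical, and the simulation assumes P ≤ 32 so that a 32-bit mask can track who has been heard from.

```c
#include <stdbool.h>

/* Number of stages: the smallest r with 2^r >= p. */
unsigned dissem_rounds(unsigned p) {
    unsigned r = 0;
    while ((1u << r) < p)
        r++;
    return r;
}

/* Processor that i signals at stage k: (i + 2^k) mod P. */
unsigned dissem_signals(unsigned i, unsigned k, unsigned p) {
    return (i + (1u << k)) % p;
}

/* Processor that signals i at stage k: (i - 2^k) mod P, written so the
 * unsigned arithmetic never goes negative. */
unsigned dissem_signaled_by(unsigned i, unsigned k, unsigned p) {
    return (i + p - ((1u << k) % p)) % p;
}

/* Simulate the stages with one bitmask per processor recording whose
 * arrival it knows about. Returns true if everyone ends up fully informed. */
bool dissem_everyone_informed(unsigned p) {
    unsigned heard[32], next[32];
    if (p == 0 || p > 32)
        return false;              /* outside the range this sketch handles */
    for (unsigned i = 0; i < p; i++)
        heard[i] = 1u << i;        /* initially each knows only itself */
    for (unsigned k = 0; k < dissem_rounds(p); k++) {
        for (unsigned i = 0; i < p; i++)
            next[i] = heard[i] | heard[dissem_signaled_by(i, k, p)];
        for (unsigned i = 0; i < p; i++)
            heard[i] = next[i];
    }
    unsigned all = (p == 32) ? ~0u : (1u << p) - 1u;
    for (unsigned i = 0; i < p; i++)
        if (heard[i] != all)
            return false;
    return true;
}
```

With P = 5 there are three stages, so the barrier costs 5 × 3 = 15 remote signals in total while each processor waits through only three of them.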
Tournament barrier: tree-based approach. The outcome of each “match” is statically determined, giving each processor a “role” for each round. The “loser” notifies the “winner,” then drops out. The “winner” waits to be notified, then participates in the next round. The “champion” sets a global flag when the tournament is over. log(P) rounds. Heavy interconnect traffic…
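Because the outcomes are fixed in advance, each processor's role in each round is a pure function of its index. The sketch below shows one way to compute it, assuming processors are numbered from 0, rounds are numbered from 0, and the lower-numbered processor of each pair always wins. The `BYE` and `DROPOUT` roles (no opponent this round, already eliminated) and all names are additions for this illustration.

```c
typedef enum { WINNER, LOSER, BYE, CHAMPION, DROPOUT } role_t;

/* Number of rounds: the smallest r with 2^r >= p. */
unsigned tournament_rounds(unsigned p) {
    unsigned r = 0;
    while ((1u << r) < p)
        r++;
    return r;
}

/* Role of processor i in round k among p processors. */
role_t tournament_role(unsigned i, unsigned k, unsigned p) {
    unsigned step = 1u << k;            /* distance to this round's opponent */
    if (i % (2 * step) == 0) {
        if (i + step >= p)
            return BYE;                 /* opponent index does not exist */
        /* Processor 0 winning the final round is the champion. */
        return (i == 0 && 2 * step >= p) ? CHAMPION : WINNER;
    }
    if (i % (2 * step) == step)
        return LOSER;                   /* notifies i - step, then drops out */
    return DROPOUT;                     /* lost in an earlier round */
}

/* Opponent of processor i in round k; returns i itself when there is none. */
unsigned tournament_opponent(unsigned i, unsigned k, unsigned p) {
    unsigned step = 1u << k;
    switch (tournament_role(i, k, p)) {
    case WINNER:
    case CHAMPION:
        return i + step;
    case LOSER:
        return i - step;
    default:
        return i;
    }
}
```

For P = 4, processors 1 and 3 lose in round 0, processor 2 loses to processor 0 in round 1, and processor 0 is the champion that sets the global flag.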
MCS tree barrier: also tree-based. Local spinning. O(P) space for P processors. (2P – 2) network transactions in total, O(log P) of them on the critical path.
Uses two P-node trees. Each parent holds a “child-not-ready” flag for each child present. When all of its children have signaled arrival, a parent signals its own parent. When the root detects that all of its children have arrived, it signals to the group that it can proceed to the next barrier.
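The shape of the two trees can be written down as index arithmetic. The sketch below assumes a fan-in of 4 for the arrival tree and a fan-out of 2 for the wakeup tree; those values come from the Mellor-Crummey & Scott paper rather than from the slide, and the function names are hypothetical. Counting one signal per tree edge reproduces the (2P – 2) total.

```c
#include <stdbool.h>

#define ARRIVAL_FANIN 4   /* assumed: arrival-tree fan-in from the paper */
#define WAKEUP_FANOUT 2   /* assumed: wakeup-tree fan-out from the paper */

/* Parent of node i in the arrival tree; the root (node 0) has none (-1). */
int arrival_parent(int i) {
    return i == 0 ? -1 : (i - 1) / ARRIVAL_FANIN;
}

/* Does node i have a j-th arrival child among p nodes? A parent keeps a
 * "child-not-ready" flag only for children that are present. */
bool arrival_has_child(int i, int j, int p) {
    return ARRIVAL_FANIN * i + j + 1 < p;
}

/* j-th child of node i in the wakeup tree, or -1 if it does not exist. */
int wakeup_child(int i, int j, int p) {
    int c = WAKEUP_FANOUT * i + j + 1;
    return c < p ? c : -1;
}

/* One signal per edge: arrival edges (child to parent) plus wakeup edges
 * (parent to child). */
int tree_barrier_signals(int p) {
    int n = 0;
    for (int i = 0; i < p; i++) {
        for (int j = 0; j < ARRIVAL_FANIN; j++)
            if (arrival_has_child(i, j, p))
                n++;
        for (int j = 0; j < WAKEUP_FANOUT; j++)
            if (wakeup_child(i, j, p) >= 0)
                n++;
    }
    return n;
}
```

For P = 6, node 0 collects arrivals from nodes 1 to 4 and node 1 from node 5, while wakeups flow 0 → {1, 2}, 1 → {3, 4}, 2 → {5}, for 10 signals in all.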
Experiments were done on BBN Butterfly 1 and Sequent Symmetry Model B machines. BBN: supports up to 256 processor nodes; 8 MHz MC68000. Sequent: supports up to 30 processor nodes; 16 MHz Intel. Most concerned with the Sequent.
Want to extend this work to multi-core machines. Scalability is of limited usefulness (not that many cores). Other factors: shared resources, core load.
Test machine: Intel Centrino Duo T5200 processor. Two cores, 1.60 GHz per core. 2MB L2 cache. Windows Vista. 2GB DDR2 memory.
Evaluate basic and MCS approaches. Simple and complex evaluations. Core pinning. Load ramping.
Code porting: lots of Linux-specific code. Win32 Thread API: esoteric… How to pin a thread to a core? Timing: Win32 μsec-granularity measurement. Surprisingly archaic C code.
Spin lock base code ported. Barriers nearly done. Simple experiments for spin locks done. More complex ones on the way.
Simple spin lock tests. Simple lock outperforms MCS on:
▪ Empty critical section
▪ Simple FP critical section
▪ Single core
▪ Dual core
More procedural overhead for MCS at small scale.
Next steps:
▪ More threads!
▪ More critical section complexity