1 Jeremy Denham April 7, 2008

2
- Motivation
- Background / Previous work
- Experimentation
- Results
- Questions

3
- Modern processor design trends are primarily concerned with the multi-core design paradigm.
- Still figuring out what to do with them
- Different way of thinking about "shared-memory multiprocessors"
- Distributed apps?
- Synchronization will be important.

4
- Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, Mellor-Crummey & Scott, 1991
- Scalable, busy-wait synchronization algorithms
- No memory or interconnect contention
- O(1) remote references per use of each mechanism
- Spin locks and barriers

5
- "Spin" on a lock by busy-waiting until it is available
- Typically involves "fetch-and-Φ" operations
- Must be atomic!

6
- "Test-and-set"
- Needs processor support to make it atomic
- "Fetch-and-store": xchg on x86
- Loop until the lock is acquired
- Expensive! Frequently accessed, too
- Interconnect traffic issues
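
A minimal sketch of the test-and-set lock described above, using C11 atomics to stand in for the x86 xchg instruction (the type and function names are illustrative, not the code used in the experiments):

#include <stdatomic.h>

typedef struct { atomic_flag held; } ts_lock;   /* initialize with ATOMIC_FLAG_INIT */

static void ts_acquire(ts_lock *l) {
    /* Atomic fetch-and-store: swap "held" in and inspect the old value;
       keep looping until the old value shows the lock was free. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* every iteration hits the shared lock word, which is why this is expensive */
}

static void ts_release(ts_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}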

7
- Can reduce fetch-and-Φ ops to one per lock acquisition
- FIFO service guarantee
- Two counters: requests and releases
- fetch_and_increment the request counter
- Wait until the release counter reflects your turn
- Still problematic…
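
This two-counter scheme is the ticket lock (compared against on the next slide). A compact C11 sketch, with field names of my own choosing:

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* the "requests" counter */
    atomic_uint now_serving;   /* the "releases" counter */
} ticket_lock;                 /* zero-initialize both counters */

static void ticket_acquire(ticket_lock *l) {
    /* The single fetch-and-increment per acquisition: take a ticket... */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    /* ...then wait for the release counter to reach our turn.  Everyone still
       spins on the same shared counter, which is why it remains problematic. */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my_ticket)
        ;
}

static void ticket_release(ticket_lock *l) {
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}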

8
- T. E. Anderson's queue lock
- Incoming processes put themselves in the queue
- The lock holder hands the lock off to the next in queue
- Faster than the ticket lock, but more space
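
A sketch of Anderson's array-based queue lock as usually presented (the fixed slot count, missing cache-line padding, and all names are my simplifications); the per-slot flag array is the extra space the slide refers to:

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 64   /* must be at least the number of competing threads */

typedef struct {
    atomic_uint next_slot;                 /* fetch-and-increment to join the queue */
    atomic_bool has_lock[MAX_THREADS];     /* has_lock[0] starts true, the rest false */
} anderson_lock;

static unsigned anderson_acquire(anderson_lock *l) {
    unsigned slot = atomic_fetch_add(&l->next_slot, 1) % MAX_THREADS;
    while (!atomic_load_explicit(&l->has_lock[slot], memory_order_acquire))
        ;                                     /* spin on our own array slot */
    atomic_store(&l->has_lock[slot], false);  /* reset the slot for its next reuse */
    return slot;                              /* caller passes slot back to release() */
}

static void anderson_release(anderson_lock *l, unsigned slot) {
    atomic_store_explicit(&l->has_lock[(slot + 1) % MAX_THREADS], true,
                          memory_order_release);
}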

9
- FIFO guarantee
- Local spinning!
- Small constant amount of space
- Cache coherence a non-issue

10
- Each processor allocates a record:
  - next link
  - boolean flag
- Adds to queue
- Spins locally
- Owner passes lock to next user in queue as necessary
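
This is the MCS list-based queue lock itself. A C11 sketch of the acquire/release protocol; the structure follows the published algorithm, but the exact code and names below are mine:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next;   /* the "next" link */
    atomic_bool locked;              /* the boolean flag: true while this waiter must spin */
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;   /* tail == NULL means the lock is free */

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store(&me->next, (mcs_node *)NULL);
    atomic_store(&me->locked, true);
    mcs_node *prev = atomic_exchange(&l->tail, me);    /* append our record to the queue */
    if (prev != NULL) {
        atomic_store(&prev->next, me);                 /* link in behind the previous waiter */
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;                                          /* spin only on our own flag (local) */
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, (mcs_node *)NULL))
            return;                                    /* nobody was waiting */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                          /* a successor is mid-enqueue; wait for its link */
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);   /* hand the lock over */
}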

11
- Mechanism for "phase separation"
- Block processes from proceeding until all others have reached a checkpoint
- Designed for repetitive use

12
- "Local" and "global" sense
- As a processor arrives:
  - Reverse local sense
  - Signal its arrival
  - If last, reverse global sense
  - Else spin
- Lots of spinning…
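
A C11 sketch of this sense-reversing centralized barrier (the counter-based arrival count and all names are mine; each thread keeps its own local_sense, initially false like global_sense):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;         /* arrivals still expected; initialize to nprocs */
    atomic_bool global_sense;  /* initialize to false */
    int         nprocs;
} central_barrier;

static void central_barrier_wait(central_barrier *b, bool *local_sense) {
    *local_sense = !*local_sense;                      /* reverse local sense */
    if (atomic_fetch_sub(&b->count, 1) == 1) {         /* last to arrive */
        atomic_store(&b->count, b->nprocs);            /* reset for the next episode */
        atomic_store(&b->global_sense, *local_sense);  /* reverse global sense: release everyone */
    } else {
        while (atomic_load(&b->global_sense) != *local_sense)
            ;                                          /* everyone spins on one shared flag */
    }
}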

13
- Barrier information is "disseminated" algorithmically
- At each synchronization stage k, processor i signals processor (i + 2^k) mod P, where P is the number of processors
- Similarly, processor i continues when it is signaled by processor (i - 2^k) mod P
- log(P) operations on the critical path, P log(P) remote operations
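
A simplified dissemination-barrier sketch. To keep it short, monotonically increasing per-stage counters replace the paper's sense-reversing flags (reuse still works because the counters only grow); P, ROUNDS, and all names are illustrative:

#include <stdatomic.h>

#define P      8    /* number of threads (illustrative) */
#define ROUNDS 3    /* ceil(log2(P)) synchronization stages */

static atomic_uint flags[P][ROUNDS];   /* flags[i][k]: signals processor i has received in stage k */

/* Each thread passes its own id i and a private episode counter that starts at 0. */
static void dissemination_barrier(int i, unsigned *episode) {
    (*episode)++;
    for (int k = 0; k < ROUNDS; k++) {
        int partner = (i + (1 << k)) % P;             /* signal processor (i + 2^k) mod P */
        atomic_fetch_add(&flags[partner][k], 1);
        while (atomic_load(&flags[i][k]) < *episode)  /* wait for processor (i - 2^k) mod P */
            ;
    }
}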

14
- Tree-based approach
- Outcome statically determined
- "Roles" for each round:
  - "loser" notifies "winner," then drops out
  - "winner" waits to be notified, participates in next round
  - "champion" sets global flag when over
- log(P) rounds
- Heavy interconnect traffic…

15
- Also tree-based
- Local spinning
- O(P) space for P processors
- (2P – 2) network transactions
- O(log P) network transactions on critical path

16
- Uses two P-node trees
- A "child-not-ready" flag for each child is present in the parent
- When all children have signaled arrival, the parent signals its parent
- When the root detects all children have arrived, it signals the group that it can proceed to the next barrier

17
- Experiments done on BBN Butterfly 1 and Sequent Symmetry Model B machines
- BBN:
  - Supports up to 256 processor nodes
  - 8 MHz MC68000
- Sequent:
  - Supports up to 30 processor nodes
  - 16 MHz Intel 80386
- Most concerned with Sequent

18

19

20

21
- Want to extend to multi-core machines
- Scalability of limited usefulness (not that many cores)
- Shared resources
- Core load

22
- Intel Centrino Duo T5200 processor
  - Two cores
  - 1.60 GHz per core
  - 2 MB L2 cache
- Windows Vista
- 2 GB DDR2 memory

23
- Evaluate basic and MCS approaches
- Simple and complex evaluations
- Core pinning
- Load ramping

24
- Code porting:
  - Lots of Linux-specific code
- Win32 thread API:
  - Esoteric…
  - How to pin a thread to a core?
- Timing:
  - Win32 μsec-granularity measurement
  - Surprisingly archaic C code
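
For reference, a small sketch of the two Win32 pieces in question: pinning the calling thread to a single core with SetThreadAffinityMask and microsecond-scale timing with the performance counter. The API calls are standard Win32; how the actual port wires them into the lock tests is my assumption:

#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Pin the calling thread to core 0 (bit 0 of the affinity mask). */
    if (SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1) == 0) {
        fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }

    /* High-resolution timing via the performance counter. */
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    /* ...the critical-section workload under test would run here... */
    QueryPerformanceCounter(&end);

    double usec = (double)(end.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart;
    printf("elapsed: %.3f microseconds\n", usec);
    return 0;
}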

25
- Spin lock base code ported
- Barriers nearly done
- Simple experiments for spin locks done
- More complex ones on the way

26
- Simple spin lock tests
- The simple lock outperforms MCS on:
  - Empty critical section
  - Simple floating-point critical section
  - Single core
  - Dual core
- More procedural overhead for MCS at small scale
- Next steps:
  - More threads!
  - More critical-section complexity

27

