Lock Behaviour Characterization of Commercial Workloads
Jichuan Chang, Xidong Wang
CS757, 05-07-2002

Slide 2: Outline
Motivation
Methods
Results
Speculative Lock Elision
Issues
Conclusions

Slide 3: Motivation
Understanding the synchronization behavior of commercial workloads (OLTP, Apache, SpecJBB)
Identifying opportunities for Speculative Lock Elision (performance, ease of programming)

Slide 4: Questions to Answer
Lock-related statistics
 –Can hardware identify critical sections?
 –Critical section size
 –Lock-free section size
 –Amount of lock contention
Hardware optimizations by speculation
 –Context switching implications
 –Resource requirements
Other issues
 –Realistic timing model
 –Other synchronization (reader/writer, etc.)
Diagram labels: Lock-free section, Critical section, Contention (spin/wait) time

Slide 5: Methods
Benchmarks
 –OLTP, Apache, JBB, Barnes (for comparison)
Full system simulation (tracing) using Simics
 –Simple timing model: Simics tracer
 –Ruby timing model: Simics + Ruby
 –Using #instr (not #cycle) as the measurement unit
 –Set cpu_switch_time to 1, disable STC
Validating our approach
 –Using micro-benchmarks to compare our stats with the results reported by kernel tools (lockstat)
 –Tracing into disassembly code (kernel/user)

Slide 6: Lock Identification
Basic idea [from SLE]
 –Lock acquisition must use one atomic instruction.
 –Silent store pair: taken as a pair, the stores in the lock acquisition and release operations are silent.
SPARC v9 atomic instructions
 –ldstub, swap, casa (compare-and-swap)
Example (OLTP): ldstub [%o0 + %g0], %o4 brnz,pn %o4, stbar … stb %g0, [%o0 + 12]
 –Values: 0x0->0xff … 0xff->0x0
Example (JBB): casa [%l2] 128,%g4,%g3 … … … casa [%l2] 128,%l0,%g4
 –Values: 0x1->0x8410f8bc … … … 0x8410f8bc->0x1

Slide 7: Lock Identification Algorithm
Start with an atomic instruction
 –that writes back a different value to the lock
 –otherwise the lock acquisition was unsuccessful
Examine each following store made by the same CPU
Until we meet a normal store
 –that completes the silent store pair
 –usually with the value of 0x0
Other completion patterns
 –Self-release (by the same CPU): using an atomic instruction, pair-silently (JBB); or using an atomic instruction, not pair-silently
 –Cross-release (by a different CPU) using an atomic instruction
 –Removed: locks whose release can't be observed (16K limited window)
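The identification steps above can be sketched over a simplified write trace. This is only an illustration, not the authors' tool: the `Event` record, the global `icount` field, the pattern names, and the reading of the 16K window as an instruction count are all assumptions, and the sketch tracks candidate locks by address rather than by CPU. It recognises releases that restore the pre-acquisition value (normal store, self-release by atomic, cross-release) and does not attempt the non-pair-silent case.

```python
from dataclasses import dataclass

WINDOW = 16 * 1024  # assumption: the 16K window is counted in instructions


@dataclass
class Event:
    cpu: int     # CPU that issued the write
    icount: int  # hypothetical global instruction count at this event
    kind: str    # "atomic" (ldstub/swap/casa) or "store" (normal store)
    addr: int    # address written
    old: int     # value in memory before the write
    new: int     # value in memory after the write


def find_critical_sections(trace):
    """Return (addr, cpu, size_in_instructions, pattern) per identified lock."""
    held = {}      # addr -> Event that acquired the candidate lock
    sections = []
    for ev in trace:
        acq = held.get(ev.addr)
        # Drop candidates whose release was not observed within the window.
        if acq is not None and ev.icount - acq.icount > WINDOW:
            del held[ev.addr]
            acq = None
        if acq is None:
            # An atomic that changes the value is a candidate acquisition;
            # an atomic that writes back the same value failed to acquire.
            if ev.kind == "atomic" and ev.new != ev.old:
                held[ev.addr] = ev
            continue
        if ev.new != acq.old:
            continue  # this write does not restore the pre-acquisition value
        # The write restores the old value: the silent store pair is complete.
        if ev.cpu != acq.cpu:
            pattern = "cross-release"
        elif ev.kind == "store":
            pattern = "store-release"
        else:
            pattern = "atomic-release"
        sections.append((ev.addr, acq.cpu, ev.icount - acq.icount, pattern))
        del held[ev.addr]
    return sections
```

Feeding it an ldstub/stb pair (0x0->0xff … 0xff->0x0) yields a "store-release" section, while a casa/casa pair (0x1->0x8410f8bc … 0x8410f8bc->0x1) yields an "atomic-release" section.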

Slide 8: Lock Frequency (figure)

Slide 9: Execution Phase Breakdown (figure)

Slide 10: Critical Section Size (figure)

Slide 11: Lock-free Section Size (figure)

Slide 12: Timing Models
Adding Ruby doesn't change the size of critical sections and lock-free sections, but removes lock contention.
Why?
 –"Shrinking" caused by less frequent memory accesses within critical sections
 –or a simulation effect?
Guess: more shrinking using Ruby and Opal
Figure panels: Simple Timing, Ruby Timing

Slide 13: Lock Contention
Chart labels: 46%236%70%
Waiting: from the first try to successful acquisition
Spinning: ignores contention events that have been waiting for more than 4K instructions.

Slide 14: Distinguishing "wait" and "spin"
Why bother?
 –A very few long-waiting events make a big difference in the percentage of wasted instructions
Easy if we can identify thread switching
 –But the identification is not easy
Treat as waiting if spinning for too many instructions
 –Using 4096 instructions as the limit
 –90+% of contentions are shorter than 4K instr
 –It makes sense for different timing models.
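The wait/spin split can be illustrated with a short sketch. It assumes each contention event is available as a duration in instructions (first try to successful acquisition); the function name, the dictionary keys, and the input numbers in the usage note are hypothetical, and only the 4096-instruction limit comes from the slide.

```python
SPIN_LIMIT = 4096  # limit from the slide: longer contentions count as waiting only


def contention_summary(durations, total_instructions):
    """Summarise contention durations (in instructions) for one workload."""
    # Waiting: every contention, from the first try to successful acquisition.
    waiting = sum(durations)
    # Spinning: ignore contentions that lasted more than SPIN_LIMIT instructions.
    short = [d for d in durations if d <= SPIN_LIMIT]
    spinning = sum(short)
    return {
        "waiting_pct": 100.0 * waiting / total_instructions,
        "spinning_pct": 100.0 * spinning / total_instructions,
        "short_fraction": len(short) / len(durations) if durations else 0.0,
    }
```

With made-up durations of 100, 200 and 50000 instructions, the single long event contributes almost all of the waiting total but none of the spinning total, which is the effect the slide describes.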

Slide 15: Lock Contention – Most Contended Locks (figure)

Slide 16: SLE on Commercial Workloads
Context switching (later)
Buffering requirement: not much
 –Small critical sections dominate
 –Except for Apache user locks (1-8K)
 –Single shared buffer among threads on the same CPU
Possible performance gain
 –Not big if only counting the number of instructions (1 - 6%): critical section size is already small and contention is already infrequent
 –Can be larger if lock spinning latency increases
 –Can be smaller if less lock contention happens (as in the Ruby case)
Must throttle speculation (to avoid unnecessary rollbacks)

Slide 17: Context Switch
Why bother?
 –Needed to precisely quantify the number of instructions spent on lock waiting (process and thread switching)
 –Needed to correctly implement speculative lock elision (process switching only)
Process switching identification
 –Marker: demap TLB on context switch
 –Apache (100 transactions, CPU #3): average ~210K instructions (max ~360K, min ~160K)
 –Process switches are infrequent; performance implication negligible
Thread switching identification is hard
 –No simple patterns to observe, no feedback to validate assumptions
 –Not a good idea to provide a separate buffer for each thread on a single processor: hard to detect conflicts and thread switches, and many buffers are needed.

Slide 18: Other Synchronization Algorithms
Hard to recognize complex synchronization
 –Barriers, reader/writer locks, etc.
Mutual exclusion implementations are composed of small critical sections
 –pthread_mutex_lock(&lock) acquires 3 locks
 –Reader/writer locks use locks to maintain data structures (reader/writer queues, number of current readers, etc.)
Diagram: writer_enter() … writer_exit(); serialized execution (maintained by the synch. algorithm); HW only sees two small critical sections
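A minimal sketch of why hardware sees only two small critical sections: the writer side of a lock built on one internal mutex. The class `WriterLock`, its fields, and the helper `run_writers` are hypothetical names for illustration, not the thread library's implementation. The internal mutex is held only briefly inside `writer_enter()` and `writer_exit()`; the serialized region in between runs with no mutex held, so its exclusivity is maintained by the algorithm's bookkeeping.

```python
import threading


class WriterLock:
    """Hypothetical writer-side lock built on one small internal mutex."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._cond = threading.Condition(self._mutex)
        self._writer_active = False

    def writer_enter(self):
        with self._cond:                 # small critical section 1
            while self._writer_active:
                self._cond.wait()        # releases the mutex while blocked
            self._writer_active = True

    def writer_exit(self):
        with self._cond:                 # small critical section 2
            self._writer_active = False
            self._cond.notify()


def run_writers(n_threads, n_iters):
    """Run writers that update shared state between enter and exit."""
    lock = WriterLock()
    shared = {"count": 0}

    def worker():
        for _ in range(n_iters):
            lock.writer_enter()
            shared["count"] += 1         # serialized region, mutex not held
            lock.writer_exit()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

A trace-based identifier that looks for atomic acquire and silent release pairs would report the two short mutex sections here and miss the longer serialized region between them.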

Slide 19: Conclusion
Commercial workloads lock characterization
 –Small critical sections dominate
 –Infrequent lock contention
User and kernel code behave differently
 –Kernel locks can't be ignored
 –(Kernel) contended PCs are predictable
Performance improvements
 –SLE won't help as much

Slide 20: Thank You! Questions?

Slide 21: Backup Slides
Thread switching details
Critical section size using Ruby timing model
Sparc Atomic Instructions
Misc Issues
Acknowledgement

Slide 22: Thread Switch Identification
User thread scheduling
 –Disassemble the user thread library and observe execution of the scheduling methods (_disp, _switch). Not always possible!
Kernel thread scheduling
 –Involves a set of interleaved method invocations (resume, disp, swtch, _resume_from_idle..). Hard to identify the starting and ending points of a thread switch
 –Impossible to identify a kernel thread switch by only observing register window swaps, since they also happen in user thread switches
 –No feedback from the OS to validate our assumptions
Methodology & preliminary observations
 –Disassemble kernel code to build a VA-to-kernel-method map.
 –Observe the method control flow in the Simics trace.
 –resume may indicate a kernel thread switch
 –user_rtt may indicate a user-level thread switch.
Conclusion: thread switch identification is a hard, unresolved issue

Slide 23: Critical Section Size (Ruby) (figure)

Slide 24: Sparc Atomic Instructions
ldstub
 –Writes all 1s into a byte
swap
 –Swaps the value of the reg and the mem location
Compare-and-swap
 –Swaps if (value in the 1st reg == value in mem)
membar/stbar
 –Usually follows such atomic instructions
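The data effects of these three instructions can be modelled in a few lines. This is a sketch of the value semantics only: memory is a plain Python dict, nothing here is atomic, and the function names and argument order are illustrative rather than SPARC syntax.

```python
def ldstub(mem, addr):
    """Load the byte at addr, then store all 1s (0xff) into it."""
    old = mem[addr]
    mem[addr] = 0xFF
    return old


def swap(mem, addr, reg):
    """Exchange a register value with the memory location; return the old value."""
    old = mem[addr]
    mem[addr] = reg
    return old


def cas(mem, addr, expected, new):
    """Store `new` only if memory equals `expected`; always return the old value."""
    old = mem[addr]
    if old == expected:
        mem[addr] = new
    return old


def try_acquire(mem, addr):
    """ldstub-style lock attempt: succeeds only if the byte was 0x0."""
    return ldstub(mem, addr) == 0x0
```

A second `try_acquire` on the same byte returns False and writes 0xff over 0xff, which is the "writes back the same value" case that marks an unsuccessful acquisition in the lock identification algorithm.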

Slide 25: Misc.
Why is Apache "strange"?
 –Locks are more frequent, with few user locks (1-2%)
 –Large percentage of critical section instructions
Nested locks
Intertwined locks
Critical sections in Barnes are more clustered
Buffer size ≤ 2^9 * 30% * 1/3 ≈ 51, so 64 blocks suffice
 –The same as SLE

Slide 26: Acknowledgement
Project suggested by Prof. Mark Hill
 –Guiding and supporting
Lots of discussion with and help from
 –Min Xu, our TA
 –Carl Mauer, Multifacet simulator expert
 –Ravi Rajwar, SLE paper author
