
1 Anshu Raina Suhas Pai Mohit Verma Vikas Goel Yuvraj Patel
Everything you still don’t know about synchronization and how it scales. Anshu Raina, Suhas Pai, Mohit Verma, Vikas Goel, Yuvraj Patel

2 Motivation: why do we need synchronization?
Many-core architectures are now a common phenomenon, and it is challenging to scale systems on them. Synchronization is crucial to ensure coordination and correctness, but it is also a hindrance to scalability. Synchronization is ensured using primitives that rely on hardware artifacts, and the gory details of the hardware are sometimes not known. As a result, it is hard to predict whether an application will scale with respect to a specific synchronization scheme.

3 What are we trying to study?
Study synchronization under scaling: How do various hardware artifacts scale? How do the higher-level synchronization primitives scale? Does the hardware architecture impact performance? What overheads pop up while scaling?

4 How do you synchronize? Basic hardware artifacts
At the bottom are CAS, TAS, FAI, and other atomic instructions. On top of them sit mutexes, semaphores, spin locks, condition variables, and barriers, each serving a different purpose. Every primitive has a structure shared by all threads using it and relies on the hardware artifacts above to update that shared structure atomically (a minimal sketch is shown below).
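To make the last point concrete, here is a minimal sketch (ours, not from the talk) of threads updating a shared counter with a CAS retry loop; the names are illustrative.

```c
/* Minimal sketch (illustrative): all threads share one counter and
 * update it atomically with a CAS retry loop. */
#include <stdatomic.h>

static atomic_long shared_count;     /* the structure shared by all threads */

void add_to_shared(long delta) {
    long old = atomic_load_explicit(&shared_count, memory_order_relaxed);
    long new;
    do {
        new = old + delta;           /* compute the update from a snapshot */
        /* The CAS succeeds only if nobody changed shared_count in between;
         * on failure, old is reloaded and we simply retry. */
    } while (!atomic_compare_exchange_weak(&shared_count, &old, new));
}
```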

5 Synchronization Primitives 101
Basic hardware artifacts:
CAS - uses lock cmpxchg
TAS - uses xchg
FAI - uses lock xadd
Higher-level synchronization primitives:
Mutex - used to ensure mutual exclusion; ownership is crucial (lock/unlock)
Semaphore - a signaling mechanism; ownership is not important (wait/post, i.e. signal)
Spinlock - a locking mechanism, generally used for smaller critical sections
Futex - used for performance: it avoids syscalls to acquire locks, making the syscall only under contention
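A hedged sketch of what these artifacts look like from C: with GCC on x86-64, the following builtins typically compile to the instructions named above (lock cmpxchg, xchg, lock xadd), though exact code generation depends on the compiler.

```c
/* Hedged sketch: GCC __sync builtins on x86-64 typically lower to the
 * instructions named on the slide; exact codegen is compiler-dependent. */
#include <stdio.h>

int main(void) {
    long x = 0;

    /* CAS: compare-and-swap, typically lock cmpxchg. */
    long seen = __sync_val_compare_and_swap(&x, 0, 1);   /* x: 0 -> 1 if x was 0 */

    /* TAS: atomic exchange, typically xchg (xchg with a memory operand is
     * implicitly locked); x86 allows any stored value, not just 1. */
    long prev = __sync_lock_test_and_set(&x, 2);

    /* FAI: fetch-and-increment, typically lock xadd. */
    long before = __sync_fetch_and_add(&x, 1);

    printf("seen=%ld prev=%ld before=%ld x=%ld\n", seen, prev, before, x);
    return 0;
}
```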

6 Experiments Parameters
Different configurations: intra-socket, inter-socket, hyperthreading.
Thread scaling: 1, 2, 4, 8, 14, 28, 56 threads (28 and 56 not run for the intra-socket configuration; 56 not run for the inter-socket configuration).
Varying critical section (CS) size; pseudo-code for the CS: FOR (0 … LOOP_COUNT) { count := count + 1; } where count is volatile. Experiments were done for LOOP_COUNT values of 100, 1000, and 10000.
Layered study: basic hardware artifacts (CAS, FAI, TAS) and higher-level synchronization primitives from the musl library (mutex, semaphore, spinlocks). A runnable version of the CS loop is sketched below.
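A runnable version of the critical-section pseudo-code above might look like this (our sketch; the pthread mutex is a placeholder for whichever primitive is under test, and LOOP_COUNT is one of the three values from the study).

```c
/* Hedged sketch of the measured critical section: LOOP_COUNT increments of a
 * shared volatile counter, protected here by a pthread mutex as a placeholder
 * for whichever primitive is being measured. */
#include <pthread.h>

#define LOOP_COUNT 1000                 /* 100, 1000, or 10000 in the study */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long count;             /* "count is volatile", as on the slide */

void critical_section(void) {
    pthread_mutex_lock(&lock);
    for (long i = 0; i < LOOP_COUNT; i++)
        count = count + 1;
    pthread_mutex_unlock(&lock);
}
```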

7 Platform/Architecture
Intel Xeon E5 v3 (Haswell-EP): 2 sockets, 14 active physical cores per socket (possibly using an 18-core die), hyperthreaded with 2 threads per core.
8 cores, 8 L3 slices, and 1 memory controller are connected to one bi-directional ring; the remaining cores are connected to another bi-directional ring. The ring topology is hidden from the OS in the default configuration.
Cluster-on-Die (COD) splits each processor into 2 clusters, so the topology should have 4 NUMA nodes (but we see only 2 NUMA nodes; enabling COD also does not show 4 NUMA nodes).
Cache coherence mechanism: a MESIF implementation, realized by Caching Agents (CAs) within the L3 slices and Home Agents (HAs) with the memory controller.
Modes: Source mode (enabled by default), Home mode, Cluster-on-Die mode.

8 How Do Atomic Instructions Scale?
How do atomic instructions scale with varying contention? Does the placement of threads affect scaling: single socket or two sockets, with or without hyperthreading? How do different atomic instructions vary in latency? Locks are implemented using these atomics; spin locks, mutexes, and semaphores all use CAS. Does the coherence state of the cache line affect the latency of the operations? (A sketch of how such placements can be pinned follows.)
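How the placements might be realized: a hedged sketch using Linux CPU affinity. The logical CPU numbers below are assumptions for a two-socket Haswell-EP box (socket 0 = CPUs 0-13, socket 1 = CPUs 14-27, hyperthread siblings = 28-55); the real numbering must be checked, e.g. with lscpu.

```c
/* Hedged sketch: pinning threads to realize the placement configurations.
 * CPU numbers are assumptions for this box; verify the topology with lscpu. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single logical CPU. */
static void pin_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Assumed placements for a pair of contending threads:
 *   single socket, no hyperthreading : CPUs 0 and 1
 *   two sockets,   no hyperthreading : CPUs 0 and 14
 *   same physical core, two siblings : CPUs 0 and 28
 */
```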

9 Atomics: latency trends with increasing threads

10 Effect of Thread Placement on Latencies

11 Effect of Thread Placement on a Single CAS
The Modified state is expensive in all cases because the data has to be written back to memory; there is no Forward state involved here. The Exclusive state is expensive for CAS because the line first goes E -> S, then S -> M, and then S -> I for the first cache (the one that originally held it in E). For the other instructions E is cheap, because the data is forwarded and ownership is simply given up. The difference arises because a CAS can fail, whereas TAS and FAI always store some data. (A small timing sketch follows.)
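A hedged sketch of how one might probe this effect: time a single CAS when the line was last written by another pinned core, so it sits Modified in that core's cache. The core numbers, the use of rdtscp, and the single-shot measurement are all simplifications.

```c
/* Hedged sketch: one CAS on a line that is Modified in another core's cache.
 * Compile with -pthread; core numbers 0 and 1 are assumptions. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                      /* __rdtscp */

static volatile long line __attribute__((aligned(64)));   /* its own cache line */

static void pin(int cpu) {
    cpu_set_t s; CPU_ZERO(&s); CPU_SET(cpu, &s);
    pthread_setaffinity_np(pthread_self(), sizeof(s), &s);
}

static void *remote_writer(void *arg) {
    (void)arg;
    pin(1);                                 /* another core on the same socket */
    line = 42;                              /* leave the line Modified there */
    return NULL;
}

int main(void) {
    pthread_t t;
    pin(0);
    pthread_create(&t, NULL, remote_writer, NULL);
    pthread_join(t, NULL);                  /* core 1's cache now owns the line */

    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    __sync_val_compare_and_swap(&line, 42, 43);   /* the timed lock cmpxchg */
    uint64_t end = __rdtscp(&aux);
    printf("one CAS on a remotely-Modified line: %llu cycles\n",
           (unsigned long long)(end - start));
    return 0;
}
```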

12 Insights: Latencies of all instructions increase linearly with increasing contention. Threads placed on hyperthreaded cores give improved performance, and the effects of hyperthreading are more pronounced when threads are on different sockets. CAS latency can be very large if threads are placed across sockets (2x more!). This is significant because CAS is widely used to implement locks (spin locks, mutexes); more on this in subsequent slides!

13 Spinlocks, Mutex and Binary Semaphores
What should I use if my critical section is small? Does the number of threads in my application matter? Does thread placement matter? What are the worst and best performance I can get? What happens to each of my threads?

14 Binary Semaphore/Mutex Behavior as Critical Section Size Changes
Spinlocks are usually used when the critical section is small. What about the binary semaphore and the mutex? See for yourself!

15-19 NHT2s_100 (figure-only slides; NHT2s = non-hyperthreaded, two sockets, critical section LOOP_COUNT 100)

20-27 NHT2s_10000 (figure-only slides; NHT2s = non-hyperthreaded, two sockets, critical section LOOP_COUNT 10000)

28-32 NHT2s_1000 (figure-only slides; NHT2s = non-hyperthreaded, two sockets, critical section LOOP_COUNT 1000)

33 General behavior as Critical Section size changes
We looked at what happens when 14 threads contend at once. When the CS is large, every thread makes the futex syscall once. How are the threads woken up? FCFS! When the CS is small, there is no contention: the CS is over before the other threads are even scheduled. When the CS size is intermediate, some threads make the syscall more than once: because the CS is not big enough, some threads had not been scheduled yet and start contending with the thread that was just woken up. Being woken up FCFS does not imply entering the CS FCFS.

34 Spinlock scaling as the number of threads varies
Spinlocks are mostly used with small critical sections, so how does their performance vary with the number of threads? They do not scale well, even if the CS is small, and are actually worse than the mutex and the binary semaphore. Why? The mutex and semaphore back off. More later! (A backoff sketch is shown below.)
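To make the "back off" point concrete, here is a hedged sketch (not musl's code) of a bare test-and-set spinlock next to one that backs off between attempts; the names and the backoff policy are illustrative.

```c
/* Hedged sketch (not musl's code): a bare test-and-set spinlock versus one
 * that backs off between attempts. */
#include <stdatomic.h>
#include <immintrin.h>                      /* _mm_pause */

typedef struct { atomic_int locked; } spinlock_t;

/* Naive spinlock: every failed attempt is another atomic exchange,
 * which keeps bouncing the cache line between contending cores. */
static void spin_lock_naive(spinlock_t *l) {
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
        ;                                   /* hammer the line until we win */
}

/* Spinlock with backoff: spin on a plain load and pause an increasing
 * number of times after each failed attempt. */
static void spin_lock_backoff(spinlock_t *l) {
    unsigned delay = 1;
    for (;;) {
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
        while (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
            for (unsigned i = 0; i < delay; i++)
                _mm_pause();                /* also polite to the HT sibling */
            if (delay < 1024)
                delay *= 2;
        }
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```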

35 How does mutex/semaphore scale with number of threads?

36 How does mutex/semaphore scale with number of threads?
Note that the CS length is small here; every thread spends exactly the same time in the critical section.

37 Why don’t they scale well? What’s going on inside?
Mutex:
Try a CAS to get the lock.
Fail? Try the CAS again to get the lock.
Fail? Spin for some time if there are other waiters too, then try the CAS on the lock again.
Fail? Register yourself as a waiter and make the syscall to the futex.
(A simplified futex-based sketch follows.)
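A simplified sketch of a futex-based lock and unlock in the spirit of these steps; it follows the well-known three-state futex mutex (0 = unlocked, 1 = locked, 2 = locked with possible waiters), omits the spinning phase, and is not musl's actual code.

```c
/* Hedged sketch: a minimal futex-based mutex, not musl's implementation. */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int state; } futex_mutex_t;

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void futex_mutex_lock(futex_mutex_t *m) {
    int expected = 0;
    /* Fast path: one CAS, 0 -> 1. */
    if (atomic_compare_exchange_strong(&m->state, &expected, 1))
        return;
    /* Slow path: mark the lock contended and sleep in the kernel. */
    for (;;) {
        if (atomic_exchange(&m->state, 2) == 0)
            return;                          /* grabbed it while marking */
        futex(&m->state, FUTEX_WAIT, 2);     /* sleep while it looks contended */
    }
}

static void futex_mutex_unlock(futex_mutex_t *m) {
    /* If anyone may be waiting (state was 2), wake one thread. */
    if (atomic_exchange(&m->state, 0) == 2)
        futex(&m->state, FUTEX_WAKE, 1);
}
```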

38 Why don’t they scale well? What’s going on inside?
Semaphore:
Check the semaphore to see if you can enter the critical section.
Fail? Try a CAS in a loop until you successfully update the semaphore.
Fail? Spin for some time if there are other waiters too, then try the CAS to update the semaphore again.
Fail? Register yourself as a waiter, CAS to update the semaphore, and make the syscall to the futex.
(A simplified futex-based sketch follows.)
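And a correspondingly simplified sketch of a futex-based semaphore wait and post; illustrative only, since musl's real sem_wait/sem_post also track waiters and handle timeouts and cancellation.

```c
/* Hedged sketch: a futex-based semaphore wait/post, loosely following the
 * steps above; waiter counts, timeouts, and cancellation are omitted. */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int value; } futex_sem_t;

static long sem_futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void futex_sem_wait(futex_sem_t *s) {
    for (;;) {
        int v = atomic_load(&s->value);
        /* Value positive: take one unit with a CAS loop. */
        while (v > 0) {
            if (atomic_compare_exchange_weak(&s->value, &v, v - 1))
                return;                      /* we may enter the critical section */
            /* CAS failed: v now holds the fresh value, retry while positive. */
        }
        /* Value is zero: sleep until a post changes it. */
        sem_futex(&s->value, FUTEX_WAIT, 0);
    }
}

static void futex_sem_post(futex_sem_t *s) {
    atomic_fetch_add(&s->value, 1);
    sem_futex(&s->value, FUTEX_WAKE, 1);     /* wake one waiter, if any */
}
```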

39 Why don’t they scale well? What’s going on inside?

40 Why don’t they scale well? What’s going on inside?

41 How does the behavior change as thread placement varies?
(For 14 threads.) Mutex: percentage of threads completing at each stage. Stages: 1st CAS, 2nd CAS, 3rd CAS, while-loop trylock, syscall, spin, didn't complete; the last column is the number of syscalls.
NHT1s_1000: 21.43, 78.57, 1, 7-1/4-2
NHT2s_1000: 14.29, 7.14, 4-1/6-2/1-3
HT_1000: 64.29, 3, 6-1/3-2
NHT1s_100: 85.71, 2
NHT2s_100: 42.86
HT_100:
The semaphore shows the same behavior. Most threads block in the syscall even after spinning, so it might make sense not to spin in the mutex/semaphore when the critical section is large.

42 How does the behavior change as thread placement varies?
Spinlock: maximum spin count by configuration and CS size.
NHT1s, CS 1000: 3719
NHT2s, CS 1000: 4275
HT, CS 1000: 3242
NHT1s, CS 100: 50
NHT2s, CS 100: 151
HT, CS 100: 28
The mutex and semaphore show the same behavior.

43 Variation of max and min overheads as the number of threads varies
Semaphore wait latency for critical section size 1000

44 Variation of max and min overheads as the number of threads varies
Semaphore wait latency for critical section size 1000

45 Variation of max and min overheads as the number of threads varies
For 14 threads, the semaphore wait latency across sockets is worse. For smaller numbers of threads, the inter-socket wait latency can be smaller: the timeline shows that threads across sockets are scheduled late, resulting in less contention. If threads are on hyperthreaded cores, the wait latency can be smaller compared to the wait latency of threads on non-hyperthreaded cores. The behavior of the mutex is similar to that of the semaphore.

46 How do mutexes, spinlocks and binary sems compare with each other?
Mutexes and binary semaphores have similar latency (critical section: LOOP_COUNT of 100).

47 How do mutexes, spinlocks and binary sems compare with each other?
Do not use spin locks for a small CS if there are lots of threads (critical section: LOOP_COUNT of 100).

48 What about semaphore post & mutex unlock?
Post/unlock’s latency increases linearly with scale

49 Other Observations: locks are handed to threads in clusters
In the inter-socket experiments, the lock is acquired by threads belonging to the same socket in bursts (3-4 threads in one go).

50 Conclusion Synchronization is hard
Basic hardware artifacts are closely tied to the software synchronization primitives. Inter-socket performance is usually worse than same-socket and hyperthreaded placements. To get the best performance from software, you should know everything about the architecture. But if you have a Haswell-EP machine, use our slides!

51 Backup?

52 Variation of max and min overheads as thread placement varies
The worst-case semaphore wait latency is worse for the inter-socket configuration.

53 How does the behavior change as thread placement varies?
Mutex: percentage of threads completing at each stage (1st CAS, 2nd CAS, 3rd CAS, while-loop trylock, syscall, spin, didn't complete) and number of syscalls.
NHT1s_1000: 21.43, 78.57, 1, 7-1/4-2
NHT2s_1000: 14.29, 7.14, 4-1/6-2/1-3
HT_1000: 64.29, 3, 6-1/3-2
NHT1s_100: 85.71, 2
NHT2s_100: 42.86
HT_100:
Semaphore: percentage of threads completing at each stage (1st CAS, TRY, spin, didn't complete) and number of syscalls.
NHT1s_1000: 1, 8-1/3-2
NHT2s_1000: 9-1/3-2
HT_1000: 9-1/2-2
NHT1s_100:
NHT2s_100:
HT_100:
Spinlock: maximum spin count.
NHT1s_1000: 3719
NHT2s_1000: 4275
HT_1000: 3242
NHT1s_100: 50
NHT2s_100: 151
HT_100: 28
The mutex and semaphore show the same behavior. Does it make sense to spin in the mutex/semaphore?

