Extended Memory Semantics for Thread Synchronization
Sheng Li, Ying Zhou
Operating System Progress Report, Nov 1st, 2007

Problems
- Hardware multithreading is no longer a privilege of supercomputing; it is already part of the major microprocessors
  - E.g. Sun Niagara 2 has 64 threads/chip and 256 threads/server
- Concurrency management is one of the biggest challenges in multithreaded systems
  - Key requirement: low-overhead, scalable thread synchronization
- Synchronization mechanisms
  - Atomic primitives (Test-and-Set, Compare-and-Swap, LL-SC): software routines built on them have poor performance and scalability
  - Empty/Full bits: an extension bit on each memory location denotes its empty/full state. Better performance [1], but still not enough

Our Goal
- Solve the synchronization bottleneck by using Extended Memory Semantics (EMS)
  - Better performance and scalability
- Quantify the performance gain of EMS compared to other synchronization mechanisms (e.g. Empty/Full bits)

Extended Memory Semantics
- Memory instructions are characterized by their synchronization behavior
  - Load.ff, Load.fe, Store.xf, Store.ef, Store.xe (f = full, e = empty, x = don't care)
- Memory word: 64 bits of data/metadata plus an extension bit

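The five instructions listed above can be illustrated with a small software model. This is only a sketch under an assumption the slides do not spell out: the first suffix letter is taken to be the state the word must be in before the access, and the second the state it is left in afterwards. The class `EMSWord`, its method names, and `producer_consumer_demo` are hypothetical and are not part of SST or glibc.

```python
import threading

# Assumed encoding of the suffix letters: f = full, e = empty, x = don't care.
FULL, EMPTY, ANY = True, False, None


class EMSWord:
    """Software model of one memory word with a full/empty extension bit."""

    def __init__(self, value=0, full=False):
        self.value = value
        self.full = full
        self._cond = threading.Condition()

    def _wait_for_state(self, required):
        # Block until the extension bit matches the required state (if any).
        if required is not ANY:
            self._cond.wait_for(lambda: self.full == required)

    def load(self, required, after):
        with self._cond:
            self._wait_for_state(required)
            value = self.value
            self.full = after
            self._cond.notify_all()
            return value

    def store(self, value, required, after):
        with self._cond:
            self._wait_for_state(required)
            self.value = value
            self.full = after
            self._cond.notify_all()

    # Assumed reading: <op>.<state required before><state left after>
    def load_ff(self):
        return self.load(FULL, FULL)

    def load_fe(self):
        return self.load(FULL, EMPTY)

    def store_xf(self, value):
        self.store(value, ANY, FULL)

    def store_ef(self, value):
        self.store(value, EMPTY, FULL)

    def store_xe(self, value):
        self.store(value, ANY, EMPTY)


def producer_consumer_demo(n_items=5):
    """Hand items from a producer to a consumer through one EMS word."""
    word = EMSWord()
    received = []

    def consumer():
        for _ in range(n_items):
            received.append(word.load_fe())  # wait for full, leave empty

    t = threading.Thread(target=consumer)
    t.start()
    for i in range(n_items):
        word.store_ef(i)  # wait for empty, leave full
    t.join()
    return received
```

Under this reading, a Store.ef / Load.fe pair behaves like a one-slot producer/consumer channel: neither side needs a separate lock, because the extension bit carries the handoff.
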
EMS handler
- There is no free lunch: the EMS handler has overhead
  - Creating the handler threads
  - Queuing up memory requests and building the data structure

What we have done so far
- Built the EMS model, covering both architecture and OS aspects, in the Structural Simulation Toolkit (SST)
  - SST is a simulation environment for massively lightweight multithreading, developed at Notre Dame and Sandia National Laboratories
- Modified glibc to use EMS, especially the pthread library
- Designed benchmarks for different categories
- Ran the simulations to evaluate EMS performance

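To give a feel for what "a pthread-style lock on top of EMS" could look like, here is a minimal sketch. It is not the modified glibc code, which the slides do not show. It assumes that a full word means "lock available", that acquiring is a Load.fe (wait for full, leave empty), and that releasing is a Store.xf (mark full). `FullEmptyWord`, `EMSMutex`, and `mutex_demo` are hypothetical names.

```python
import threading


class FullEmptyWord:
    """Tiny model of a word's extension bit; full means 'lock available'."""

    def __init__(self):
        self._full = True
        self._cond = threading.Condition()

    def load_fe(self):
        # Assumed Load.fe behavior: wait until full, then leave the word empty.
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False

    def store_xf(self):
        # Assumed Store.xf behavior: mark the word full regardless of state.
        with self._cond:
            self._full = True
            self._cond.notify_all()


class EMSMutex:
    """Hypothetical mutex: lock = Load.fe, unlock = Store.xf."""

    def __init__(self):
        self._word = FullEmptyWord()

    def lock(self):
        self._word.load_fe()

    def unlock(self):
        self._word.store_xf()

    def __enter__(self):
        self.lock()
        return self

    def __exit__(self, *exc):
        self.unlock()
        return False


def mutex_demo(n_threads=4, increments=500):
    """Increment a shared counter under the EMS-style mutex."""
    mutex = EMSMutex()
    counter = [0]

    def worker():
        for _ in range(increments):
            with mutex:
                counter[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

The point of the sketch is that lock and unlock each map to a single memory operation, with waiting handled by whatever services a mismatched extension bit.
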
Tightly Coupled Parallel
- Each thread competes with the others for the only lock before updating the counter
- Very high contention, worst case

Loosely Coupled Parallel
- Each thread competes with the others for locks before updating the counters
- Mild contention

Embarrassingly Parallel
- No contention, no locks

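The three benchmark categories (tightly coupled, loosely coupled, embarrassingly parallel) can be sketched as follows. This is not the benchmark code used in the simulations. It uses ordinary Python locks rather than EMS, and it assumes one lock per counter in the loosely coupled case and a simple round-robin choice of counter. All function names are hypothetical.

```python
import threading


def run_threads(worker, n_threads):
    """Start n_threads copies of worker(tid) and wait for all of them."""
    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def tightly_coupled(n_threads, updates):
    """All threads fight over one lock guarding one counter."""
    lock = threading.Lock()
    counter = [0]

    def worker(tid):
        for _ in range(updates):
            with lock:
                counter[0] += 1

    run_threads(worker, n_threads)
    return counter[0]


def loosely_coupled(n_threads, updates, n_counters):
    """Several counters, each with its own lock (assumed layout)."""
    locks = [threading.Lock() for _ in range(n_counters)]
    counters = [0] * n_counters

    def worker(tid):
        for step in range(updates):
            idx = (tid + step) % n_counters  # spread threads over the locks
            with locks[idx]:
                counters[idx] += 1

    run_threads(worker, n_threads)
    return counters


def embarrassingly_parallel(n_threads, updates):
    """Each thread owns a private counter, so no lock is needed."""
    counters = [0] * n_threads

    def worker(tid):
        for _ in range(updates):
            counters[tid] += 1

    run_threads(worker, n_threads)
    return counters
```

The only thing that changes between the three cases is how many threads can want the same lock at once, which is what separates the worst case from the mild and contention-free cases.
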
Embarrassingly parallel and loosely coupled parallel
- Low synchronization overhead, guaranteed by EMS
- EMS shows very good scalability
- [Figure: synchronization distribution]

Tightly Coupled Parallel
- EMS performs poorly in the worst case
- Most of the threads are occupied with synchronization rather than useful work

The Road Ahead
- Build/complete other synchronization mechanisms (e.g. Empty/Full bits) in SST
- Modify glibc to support the other synchronization mechanisms
- Compare performance between EMS and the other synchronization mechanisms

Thank you! Questions?

Bibliography
[1] Larry Carter, John Feo, Allan Snavely. Performance and Programming Experience on the Tera MTA. PPSC, 1999.

Backup Slides

Lightweight Threads
- Thread context (frame) is 32 double words (256 bytes)
  - Two double words are reserved for the thread status; the other 30 are general-purpose registers
- No other per-thread state, which makes multithreading easy
- Frames are stored in memory (no register file)
  - Registers are aliases for memory locations

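The frame arithmetic above can be checked in a few lines. The 8-byte double word follows from 32 double words being 256 bytes. The placement of the two status words at the start of the frame is an assumption made only for illustration, since the slides do not give the ordering, and `register_address` is a hypothetical helper.

```python
DOUBLE_WORD_BYTES = 8        # 256 bytes / 32 double words
FRAME_DOUBLE_WORDS = 32
STATUS_DOUBLE_WORDS = 2      # reserved for thread status

GP_REGISTERS = FRAME_DOUBLE_WORDS - STATUS_DOUBLE_WORDS
FRAME_BYTES = FRAME_DOUBLE_WORDS * DOUBLE_WORD_BYTES


def register_address(frame_base, reg):
    """Memory address that general-purpose register `reg` aliases.

    Assumes (hypothetically) that the status words occupy the first
    two double words of the frame and the registers follow in order.
    """
    if not 0 <= reg < GP_REGISTERS:
        raise ValueError("register index out of range")
    return frame_base + (STATUS_DOUBLE_WORDS + reg) * DOUBLE_WORD_BYTES
```

Because a register is just an offset into the frame, switching threads under this model amounts to changing the frame base address; there is no register file to save or restore.
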