Practical Reduction for Store Buffers Ernie Cohen, Microsoft Norbert Schirmer, DFKI
problem practical reasoning about imperative code is based on state assertions and invariants such reasoning tacitly assumes sequential consistency (SC) … … but real MP hardware doesn’t provide SC needed: a programming discipline that – guarantees SC – is flexible enough to handle real software – is practical to check
x86/x64 hardware model: TSO FIFO store buffer (SB) between each processor (P) and the (shared, SC) memory – P writes are queued onto its SB – concurrently, writes leave SBs and are applied to memory – a read by P reads from P’s SB if possible; otherwise, it reads from memory (“SB forwarding”) – P can flush its own SB (expensive) note: TSO != “load-acquire, store-release” – a read can move backward past a write to the same location, turning into a read of a constant note: UP TSO machines are SC, but …
TSO is not SC TSO is not SC, because of the delay in writes becoming visible to other processors, e.g. P0: P1: both Ps can complete under TSO, but not under SC (whichever thread writes second gets stuck)
a simple SC discipline make sure that P reads only when P’s SB is empty – writes dirty the SB; flushes clean it – read allowed only when the SB is clean – (lazy caching uses a similar trick to achieve SC) proof of SC: – each P simulates a virtual P (that might fall behind) – virtual P takes a write step when that write hits memory – real and virtual P are in sync on read steps but this discipline isn’t practical – disjoint concurrency shouldn’t require any flushes! idea: distinguish private and shared memory
ownership each location can be either owned (by a unique processor) or unowned each access is volatile or nonvolatile modified discipline: – nonvolatile access requires ownership of the location – volatile writes dirty the SB – volatile reads allowed only when SB is clean simulation proof is similar, but novolatile accesses happen as soon as there are no volatile writes in front of them – they’re guaranteed to see the same values when they hit the SB, because other Ps don’t modify
moving ownership around use ghost operations to take and release ownership – P can take ownership of unowned locations – P can release ownership of locations P owns (this fits with ownership in VCC, where “unowned” means owned by a data object rather than a thread) discipline in the paper also adds unowned read- only locations, which allows shared non-volatile reading
ex: spinlocks typedef … struct _SPIN_LOCK { volatile int Lock; _(ghost \object prot_obj;) _(invariant !Lock ==> \mine(prot_obj)) } SPIN_LOCK; void Acquire(SPIN_LOCK *SpinLock …) … { int stop; do { …{ //atomic stop = (__interlockedcompareexchange(&SpinLock->Lock, 1, 0) == 0); _(if (stop) \giveup_closed_owner(SpinLock->prot_obj, SpinLock);) } } while (!stop); } Microsoft confidential
key points discipline follows some basic VCC methodology – discipline expressed in terms of ghost state – ghost code “witnesses” conformance to the discipline (much as ghost code is used to witness simulations) – by replacing proof obligations with programming obligations, we’re more likely to get programmers to do it when checking the discipline, we get to assume a SC execution, so we never have to think about the SBs.
the only tricky part of the proof key observation: ownership changes cannot race on their own – if they do, there are executions that violate the discipline therefore, we can pretend that ownership doesn’t get released until the next volatile write
a note on ghosts VCC requires lots of ghost code, incude racy operations on volatile ghost state why doesn’t this introduce flushing? SC code follows discipline on real data => {SC stripped code simulates SC code} SC stripped code follows discipline on real data => {reduction theorem} stripped code simulates SC stripped code => {SC stripped code simulates SC code} stripped code simulates SC code
how close is this to practice? discipline followed almost everywhere in the Hv codebase – even non-interlocked volatile writes are fairly rare exceptions (outside of device ops) are writes where – the write doesn’t race with other writes – racing reads can safely read the old value – ex: releasing a spinlock, broadcasting signals a solution: introduce a new kind of volatile – one reader, multiple writers – keep track of an upper and lower bound – writes must be above the upper bound, – writes raise the upper bound – flush raises the lower bound to the upper bound – reads by other processors raise the lower bound to the value read – this works, but is kind of gross