Multicore programming

Implementing KCAS
Week 5, Monday
Trevor Brown

Announcements
Course Piazza
  Sign up
  Choose your presentation paper there
  Ask questions there (unless it doesn’t seem appropriate to)
  Some questions already clarified
Next week’s paper presenters
  Meet me ASAP to discuss a rough outline of your talk!!
  Make a Piazza post for your paper (as per instructions on my site)
Students not presenting next week
  You have several responsibilities you will be graded on (see my site!)

Last time
Lock-free double-compare single-swap (DCSS)
  Restriction on its usage: consider DCSS(addr1, addr2, exp1, exp2, new2)
    addr1 cannot be a field that is ever modified by DCSS
Started to see how to implement KCAS
Today, we finish this implementation
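As a reminder of what DCSS is expected to do, its behaviour can be written as a sequential specification. This is a minimal single-threaded sketch of the semantics only, not the lock-free implementation from last time; the name dcss_spec and the word_t typedef are assumptions made for illustration. It returns the old value at addr2, which is how the KCASHelp code later in this lecture uses the DCSS result.

```cpp
#include <cassert>
#include <cstdint>

using word_t = std::uintptr_t;  // assumed word type, for illustration only

// Sequential specification of DCSS(addr1, addr2, exp1, exp2, new2).
// A real DCSS must behave as if this whole body executed atomically.
word_t dcss_spec(const word_t* addr1, word_t* addr2,
                 word_t exp1, word_t exp2, word_t new2) {
    assert(addr1 != nullptr && addr2 != nullptr);
    word_t old2 = *addr2;
    if (*addr1 == exp1 && old2 == exp2) {
        *addr2 = new2;  // both comparisons held: perform the single swap
    }
    // Returns the old value at addr2. Note that old2 == exp2 alone does
    // not prove the swap happened: addr1 may not have matched exp1.
    return old2;
}
```

Only addr2 is ever written, which matches the restriction above that addr1 is compared but never modified by DCSS.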

Building KCAS from DCSS [Harris2002]
Facilitate helping with a KCAS descriptor, which stores n rows, each containing: addr, exp, new
The KCAS descriptor also contains a status field, with a value in {Undecided, Succeeded, Failed}
The status field helps coordinate threads
  Prevents scenarios where different threads helping a KCAS have different views of memory, and one thinks the KCAS is finished, while another thinks it is still ongoing (and incorrectly makes changes twice, etc.)
[Figure: KCAS descriptor layout: status, n, then rows (addr1, exp1, new1), (addr2, exp2, new2), …]

KCAS algorithm idea
Proceeds in two phases
Phase 1: lock-free “locking”
  Iterate over the addresses, attempting to change each address from its expected value to a pointer d to the KCAS descriptor
  If we see an unexpected value, then status changes to Failed; otherwise it changes to Succeeded
Phase 2: completion (commit or abort)
  Iterate over the addresses, attempting to change each address from d to either its new value (if status is Succeeded) or its expected value (if status is Failed)
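The two phases can be walked through in a single-threaded model. The sketch below is only an illustration of the control flow: there is no helping and no atomics. The names Descriptor, Row, LOCKED_BY_D and kcas_two_phase_model are hypothetical, and the marker value stands in for "a pointer d to the KCAS descriptor" (it assumes -1 is never a legitimate stored value).

```cpp
#include <cassert>
#include <vector>

enum class Status { Undecided, Succeeded, Failed };

struct Row { long* addr; long exp; long newv; };

struct Descriptor {
    Status status = Status::Undecided;
    std::vector<Row> rows;
};

// Hypothetical marker standing in for a pointer to the descriptor.
// Assumes -1 is never stored as an ordinary value.
constexpr long LOCKED_BY_D = -1;

// Single-threaded model of the two phases (no concurrency, no helping).
bool kcas_two_phase_model(Descriptor& d) {
    assert(d.status == Status::Undecided);

    // Phase 1: "lock" each address by replacing exp with the marker.
    Status outcome = Status::Succeeded;
    for (Row& r : d.rows) {
        if (*r.addr != r.exp) { outcome = Status::Failed; break; }
        *r.addr = LOCKED_BY_D;
    }
    d.status = outcome;

    // Phase 2: replace the marker with new (commit) or exp (abort).
    const bool succ = (d.status == Status::Succeeded);
    for (Row& r : d.rows) {
        if (*r.addr == LOCKED_BY_D) *r.addr = succ ? r.newv : r.exp;
    }
    return succ;
}
```

In the failed case, only the addresses that were already "locked" are touched in phase 2, and they return to their expected values, so memory looks as if the KCAS never ran.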

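KCAS doubly-linked list example: successful KCAS
[Figure: doubly-linked list with nodes pred (15), succ (17) and after (20), showing Delete(17). KCAS descriptor d has status = Undecided and n = 5, with rows for &pred.next (succ to after), &after.prev (new value pred), &pred.mark, &succ.mark (false to true) and &after.mark. The status ends as Succeeded.]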

KCAS doubly-linked list example: failed KCAS
[Figure: the same list and KCAS descriptor d (status = Undecided, n = 5) for Delete(17). The attempt to change succ.mark from false to point to d fails, so the status becomes Failed and the addresses that point to d are changed by CAS from d back to their expected values.]

Keeping helper threads in sync
Key ideas:
In phase 1 (lock-free “locking”), helpers compete to CAS the status from Undecided to Succeeded or Failed
  Only one helper can “win” and change status
  Once the status is Succeeded or Failed, no more “locking” should happen
    I.e., helpers should no longer change addresses to point to the KCAS descriptor
    Accomplish this with DCSS! How?
In phase 2 (completion), all helpers agree on whether to change addresses to new values, or back to expected values
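The "only one helper can win" property of the status CAS can be seen in a few lines. This is a small sketch that uses std::atomic's compare_exchange_strong in place of the CAS primitive used on the slides; the helper name decide_status is hypothetical.

```cpp
#include <atomic>
#include <cassert>

enum Status : int { Undecided = 0, Succeeded = 1, Failed = 2 };

// Try to decide the outcome of a KCAS.
// Returns the status that is installed after the attempt.
int decide_status(std::atomic<int>& status, int proposed) {
    assert(proposed == Succeeded || proposed == Failed);
    int expected = Undecided;
    // Succeeds only if status is still Undecided. If another helper
    // already decided, nothing is written and 'expected' is overwritten
    // with the value that helper installed.
    status.compare_exchange_strong(expected, proposed);
    return status.load();
}
```

Every helper that calls decide_status after the first successful one sees the same decided value, whatever it proposed itself, which is what lets all helpers agree in phase 2.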

Using DCSS in the “locking” phase
Threads use DCSS to “lock” addresses (storing a pointer to a KCAS descriptor)
  DCSS addr1 = status field of the KCAS descriptor
  DCSS exp1 = Undecided
  DCSS addr2 = address to be “locked” for the KCAS
  DCSS exp2 = expected value for the address according to the arguments of KCAS
  DCSS new2 = pointer to the KCAS descriptor
Semantics of DCSS guarantee: KCAS will only successfully “lock” an address if the KCAS status is still Undecided
  (Without this guarantee, something called an ABA problem can occur. More on this later.)

Distinguishing between descriptors
Now that we have DCSS descriptors and KCAS descriptors, we must be able to distinguish between them
Steal another bit from each word (DCSS uses the least significant bit, KCAS uses the 2nd-least significant)
The two least significant bits tell us whether an address contains a value, DCSS descriptor, or KCAS descriptor
  pack(v): shift v left by 2 then OR with 1 [making v “look like” a DCSS descriptor pointer]
  packKCAS(v): shift v left by 2 then OR with 2 [making v “look like” a KCAS descriptor ptr]
  unpack(v): AND v with ~0x3 then shift right by 2 (inverting both pack and packKCAS)
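The bit manipulation described above is short enough to write out. The following is a sketch under stated assumptions: word_t is taken to be std::uintptr_t, the constant names and isDCSS are made up for illustration (the slides only name pack, packKCAS, unpack and isKCAS), and the top two bits of v are assumed to be unused, since shifting left by 2 discards them.

```cpp
#include <cassert>
#include <cstdint>

using word_t = std::uintptr_t;  // assumed word type

constexpr word_t DCSS_TAG = 0x1;  // least significant bit
constexpr word_t KCAS_TAG = 0x2;  // second-least significant bit

// Make v "look like" a DCSS descriptor pointer.
inline word_t pack(word_t v)     { return (v << 2) | DCSS_TAG; }

// Make v "look like" a KCAS descriptor pointer.
inline word_t packKCAS(word_t v) { return (v << 2) | KCAS_TAG; }

// Inverts both pack and packKCAS: clear the tag bits, then shift back.
inline word_t unpack(word_t v)   { return (v & ~word_t{0x3}) >> 2; }

inline bool isDCSS(word_t v)     { return (v & DCSS_TAG) != 0; }
inline bool isKCAS(word_t v)     { return (v & KCAS_TAG) != 0; }
```

A word whose two low bits are both zero is neither kind of descriptor, so readers can treat it as an ordinary value.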

Implementation
Data structures
    struct KCAS_row {
        word_t volatile * addr;
        word_t exp;
        word_t new;   // field name as on the slide; rename it to compile as C++
    };

    struct KCAS_desc {
        word_t volatile status;
        word_t n;
        KCAS_row row[K];
        char padding[64];
    } __attribute__ ((aligned(64)));

Code
    bool KCAS(addr1, ..., exp1, ..., new1, ...) {
        KCAS_desc * d = new KCAS_desc(addr1, ...);
        d->status = Undecided;
        SortRowsByAddress(d); // can skip usually
        return KCASHelp(d);
    }

    word_t KCASRead(word_t volatile * addr) {
        word_t v;
        do {
            v = DCSSRead(addr);   // linearize at the last read here
            if (isKCAS(v)) KCASHelp(unpack(v));
        } while (isKCAS(v));
        return v;
    }

bool KCASHelp(KCAS_desc * d)
{
    int newStatus;
    if (d->status == Undecided) {
        // Phase 1: lock-free "locking"
        // Use DCSS to change addresses to point to the KCAS descriptor
        newStatus = Succeeded;
        for (int i = 0; i < d->n; i++) {
            word_t val2 = DCSS(&d->status, d->row[i].addr,
                               Undecided, d->row[i].exp,
                               packKCAS(d));
            if (val2 != d->row[i].exp) {        // if DCSS failed
                if (isKCAS(val2)) {             // because of a KCAS
                    if (unpack(val2) != d) {    // a DIFFERENT KCAS
                        KCASHelp(unpack(val2));
                        --i; continue;          // retry "locking" this addr
                    }
                    // else another helper "locked" for us
                } else {                        // addr does not contain its exp value
                    newStatus = Failed; break;
                }
            }
        }
        // Status CAS
        CAS(&d->status, Undecided, newStatus);
    }
    // Phase 2: completion
    bool succ = (d->status == Succeeded);
    for (int i = 0; i < d->n; i++) {
        word_t val = (succ) ? d->row[i].new : d->row[i].exp;
        CAS(d->row[i].addr, packKCAS(d), val);
    }
    return succ;
}
Recall: KCAS just returns KCASHelp(d).
Where should we linearize a successful KCAS?

bool KCASHelp(KCAS_desc * d)
(Code as on the previous slide.)
Recall: KCAS just returns KCASHelp(d)
Where should we linearize a KCAS?
  At the status CAS!
  The behaviour of all helper threads, and hence the outcome of the KCAS, is decided there.
  (Crucial point: everything is “locked” at that time.)
Why does this work?
  Complicated argument! Model checking + proof sketch in paper. Deeper than we need to go.

Overheads in this implementation
Allocating and reclaiming descriptors is expensive
  Each KCAS allocates one KCAS descriptor, and at least k DCSS descriptors
Must perform at least k DCSS operations (to “lock”), k CAS instructions (to change each address to its new / expected value), and one more CAS (to change status)
  Each DCSS requires at least 2 CAS instructions
  That’s a total of at least 2k + k + 1 = 3k+1 CAS instructions
A 5-word update in a doubly linked list needs 6 descriptors and 16 CAS instructions!
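The counting argument above can be checked mechanically. The sketch below just encodes the lower bounds stated on the slide; the struct and function names are hypothetical.

```cpp
#include <cassert>

struct KcasCost {
    int descriptors;       // descriptors allocated
    int cas_instructions;  // CAS instructions executed
};

// Lower bounds for one KCAS over k words, as counted on the slide.
inline KcasCost kcas_min_cost(int k) {
    assert(k >= 1);
    KcasCost c;
    c.descriptors = 1 + k;               // one KCAS descriptor + k DCSS descriptors
    c.cas_instructions = 2 * k + k + 1;  // k DCSS (2 CAS each) + k completion CAS + status CAS
    return c;
}
```

For k = 5 this gives 6 descriptors and 16 CAS instructions, the figures quoted for the doubly linked list update.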

Eliminating descriptor alloc/free
Paper by me and Maya Arbel-Raviv at DISC 2017
Each thread can reuse one KCAS descriptor and one DCSS descriptor
Requires care to avoid:
  reading invalid descriptors (and doing CAS with malformed arguments)
  helping old (finished) operations through stored pointers to their descriptors, which now represent a new (different) operation

Synthetic KCAS benchmark
10 randomized trials in which:
  n threads repeatedly do the following for 3 seconds
    Pick K uniform random slots in an array
    Read integers stored in those slots
    Do a KCAS to change each of the K slots from the value exp that we read, to a new value of exp + 1
Report average throughput (KCAS operations/sec) over all trials
[Figure: array of integer slots]

Sanity checking the experiment
Important to perform sanity checks wherever you can!
  Helps to catch obvious (and non-obvious) mistakes
One good sanity check: checksum-based validation
  Reduce the data structure to a number (a data structure checksum)
  Reduce each thread’s completed operations to a number (a thread checksum)
  Verify that thread checksums “match” the data structure checksum
    (I.e., the work the threads think they’ve done is reflected in the data structure!)
Creativity needed to come up with good checksum functions

Checksum validation for our benchmark
Data structure checksum
  Sum of all array entries
  Each successful KCAS increments K array slots by 1
    Adds K to the data structure checksum
Thread checksum
  K*X where X = # of successful KCAS operations performed by the thread
Validation
  sum(thread checksums) == data structure checksum
  (If a KCAS operation is lost, or screws up the array, validation [hopefully] fails)
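The validation step can be sketched as three small functions. This is an illustration only, with hypothetical names, and it makes two assumptions that the slide does not state: the array starts with all slots at zero (otherwise subtract the initial sum), and the K slots chosen by each KCAS are distinct.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Data structure checksum: sum of all array entries.
// Assumes every slot started at 0.
std::int64_t array_checksum(const std::vector<std::int64_t>& slots) {
    return std::accumulate(slots.begin(), slots.end(), std::int64_t{0});
}

// Thread checksum: K * X, where X is the number of successful KCAS
// operations the thread performed.
std::int64_t thread_checksum(int K, std::int64_t successes) {
    assert(K > 0 && successes >= 0);
    return static_cast<std::int64_t>(K) * successes;
}

// Validation: the work the threads claim must equal what the array shows.
bool validate(const std::vector<std::int64_t>& slots,
              const std::vector<std::int64_t>& successes_per_thread,
              int K) {
    std::int64_t total = 0;
    for (std::int64_t x : successes_per_thread) {
        total += thread_checksum(K, x);
    }
    return total == array_checksum(slots);
}
```

In a benchmark harness, each thread would count its own successful KCAS operations locally, and validate would run once after all threads have stopped.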

Experimental System
2x Intel E7-4830 v3
12 cores (24 threads) per socket
128GB RAM
Ubuntu LTS, G++ -O3
Fast allocator: jemalloc 4.2.1

Results
[Figure: throughput in operations per microsecond (higher is better) versus number of concurrent threads]

How much does reusing descriptors help with memory consumption?
[Figure: peak memory usage for descriptors in bytes (lower is better) versus number of concurrent threads]
