Presentation on theme: "Multicore programming" — Presentation transcript:

1 Multicore programming
Transactional memory
Week 5 – Wednesday
Trevor Brown

2 Announcements
I moved presentations as appropriate to dodge Donald Knuth's lecture; see the new schedule.
Submitting assignments / paper reviews / presentation feedback:
Piazza does not allow students to make private replies to posts, but it does allow students to make private posts.
Make a private post in the designated folder.

3 Review for assignment: Overview of proving linearizability
Goal: argue that every concurrent execution E of an algorithm has an equivalent linearized (sequential) execution L.
Equivalent means the two executions have the same operations, and each operation returns the same value in E and L.
Where does the linearized execution L come from?
Choose linearization points for your algorithm. The linearization point for an operation must occur during the operation. You might be able to explicitly give a known linearization point, or you may have to argue that a correct linearization point exists.
L is just E, except each operation is executed atomically at its linearization point.
What happens after you pick linearization points?
You have to prove that each operation in E returns the same value as it does in L.
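To make the idea of a linearization point concrete, here is a minimal sketch (my own illustration, not from the slides): a shared counter whose increment operation has an obvious, explicit linearization point.

```cpp
#include <atomic>

// A counter whose increment has an explicit linearization point.
std::atomic<int> counter{0};

int increment() {
    // Linearization point: the fetch_add below. The entire operation
    // appears to take effect atomically at this one instruction, so any
    // concurrent execution E is equivalent to the sequential execution L
    // obtained by running each increment atomically at its fetch_add,
    // and each increment returns the same value in E and L.
    return counter.fetch_add(1) + 1;  // returns the new value
}
```

For more complex operations (like the list searches later in this lecture), no single instruction works for every execution, and you must argue that a correct linearization point exists.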

4 Review for assignment: Mechanics of proving linearizability
Let E be any execution.
Let L be the linearized execution obtained from E by executing each operation atomically at its linearization point.
Let O be any operation in E, VE be the value it returns in E, and VL be the value it returns in L.
Want to prove: VE = VL.
What information can you use to prove this?
E is just "any" execution, and O is just "any" operation, so there is not much execution-specific detail to grab onto; you must use facts that are true for every execution.
Since O is "any" operation, you end up proving results for all operations. Since E is "any" execution, you end up proving results for all executions.

5 Last time
Implementing KCAS (didn't quite finish)

6 Exactly how does KCAS help?

7 Recall: our KCAS doubly-linked list
When deleting a node, use KCAS to also mark that node.
When modifying or deleting any node, use KCAS to verify the node is not marked.
This helps us do complex updates easily.
[Diagram: Delete(17) uses KCAS to unlink and mark node 17 between pred = 15 and succ = 20; Insert(17) uses KCAS to link a new node n = 17 between pred = 15 and succ = 20]

8 Recall: KCAS did not help us prove that searches work

pair<node, node> InternalSearch(key_t k)
  pred = head
  succ = head
  while (true)
    if (succ == NULL or succ.key >= k) return make_pair(pred, succ)
    pred = succ
    succ = KCASRead(succ.next)

bool Contains(key_t k)
  pred, succ = InternalSearch(k)
  return (succ != NULL and succ.key == k)

Where to linearize a Contains that returns true? Prove there exists a time during InternalSearch when succ is in the list, and linearize then.
Where to linearize a Contains that returns false? Prove there exists a time during InternalSearch when pred and succ were both in the list and pred.next = succ. Linearize then.

9 Usefulness of KCAS
KCAS is an awesome tool, but it doesn't solve everything.
It makes it easy to change multiple addresses atomically (and in a lock-free way).
Locks do this too (but not in a lock-free way); you can implement KCAS with locks, if you like (and that can also be fast).
It does not always make it easy to argue that searches work. And searches are part of updates! So, we still need ad-hoc correctness arguments for updates.
Lock-based algorithms that do not lock while searching face this same challenge: proving correctness for searches is hard.
Open question: how do we get fast data structures with easy/trivial correctness proofs for searches (and the search/traversal part of updates)?
This is why I think lock-free algorithms and fast lock-based ones are very similar (in performance and complexity).

10 This time
A technology that can help with correctness arguments for searches.
Implemented in some modern hardware:
Recent Intel chips
IBM POWER8+ (not as good as Intel's implementation)
Soon ARM (how good will it be?)
It can also be used to greatly accelerate some algorithms, such as KCAS.
It can even accelerate lock-based algorithms.

11 Transactional memory (TM)
Allows a programmer to perform arbitrary blocks of code atomically.
Note: locks also do this (just not always efficiently or easily).

bool transfer(int *src, int *dst, int amt)
  bool result = false;
  atomic {
    if (*src > amt) {
      *src -= amt;
      *dst += amt;
      result = true;
    }
  }
  return result;

12 Definitions
Each transaction commits or aborts.
Commit: as if the entire transaction happened atomically.
Abort: as if the transaction never happened at all.
Read-set: the set of all addresses read by a transaction.
Write-set: the set of all addresses written by a transaction.
Data-set: (read-set) + (write-set).
Data conflict: two concurrent transactions have a data conflict if the write-set of one intersects the data-set of the other (examples soon).
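The data-conflict definition above can be expressed directly in code. This is an illustrative helper of my own (the names AddrSet, dataConflict, and the set-based representation are not from the slides): each transaction is summarized by its read-set and write-set of addresses, and two transactions conflict when the write-set of one intersects the data-set (read-set plus write-set) of the other.

```cpp
#include <set>

// A transaction's read-set or write-set, as a set of addresses.
using AddrSet = std::set<const void*>;

bool intersects(const AddrSet& a, const AddrSet& b) {
    for (const void* p : a)
        if (b.count(p)) return true;
    return false;
}

// Two concurrent transactions (r1, w1) and (r2, w2) have a data conflict
// iff the write-set of one intersects the data-set of the other.
bool dataConflict(const AddrSet& r1, const AddrSet& w1,
                  const AddrSet& r2, const AddrSet& w2) {
    return intersects(w1, r2) || intersects(w1, w2)
        || intersects(w2, r1) || intersects(w2, w1);
}
```

Note that two transactions that merely read the same address do not conflict: only writes create conflicts.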

13 Transactional operations
Studying Intel's hardware implementation of TM:
xbegin: start a new transaction and return XSTARTED
xend: try to commit the transaction (may abort the transaction)
xabort: abort the transaction
read *addr: read, and add addr to the transaction's read-set (tracked in the L3 cache)
write *addr = val: write, and add addr to the transaction's write-set (tracked in the L1 cache)
Note: xbegin, xend, xabort are actual x86/64 assembly instructions.
Instruction set: TSX-NI / RTM (provided in several modern Intel chips)

14 High level idea
A transaction works sort of like a lock, but can abort.
Suppose thread p reads *src at line 3, then thread q modifies *src.
This causes a data conflict, and p's transaction aborts (since its view of memory might no longer be atomic).

bool transfer(int *src, int *dst, int amt)
1  bool result = false;
2  xbegin();
3  if (*src > amt) {
4    *src -= amt;
5    *dst += amt;
6    result = true;
7  }
8  xend();
9  return result;

Work might not get done because of an abort, so we must handle aborts somehow.

15 A bit more detail on Intel's Hardware TM (HTM)
Threads can perform transactions that read/write any address.
They can also read/write/CAS/fetch&add/etc. any address without using transactions.
Transactions abort as soon as there is a data conflict.
If a transaction T reads or writes an address, and another thread then writes to that address, T will abort.
Transactions can abort at any time, for any reason.

16 What happens when a transaction aborts
When a transaction aborts, the thread jumps back to its last xbegin, and this xbegin returns XABORTED (rather than XSTARTED, which it returns when a transaction starts).

bool transfer(int *src, int *dst, int amt)
  bool result = false;
  xbegin();
  if (*src > amt) {
    *src -= amt;
    *dst += amt;
    result = true;
  }
  xend();
  return result;

Example trace: p reads *src, p writes *src, p writes *dst, then q writes *src. This is a data conflict, so p's transaction aborts, and p jumps back to xbegin.

17 Handling aborts
Branch based on the return value of xbegin; handle the abort in the else case.
Useful to record the number of aborts, debug, change code behaviour, etc.
Usually desirable to retry aborted transactions.

bool transfer(int *src, int *dst, int amt)
  bool result = false;
retry:
  if (xbegin() == XSTARTED) {
    if (*src > amt) {
      *src -= amt;
      *dst += amt;
      result = true;
    }
    xend();
  } else { // we aborted
    handleTheAbort();
    goto retry;
  }
  return result;

18 Example: transactional hash table

int sequentialInsert(int key)
  int h = hash(key);
  for (int i = 0; i < capacity; ++i) {
    int index = (h + i) % capacity;
    int found = data[index];
    if (found == key) {
      return false;
    } else if (found == NULL) {
      data[index] = key;
      return true;
    }
  }
  return FULL;

int insert(int key)
retry:
  if (xbegin() == XSTARTED) {
    int result = sequentialInsert(key);
    xend();
    return result;
  } else {
    // transaction aborted
    goto retry;
  }
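A self-contained, runnable version of sequentialInsert is below. The names capacity, data, and FULL follow the slide; EMPTY replacing NULL is my assumption, since the slots hold ints rather than pointers, as are the fixed table size and the modular hash function.

```cpp
// Open-addressed hash table with linear probing, as on the slide.
constexpr int EMPTY = 0;   // assumes 0 is never used as a real key
constexpr int FULL  = -1;  // distinct from true (1) and false (0)
constexpr int capacity = 8;
int data[capacity] = {};   // all slots start EMPTY

int hash(int key) { return key % capacity; }

int sequentialInsert(int key) {
    int h = hash(key);
    for (int i = 0; i < capacity; ++i) {
        int index = (h + i) % capacity;  // probe sequence: h, h+1, ...
        int found = data[index];
        if (found == key) {
            return false;                // key already present
        } else if (found == EMPTY) {
            data[index] = key;           // claim the first empty slot
            return true;
        }
    }
    return FULL;                         // probed every slot: table full
}
```

Wrapping this entire function in one transaction (as insert does on the slide) makes the whole probe-and-write sequence atomic, so no other thread can observe a half-completed insert.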

19 The problem with HTM
Transactions can abort for any reason, so there is no progress guarantee!
It is not hard to write code in which all transactions abort forever.
Example: if a transaction causes a page fault, it will abort, but the page fault will not be served! So, if the transaction retries, it will abort again!
Need to provide a fallback code path (e.g., using locks) to run when a transaction aborts too many times.
Two code paths: fast path (using HTM), fallback path.

20 Transactional lock elision (TLE)
Easiest and most common choice for the fallback code path: acquire a global lock, then execute the transaction's code (without xbegin/xend).
Transactions on the fast path should not run while the global lock is held; this prevents transactions from changing data that the global lock is supposed to protect.
So, on the fast path, each transaction reads the lock state:
If it is locked, the transaction aborts, and the thread waits until the lock is free to try again.
If it is not locked, the transaction proceeds.
If the lock is acquired at any time during the transaction, this will be a data conflict, and the transaction will abort!
Note: transactions only need to read the lock to ensure that the operation succeeds only if no one else holds the lock. Without HTM, you would need to acquire the lock to guarantee no one else holds it when you change the data structure.

21 Example: TLE-based hash table

int insert(int key)
  int retriesLeft = 5;
retry:
  if (xbegin() == XSTARTED) {        // fast path (transactions)
    if (locked) xabort();
    int result = sequentialInsert(key);
    xend();
    return result;
  } else {
    // transaction aborted
    while (locked) { /* wait */ }
    if (--retriesLeft > 0) goto retry;
    acquire(&locked);                // fallback path (global lock)
    int result = sequentialInsert(key);
    release(&locked);
    return result;
  }

Why does TLE work?
What do we know about a transaction that commits? If it read an address, and then that address was changed, it would have aborted. So, the system behaviour is the same as it would be if the transaction had acquired locks on all of the addresses it observed!
What about the fallback path? We hold the global lock, so no one else can access anything. Equivalent to running in a single-threaded system.
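The TLE control flow above can be sketched in runnable form with the transaction path stubbed out. This is my own illustration: the names locked/acquire/release mirror the slide, but the fast path is replaced by a stub that always "aborts" (as every attempt would on a machine without TSX), so each operation retries, waits for the lock, and eventually takes the global-lock fallback.

```cpp
#include <atomic>

// Global lock used by the TLE fallback path.
std::atomic<bool> locked{false};

void acquire(std::atomic<bool>* l) {
    bool expected = false;
    while (!l->compare_exchange_weak(expected, true))
        expected = false;  // CAS failed: reset and retry
}
void release(std::atomic<bool>* l) { l->store(false); }

int value = 0;  // stand-in for the protected data structure

int tleIncrement() {
    int retriesLeft = 5;
    while (retriesLeft-- > 0) {
        // Fast path would go here:
        //   if (xbegin() == XSTARTED) { if (locked) xabort(); ...; xend(); }
        // Stub: pretend the transaction aborted, wait for any lock holder.
        while (locked.load()) { /* wait */ }
    }
    // Fallback path: serialize under the global lock.
    acquire(&locked);
    int result = ++value;
    release(&locked);
    return result;
}
```

With real HTM, most operations would commit on the fast path and never touch the lock; the fallback exists only because transactions carry no progress guarantee.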

22 External binary search tree (BST) [also called leaf-oriented]
Leaves contain real keys; internal nodes contain dummy keys.
Insert: add a leaf and an internal node.
Delete: remove a leaf and an internal node.
Why use them? Simpler deletion than in traditional internal (node-oriented) BSTs.
[Diagram: an external BST whose internal nodes route searches to leaves; Insert 7 adds leaf 7 plus a new internal node, and Delete 2 removes leaf 2 plus its internal parent]
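A minimal sequential insert for an external BST is sketched below. It fleshes out the slide's seqInsert; the Node layout, the empty-tree case, and the choice of the larger key as the new internal node's dummy key are my assumptions.

```cpp
// External (leaf-oriented) BST: only leaves hold real keys; internal
// nodes hold dummy routing keys (search goes left iff key < node->key).
struct Node {
    int key;
    Node *left = nullptr, *right = nullptr;
};
bool isLeaf(Node* n) { return n->left == nullptr; }

Node* root = nullptr;

int seqInsert(int key) {
    if (root == nullptr) { root = new Node{key}; return true; }
    Node *p = root, *l = root;
    while (!isLeaf(l)) {             // descend to the leaf for this key
        p = l;
        l = (key < l->key) ? l->left : l->right;
    }
    if (key == l->key) return false; // key already present
    // Replace leaf l with a new internal node whose children are l and
    // a new leaf holding key; the dummy key is the larger of the two.
    Node* newLeaf = new Node{key};
    Node* newParent = (key < l->key)
        ? new Node{l->key, newLeaf, l}
        : new Node{key, l, newLeaf};
    if (l == root) root = newParent;
    else if (key < p->key) p->left = newParent;
    else p->right = newParent;
    return true;
}
```

Deletion is symmetric: remove the leaf and splice its internal parent out by replacing the parent with the leaf's sibling, which is why external BSTs have simpler deletion than internal ones.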

23 TLE-based external BST
Identical code to the hash table! TLE is basically plug-and-play.

int seqInsert(int key)
  auto p = root;
  auto l = root;
  while (!isLeaf(l)) {
    p = l;
    l = (key < p.key) ? p.left : p.right;
  }
  if (key == l.key) { // found the key
    return false;
  } else {
    auto newParent = /* new internal node whose children are l and a new leaf containing key */
    if (key < p.key) p.left = newParent;
    else p.right = newParent;
    return true;
  }

int insert(int key)
  int retries = 5;
retry:
  if (xbegin() == XSTARTED) {
    if (locked) xabort();
    int result = seqInsert(key);
    xend();
    return result;
  } else {
    // transaction aborted
    while (locked) { /* wait */ }
    if (--retries > 0) goto retry;
    acquire(&locked);
    int result = seqInsert(key);
    release(&locked);
    return result;
  }

24 Performance of TLE-based BST
[Chart: operations per second vs. concurrent threads for the TLE-based BST and a lock-free algorithm; 50% insert, 50% delete, key range [0, 10^5), updates on uniform random keys. TLE achieves +77% operations per second.]
Why does TLE perform well? Operations are low contention: 99.99% complete on the fast path.

25 Can we do better than TLE?
When TLE falls flat: 1 thread does "heavy" operations that likely run on the fallback path.
[Chart: operations per second vs. concurrent processes for TLE, a lock-free algorithm, and 2-path con, on a similar BST workload with key range [0, 10^5); 80% of heavy ops run on the fallback path.]
Why does TLE perform worst here? Concurrency bottleneck! While the fallback lock is held, 47 processes must wait!
What makes an operation "heavy?" A high likelihood of aborts; risk factors:
Data conflicts (you read, then others write)
Writing too much (max L1 cache size) [L1 is divided between hyperthreads]
Reading too much (max L3 cache size)
False sharing
Illegal instructions (syscalls, ...)

