Optimistic Intra-Transaction Parallelism using Thread Level Speculation. Chris Colohan¹, Anastassia Ailamaki¹, J. Gregory Steffan² and Todd C. Mowry¹,³ (¹Carnegie Mellon University, ²University of Toronto, ³Intel Research Pittsburgh)
2 Chip Multiprocessors are Here! 2 cores now; soon 4, 8, 16, or 32. Multiple threads per core. How do we best use them? (IBM Power5, AMD Opteron, Intel Yonah)
3 Multi-Core Enhances Throughput [Diagram: Users send Transactions to a Database Server running a DBMS over a Database] Cores can run concurrent transactions and improve throughput
4 Using Multiple Cores [Diagram: Users send Transactions to a Database Server running a DBMS over a Database] Can multiple cores improve transaction latency?
5 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-query parallelism Used for long-running queries (decision support) Does not work for short queries Short queries dominate in commercial workloads
6 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-transaction parallelism Each thread spans multiple queries Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution…
7 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-transaction parallelism Breaks transaction into threads Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution… Thread Level Speculation (TLS) makes parallelization easier.
8 Thread Level Speculation (TLS) [Diagram: a sequential run of *p=, *q=, =*p, =*q split in time into two parallel epochs (Epoch 1, Epoch 2)]
9 Thread Level Speculation (TLS) [Diagram: sequential vs. parallel execution; Epoch 2 reads *p before Epoch 1's store (R2 Violation!) and restarts] Use epochs. Detect violations. Restart to recover. Buffer state. Oldest epoch: never restarts, no buffering. Worst case: sequential. Best case: fully parallel. Data dependences limit performance.
10 Transaction Programmer DBMS Programmer Hardware Developer A Coordinated Effort Choose epoch boundaries Remove performance bottlenecks Add TLS support to architecture
11 So what’s new? Intra-transaction parallelism Without changing the transactions With minor changes to the DBMS Without having to worry about locking Without introducing concurrency bugs With good performance Halve transaction latency on four cores
12 Related Work Optimistic Concurrency Control (Kung 82) Sagas (Molina & Salem 87) Transaction chopping (Shasha 95)
13 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions
14 Case Study: New Order (TPC-C) Only dependence is the quantity field Very unlikely to occur (1/100,000) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } 78% of transaction execution time
15 Case Study: New Order (TPC-C) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } becomes: GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; TLS_foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; }
16 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions
17 Dependences in DBMS [Diagram: data dependences between epochs over time]
18 Dependences in DBMS Dependences serialize execution! Example: statistics gathering ( pages_pinned++ ): TLS maintains serial ordering of the increments. To remove, use per-CPU counters. Performance tuning: profile execution, remove bottleneck dependence, repeat.
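A minimal sketch of the per-CPU counter idea, in C. The names here (NUM_CPUS, cpuid(), the counter functions) are hypothetical illustrations, not the DBMS's actual API:

#define NUM_CPUS 4

int cpuid(void);  /* hypothetical: returns this CPU's index, 0..NUM_CPUS-1 */

/* Sketch: per-CPU statistics counters.  Each CPU increments its own
 * cache-line-padded slot, so TLS sees no cross-epoch dependence. */
struct padded_counter {
    long count;
    char pad[64 - sizeof(long)];   /* keep counters on separate cache lines */
};

static struct padded_counter pages_pinned[NUM_CPUS];

void inc_pages_pinned(void) {
    pages_pinned[cpuid()].count++; /* no shared location == no violation */
}

long read_pages_pinned(void) {     /* exact sum only needed off the hot path */
    long sum = 0;
    for (int i = 0; i < NUM_CPUS; i++)
        sum += pages_pinned[i].count;
    return sum;
}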
19 Buffer Pool Management [Diagram: CPU calls get_page(5) against the buffer pool (ref: 1), then put_page(5) (ref: 0)]
20 Buffer Pool Management [Diagram: two epochs each call get_page(5) / put_page(5); the reference count forces one order] TLS ensures the first epoch gets the page first. Who cares? TLS maintains original load/store order, but sometimes this is not needed.
21 Buffer Pool Management [Diagram: get_page(5) / put_page(5) run outside of speculation] Escape speculation: invoke the operation, store an undo function, resume speculation. This is safe because get_page is isolated (undoing it will not affect other transactions) and undoable (put_page returns the system to its initial state).
22 Buffer Pool Management [Diagram: an epoch escapes speculation for get_page(5) and then put_page(5)] Not undoable!
23 Buffer Pool Management [Diagram: put_page(5) deferred to the end of the epoch] Delay put_page until end of epoch: avoid the dependence.
24 Removing Bottleneck Dependences We introduce three techniques: Delay operations until non-speculative (mutex and lock acquire and release; buffer pool, memory, and cursor release; log sequence number assignment). Escape speculation (buffer pool, memory, and cursor allocation). Traditional parallelization (memory allocation, cursor pool, error checks, false sharing). A sketch of the first technique follows.
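A minimal sketch of "delay until non-speculative", assuming a per-epoch deferred-operation queue. The tls_* names, the queue size, and put_page_deferred are hypothetical stand-ins for the runtime the talk assumes; page_t and put_page come from the DBMS as in the slides:

typedef struct page page_t;    /* from the DBMS, as in the talk */
void put_page(page_t *p);

typedef void (*deferred_fn)(void *);

struct deferred_op { deferred_fn fn; void *arg; };

#define MAX_DEFERRED 64
static struct deferred_op deferred[MAX_DEFERRED];  /* one queue per epoch */
static int n_deferred;

/* Speculative code enqueues the operation instead of performing it. */
void tls_enqueue_when_homefree(deferred_fn fn, void *arg) {
    deferred[n_deferred].fn  = fn;
    deferred[n_deferred].arg = arg;
    n_deferred++;
}

/* Called by the TLS runtime once this epoch is homefree (oldest). */
void tls_run_deferred(void) {
    for (int i = 0; i < n_deferred; i++)
        deferred[i].fn(deferred[i].arg);
    n_deferred = 0;
}

/* Example: defer put_page() instead of calling it speculatively. */
static void put_page_thunk(void *p) { put_page((page_t *)p); }
void put_page_deferred(page_t *page) {
    tls_enqueue_when_homefree(put_page_thunk, page);
}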
25 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions
26 Experimental Setup Detailed simulation: superscalar, out-of-order, 128-entry reorder buffer; memory hierarchy modeled in detail. TPC-C transactions on BerkeleyDB. In-core database, single user, single warehouse. Measure interval of 100 transactions. Measuring latency, not throughput. [Diagram: four CPUs, each with a 32KB 4-way L1 cache, sharing a 2MB 4-way L2 cache and the rest of the memory system]
27 Optimizing the DBMS: New Order [Chart: normalized time after successive optimizations: Sequential, No Optimizations, Latches, Locks, Malloc/Free, Buffer Pool, Cursor Queue, Error Checks, False Sharing, B-Tree, Logging; time split into Idle CPU, Violated, Cache Miss, Busy] Cache misses increase. Other CPUs not helping. Can't optimize much more. 26% improvement.
28 Optimizing the DBMS: New Order [Chart: same optimization sequence as the previous slide] This process took me 30 days and <1200 lines of code.
29 Other TPC-C Transactions [Chart: normalized time for New Order, Delivery, Stock Level, Payment, Order Status; time split into Idle CPU, Failed, Cache Miss, Busy] 3/5 transactions speed up by 46-66%.
30 Conclusions TLS makes intra-transaction parallelism practical Reasonable changes to transaction, DBMS, and hardware Halve transaction latency
31 Needed backup slides (not done yet) 2 proc. Results Shared caches may change how you want to extract parallelism! Just have lots of transactions: no sharing TLS may have more sharing
Any questions? For more information, see:
Backup Slides Follow…
LATCHES
35 Latches Mutual exclusion between transactions. Cause violations between epochs: the read-test-write cycle is a RAW dependence. Not needed between epochs: TLS already provides mutual exclusion!
36 Latches: Aggressive Acquire [Diagram: each epoch runs latch_cnt++ …work… and enqueues the release; once homefree it commits the work and runs latch_cnt--] Large critical section.
37 Latches: Lazy Acquire [Diagram: each epoch enqueues the acquire, does its work, and enqueues the release; once homefree it acquires, commits the work, and releases] Small critical sections. A sketch follows.
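A sketch of lazy acquire, reusing the hypothetical deferred-op queue from the sketch after slide 24. The latch_t type, latch_acquire_now/latch_release_now, and tls_is_speculative are assumed stand-ins for the real latch and TLS primitives:

typedef struct latch latch_t;
void latch_acquire_now(latch_t *l);   /* the real latch primitives */
void latch_release_now(latch_t *l);
int  tls_is_speculative(void);
void tls_enqueue_when_homefree(void (*fn)(void *), void *arg);

static void do_acquire(void *l) { latch_acquire_now((latch_t *)l); }
static void do_release(void *l) { latch_release_now((latch_t *)l); }

/* Sketch: while speculative, only record the intent; at commit the
 * runtime acquires, commits the buffered work, and releases, so the
 * real critical section shrinks to commit time. */
void latch_acquire(latch_t *l) {
    if (tls_is_speculative())
        tls_enqueue_when_homefree(do_acquire, l);   /* (enqueue acquire) */
    else
        latch_acquire_now(l);
}

void latch_release(latch_t *l) {
    if (tls_is_speculative())
        tls_enqueue_when_homefree(do_release, l);   /* (enqueue release) */
    else
        latch_release_now(l);
}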
HARDWARE
39 TLS in Database Systems [Diagram: non-database TLS epochs are short; database epochs are long and run within concurrent transactions] Large epochs: more dependences (must tolerate), more state (bigger buffers).
40 Feedback Loop I know this is parallel! for() { do_work(); } becomes par_for() { do_work(); } Must…Make…Faster [Cartoon: think, feed back, repeat]
41 Violations == Feedback [Diagram: sequential vs. parallel execution; the R2 Violation! is reported with the addresses involved (0x0FD8, 0xFD20, 0x0FC0, 0xFC18)] Must…Make…Faster
42 Eliminating Violations [Diagram: eliminating the *p dependence leaves a later *q Violation!] Optimization may make it slower?
43 Tolerating Violations: Sub-epochs [Diagram: with sub-epochs, a *q violation restarts only the enclosing sub-epoch rather than the whole epoch]
44 Sub-epochs Started periodically by hardware. How many? When to start? Hardware implementation: just like epochs; use more epoch contexts. No need to check violations between sub-epochs within an epoch.
45 Old TLS Design [Diagram: four CPUs with private L1 caches over a shared L2 and the rest of the memory system] Buffer speculative state in write-back L1 cache. Detect violations through invalidations. Rest of system only sees committed data. Restart by invalidating speculative lines. Problems: L1 cache not large enough; later epochs only get values on commit.
46 New Cache Design [Diagram: four CPUs with private L1 caches over shared L2 caches kept coherent by invalidation] Buffer speculative and non-speculative state for all epochs in the L2. Speculative writes immediately visible to the L2 (and later epochs). Detect violations at lookup time. Invalidation coherence between L2 caches. Restart by invalidating speculative lines.
47 New Features [Diagram: four CPUs, L1 caches, shared L2] Speculative state in L1 and L2 cache. Cache line replication (versions). Data dependence tracking within the cache. Speculative victim cache. New!
48 Scaling [Chart: normalized time for Sequential, 2 CPUs, 4 CPUs, 8 CPUs, and the same modified with items/transaction; time split into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, Instruction Execution]
49 Evaluating a 4-CPU system [Chart: normalized time; bars are Sequential (original benchmark run on 1 CPU), TLS Seq (parallelized benchmark run on 1 CPU), No Sub-epoch (without sub-epoch support), Baseline (parallel execution), No Speculation (ignore violations: the Amdahl's Law limit); time split into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, Instruction Execution]
50 Sub-epochs: How many/How big? Supporting more sub-epochs is better. Spacing depends on location of violations. Even spacing is good enough.
51 Query Execution Actions taken by a query: Bring pages into buffer pool Acquire and release latches & locks Allocate/free memory Allocate/free and use cursors Use B-trees Generate log entries These generate violations.
52 Applying TLS 1. Parallelize loop 2. Run benchmark 3. Remove bottleneck 4. Go to 2
53 Outline Hardware Developer Transaction Programmer DBMS Programmer
54 Violation Prediction [Diagram: a predictor identifies the *q dependence and synchronizes it, eliminating the R1/W2 violation]
55 Violation Prediction [Diagram: predictor synchronizes the *q dependence] Predictor problems: large epochs mean many predictions; a failed prediction means a violation; an incorrect prediction means a large stall. Two predictors required: last store, dependent load.
56–61 TLS Execution [Animation over six slides: four CPUs with per-line SM (speculatively modified) and SL (speculatively loaded) bits in the shared cache; CPU 1's store to *p sets SM, later epochs' loads of *p and *q set SL, and the conflicting store/load pair on *p raises an R2 Violation!]
62–64 Replication [Animation: epochs on different CPUs store *p and *q into the same cache line] Can't invalidate a line if it contains two epochs' changes. On conflict, replicate the line. Replication makes epochs independent and enables sub-epochs.
65 Sub-epochs [Diagram: one epoch split into sub-epoch contexts 1a–1d, each with its own SM/SL bits; a violation on *p restarts from the offending sub-epoch] Uses more epoch contexts. Detection/buffering/rewind is "free". More replication: speculative victim cache.
66–71 get_page() wrapper

page_t *get_page_wrapper(pageid_t id) {
  static tls_mutex mut;
  page_t *ret;

  tls_escape_speculation();    /* no violations while calling get_page() */
  check_get_arguments(id);     /* may get bad input data from speculative thread! */
  tls_acquire_mutex(&mut);     /* only one epoch per transaction at a time */
  ret = get_page(id);
  tls_release_mutex(&mut);
  tls_on_violation(put, ret);  /* how to undo get_page() */
  tls_resume_speculation();
  return ret;
}

Wraps get_page(). Isolated: undoing this operation does not cause cascading aborts. Undoable: an easy way to return the system to its initial state. Can also be used for: cursor management, malloc().
72 TPC-C Benchmark [Diagram: Company → Warehouse 1…W → District 1…10 → Customer 1…3k]
73 TPC-C Benchmark [Schema with cardinalities: Warehouse W; District W×10; Customer W×30k; Stock W×100k; Item 100k; Order W×30k+; Order Line W×300k+; New Order W×9k+; History W×30k+]
74 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... = hash[19];... hash[21] =... = hash[33];... hash[30] =... = hash[10];... hash[25] =...
75 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... = hash[19];... hash[21] =... = hash[33];... hash[30] =... = hash[10];... hash[25] =... Processor AProcessor BProcessor CProcessor D Thread 1Thread 2Thread 3Thread 4
76 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... = hash[19];... hash[21] =... = hash[33];... hash[30] =... = hash[10];... hash[25] =... Processor AProcessor BProcessor CProcessor D Thread 1Thread 2Thread 3Thread 4 Violation!
77 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... attempt_commit() = hash[19];... hash[21] =... attempt_commit() = hash[33];... hash[30] =... attempt_commit() Processor AProcessor BProcessor CProcessor D Thread 1Thread 2Thread 3Thread 4 Violation! = hash[10];... hash[25] =... attempt_commit()
78 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... attempt_commit() = hash[19];... hash[21] =... attempt_commit() = hash[33];... hash[30] =... attempt_commit() = hash[10];... hash[25] =... attempt_commit() Processor AProcessor BProcessor CProcessor D Thread 1Thread 2Thread 3Thread 4 Violation! Redo = hash[10];... hash[25] =... attempt_commit() Thread 4
79 TLS Hardware Design What’s new? Large threads Epochs will communicate Complex control flow Huge legacy code base How does hardware change? Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences Aggressive update propagation (implicit forwarding) Sub-epochs
80 L1 Cache Line SL bit: "L2 cache knows this line has been speculatively loaded." On violation or commit: clear. SM bit: "This line contains speculative changes." On commit: clear. On violation: SM means Invalid. Otherwise, just like a normal cache. [Line fields: SL, SM, Valid, LRU, Tag, Data]
81 Escaping Speculation Speculative epoch wants to make a system-visible change! Ignore SM lines while escaped. Stale bit: "This line may be outdated by speculative work." On violation or commit: clear. [Line fields: SL, SM, Stale, Valid, LRU, Tag, Data]
82 L1 to L2 communication L2 sees all stores (write through) L2 sees first load of an epoch NotifySL message L2 can track data dependences!
83 L1 Changes Summary Add three bits to each line: SL, SM, Stale. Modify tag match to recognize the bits. Add queue of NotifySL requests. [Line fields: SL, SM, Stale, Valid, LRU, Tag, Data] A sketch of the per-line metadata follows.
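A sketch of the per-line L1 metadata described above, in C. Field widths and the 64-byte line size are illustrative assumptions; a real tag/data layout depends on the cache geometry:

/* Sketch: an L1 line with the three added TLS bits. */
struct l1_line {
    unsigned sl    : 1;   /* speculatively loaded; clear on violation/commit */
    unsigned sm    : 1;   /* speculatively modified; commit: clear,
                             violation: treat the line as invalid */
    unsigned stale : 1;   /* may be outdated by speculative work;
                             clear on violation or commit */
    unsigned valid : 1;
    unsigned lru   : 2;   /* replacement state (width illustrative) */
    unsigned long tag;
    unsigned char data[64];
};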
84 L2 Cache Line [Line fields: fine-grained SM, Valid, Dirty, Exclusive, LRU, Tag, Data, plus per-CPU SL/SM bits (CPU1, CPU2, …)] A cache line can be modified by one CPU and loaded by multiple CPUs.
85 Cache Line Conflicts Three classes of conflict: Epoch 2 stores, epoch 1 loads: need the "old" version to load. Epoch 1 stores, epoch 2 stores: need to keep the changes separate. Epoch 1 loads, epoch 2 stores: need to be able to discard the line on violation. Need a way of storing multiple conflicting versions in the cache.
86 Cache line replication On conflict, replicate line Split line into two copies Divide SM and SL bits at split point Divide directory bits at split point
87 Replication Problems Complicates line lookup Need to find all replicas and select “best” Best == most recent replica Change management On write, update all later copies Also need to find all more speculative replicas to check for violations On commit must get rid of stale lines Invalidation Required Buffer (IRB)
88 Victim Cache How do you deal with a full cache set? Use a victim cache: holds evicted lines without losing the SM & SL bits. Must be fast: every cache lookup needs to know: Do I have the "best" replica of this line? (critical path) Do I cause a violation? (not on critical path)
89 Summary of Hardware Support Sub-epochs Violations hurt less! Shared cache TLS support Faster communication More room to store state RAOs Don’t speculate on known operations Reduces amount of speculative state
90 Summary of Hardware Changes Sub-epochs Checkpoint register state Needs replicas in cache Shared cache TLS support Speculative L1 Replication in L1 and L2 Speculative victim cache Invalidation Required Buffer RAOs Suspend/resume speculation Mutexes “Undo list”
91 TLS Execution [Animation: four CPUs with private L1 caches over a shared L2; stores to *p and *q propagate by invalidation, and the early load of *p raises an R2 Violation!]
95 Problems with Old Cache Design Database epochs are large L1 cache not large enough Sub-epochs add more state L1 cache not associative enough Database epochs communicate L1 cache only communicates committed data
96 Intro Summary TLS makes intra-transaction parallelism easy. Divide transaction into epochs. Hardware support: detect violations, restart to recover, sub-epochs mitigate the penalty, buffer state. New process: modify software to avoid violations and improve performance.
97 The Many Faces of Ogg [Cartoon: many Oggs thinking "Money! pwhile() {}" and "while(too_slow) make_faster();"] Must tune reorder buffer…
98 Must tune reorder buffer… The Many Faces of Ogg [Cartoon: Oggs thinking "Duh… pwhile() {}", one thinking "Money! pwhile() {}", and "while(too_slow) make_faster();"]
99 Removing Bottlenecks Three general techniques: Partition data structures (malloc). Postpone operations until non-speculative (latches and locks, log entries). Handle speculation manually (buffer pool).
100 Bottlenecks Encountered Buffer pool Latches & locks Malloc/free Cursor queues Error checks False sharing B-tree performance optimization Log entries
101 Must tune reorder buffer… The Many Faces of Ogg [Cartoon: repeat of the earlier Ogg slide]
102 Performance on 4 CPUs [Charts: unmodified benchmark vs. modified benchmark]
103 Incremental Parallelization [Chart: performance over time on 4 CPUs as bottlenecks are removed]
104 Scaling [Charts: unmodified benchmark vs. modified benchmark as CPUs scale]
105 Parallelization is Hard [Chart: performance improvement vs. programmer effort; hand parallelization is high effort, a parallelizing compiler gives low improvement, tuning sits in between, and "what we want" is low effort with high improvement]
106 Case Study: New Order (TPC-C) Begin transaction { } End transaction
107 Case Study: New Order (TPC-C) Begin transaction { Read customer info Read & increment order # Create new order } End transaction
108 Case Study: New Order (TPC-C) Begin transaction { Read customer info Read & increment order # Create new order For each item in order { Get item info Decrement count in stock Record order info } } End transaction
109 Case Study: New Order (TPC-C) Begin transaction { Read customer info Read & increment order # Create new order For each item in order { Get item info Decrement count in stock Record order info } } End transaction 80% of transaction execution time
111 The Many Faces of Ogg [Cartoon: repeat of the earlier Ogg slide] Must tune reorder buffer…
Step 2: Changing the Software
113 No problem! Loop is easy to parallelize using TLS! Not really: calls into the DBMS invoke complex operations, so Ogg needs to do some work. Many operations in the DBMS are parallel, but not written with TLS in mind!
114 Resource Management Mutexes acquired and released. Locks locked and unlocked. Cursors pushed and popped from a free stack. Memory allocated and freed. Buffer pool entries acquired and released.
115 Mutexes: Deadlock? Problem: Re-ordered acquire/release operations! Possibly introduced deadlock? Solutions: Avoidance: Static acquire order Recovery: Detect deadlock and violate
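A sketch of the avoidance option: one common static acquire order is ascending address order. This reuses the tls_mutex type and tls_acquire_mutex() seen in the get_page() wrapper; the pairwise helper is a hypothetical illustration:

#include <stdint.h>

typedef struct tls_mutex tls_mutex;   /* as used in the get_page() wrapper */
void tls_acquire_mutex(tls_mutex *m);

/* Sketch: if every code path acquires mutexes in ascending address
 * order, re-ordered acquire/release operations cannot form a cycle,
 * so no deadlock is introduced. */
void tls_acquire_two(tls_mutex *a, tls_mutex *b) {
    if ((uintptr_t)a > (uintptr_t)b) {   /* sort the pair by address */
        tls_mutex *tmp = a;
        a = b;
        b = tmp;
    }
    tls_acquire_mutex(a);
    tls_acquire_mutex(b);
}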
116 Locks Like mutexes, but: Allows multiple readers No memory overhead when not held Often held for much longer Treat similarly to mutexes
117 Cursors Used for traversing B-trees Pre-allocated, kept in pools
118–121 Maintaining Cursor Pool [Animation over four slides: epochs Get/Use/Release cursors from a shared free list (head); when two epochs release to and get from the same head, TLS raises a Violation!]
122 Parallelizing Cursor Pool Use per-CPU pools: modify code so each CPU gets its own pool. No sharing == no violations! Requires cpuid() instruction. [Diagram: two Get/Use/Release chains, each with its own head] A sketch follows.
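A sketch of the per-CPU pool in C. The cursor_t layout, NUM_CPUS, and the cpuid() wrapper are illustrative assumptions (the slide only assumes a cpuid() instruction); pools are assumed pre-filled, as cursor pools are in the talk:

#define NUM_CPUS 4

int cpuid(void);  /* hypothetical wrapper: this CPU's index, 0..NUM_CPUS-1 */

typedef struct cursor {
    struct cursor *next;      /* free-list link */
    /* ... b-tree traversal state ... */
} cursor_t;

static cursor_t *pool_head[NUM_CPUS];  /* one free list per CPU */

cursor_t *cursor_get(void) {
    cursor_t **head = &pool_head[cpuid()];
    cursor_t *c = *head;      /* pop from this CPU's pool (assumed non-empty) */
    *head = c->next;          /* no cross-CPU sharing == no violations */
    return c;
}

void cursor_release(cursor_t *c) {
    cursor_t **head = &pool_head[cpuid()];
    c->next = *head;          /* push back onto this CPU's pool */
    *head = c;
}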
123 Memory Allocation Problem: malloc() metadata causes dependences Solutions: Per-cpu memory pools Parallelized free list
124 The Log Append records to a global log. Appending causes a dependence: the global log sequence number (LSN) can't be parallelized. Instead: generate log records in buffers, assign LSNs when homefree. A sketch follows.
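A sketch of deferred LSN assignment. The lsn_t type, buffer sizes, and log_append() are hypothetical; the point is that the serializing next_lsn++ only runs once the epoch is homefree:

#include <stddef.h>
#include <string.h>

#define MAX_RECS    32
#define MAX_REC_LEN 256

typedef unsigned long lsn_t;

struct log_rec { lsn_t lsn; size_t len; char body[MAX_REC_LEN]; };

void log_append(struct log_rec *rec);       /* hypothetical real append */

static struct log_rec epoch_log[MAX_RECS];  /* per-epoch buffer */
static int n_recs;
static lsn_t next_lsn;                      /* the global serializing counter */

/* While speculative: buffer the record, leave the LSN unassigned. */
void log_put_speculative(const void *body, size_t len) {
    epoch_log[n_recs].len = len;
    memcpy(epoch_log[n_recs].body, body, len);
    n_recs++;
}

/* When homefree: assign LSNs in commit order and append for real. */
void log_flush_homefree(void) {
    for (int i = 0; i < n_recs; i++) {
        epoch_log[i].lsn = next_lsn++;
        log_append(&epoch_log[i]);
    }
    n_recs = 0;
}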
125 B-Trees Leaf pages contain free space counts. Inserts of random records: o.k. Inserting adjacent records: dependence on decrementing the count. Page splits: infrequent.
126 Other Dependences Statistics gathering Error checks False sharing
127 Related Work Lots of work in TLS: Multiscalar (Wisconsin) Hydra (Stanford) IACOMA (Illinois) RAW (MIT) Hand parallelizing using TLS: Manohar Prabhu and Kunle Olukotun (PPoPP’03)
Any questions?
129 Why is this a problem? B-tree insertion into ORDERLINE table Key is ol_n DBMS does not know that keys will be sequential Each insert usually updates the same btree page for(ol_n=0; ol_n<15; ol_n++) { INSERT into ORDERLINE (ol_n, ol_item, ol_cnt) VALUES (:ol_n, :ol_item, :ol_cnt); }
130 Sequential Btree Inserts [Diagram: successive inserts land in the same leaf page, decrementing its free-space count (4 free → 3 free → 2 free → …)]
131 Improvement: SM Versioning Blow away all SM lines on a violation? May be bad! Instead: On primary violation: Only invalidate locally modified SM lines On secondary violation: Invalidate all SM lines Needs one more bit: LocalSM May decrease number of misses to L2 on violation
132 Outline Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences Aggressive update propagation (implicit forwarding) Sub-epochs Results and analysis
134 Tolerating dependences Aggressive update propagation Get for free! Sub-epochs Periodically checkpoint epochs Every N instructions? Picking N may be interesting Perhaps checkpoints could be set before the location of previous violations?
135 Outline Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences Aggressive update propagation (implicit forwarding) Sub-epochs Results and analysis
136 Why not faster? Possible reasons: Idle cpus RAO mutexes Violations Cache effects Data dependences
137 Why not faster? Possible reasons: Idle CPUs: 9 epochs/region average; two bundles of four and one of one; ¼ of CPU cycles wasted! RAO mutexes Violations Cache effects Data dependences
138 Why not faster? Possible reasons: Idle CPUs RAO mutexes: not implemented yet. Oops! Violations Cache effects Data dependences
139 Why not faster? Possible reasons: Idle CPUs RAO mutexes Violations: 21/969 epochs violated; distance 1; "magic synchronized"; 2.2M cycles (over 4 CPUs), about 1.5% Cache effects Data dependences
140 Why not faster? Possible reasons: Idle cpus RAO mutexes Violations Cache effects Deserves its own slide. Data dependences
141 Cache effects of speculation Only 20% of references are speculative! Speculative references have a small impact on the non-speculative hit rate (<1%). Speculative refs miss a lot in L1: 9-15% for reads, 2-6% for writes. L2 saw a HUGE increase in traffic: 152k refs to 3474k refs. Speculative/non-speculative lines are thrashing between the L1s.
142 Why not faster? Possible reasons: Idle CPUs RAO mutexes Violations Cache effects Data dependences: oh yeah! B-tree item count. Split up the b-tree insert into alloc and write? Do the alloc as an RAO. Needs more thought.
143 L2 Cache Line [Line fields per set (Set 1, Set 2): fine-grained SM, Valid, Dirty, Exclusive, LRU, Tag, Data, plus per-CPU SL/SM bits (CPU1, CPU2)]
144 Why are you here? Want faster database systems Have funky new hardware – Thread Level Speculation (TLS) How can we apply TLS to database systems? Side question: Is this a VLDB or an ASPLOS talk?
145 How? Divide transaction into TLS-threads Run TLS-threads in parallel; maintain sequential semantics Profit!
146 Why parallelize transactions? Decrease transaction latency Increase concurrency while avoiding concurrency control bottleneck A.k.a.: use more CPUs, same # of xactions The obvious: Database performance matters
147 Shopping List What do we need? (research scope) Cheap hardware Thread Level Speculation (TLS) Minor changes allowed. Important database application TPC-C Almost no changes allowed! Modular database system BerkeleyDB Some changes allowed.
148 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
150 What’s new? Database operations are large and complex: large TLS-threads, lots of dependences, difficult to analyze. Want: programmer optimization effort = faster program.
151 Hardware changes summary Must tolerate dependences Prediction? Implicit forwarding? May need larger caches May need larger associativity
152 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
153 Parallelization Strategy 1. Pick a benchmark 2. Parallelize a loop 3. Analyze dependences 4. Optimize away dependences 5. Evaluate performance 6. If not satisfied, goto 3
154 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Resource management The log B-trees False sharing Results Conclusions
155 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
156 Results Viola simulator: single CPI, perfect violation prediction, no memory system, 4 CPUs, exhaustive dependence tracking. Currently working on an out-of-order superscalar simulation (cello). 10-transaction warm-up; measure 100 transactions.
157 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions
158 Conclusions TLS can improve transaction latency Violation predictors important Iff dependences must be tolerated TLS makes hand parallelizing easier
159 Improving Database Performance How to improve performance: Parallelize transaction Increase number of concurrent transactions Both of these require independence of database operations!
160 Case Study: New Order (TPC-C) Begin transaction { } End transaction
161 Case Study: New Order (TPC-C) Begin transaction { Read customer info ( customer, warehouse ) Read & increment order # ( district ) Create new order ( orders, neworder ) } End transaction
162 Case Study: New Order (TPC-C) Begin transaction { Read customer info ( customer, warehouse ) Read & increment order # ( district ) Create new order ( orders, neworder ) For each item in order { Get item info ( item ) Decrement count in stock ( stock ) Record order info ( orderline ) } } End transaction
163 Case Study: New Order (TPC-C) Begin transaction { Read customer info ( customer, warehouse ) Read & increment order # ( district ) Create new order ( orders, neworder ) For each item in order { Get item info ( item ) Decrement count in stock ( stock ) Record order info ( orderline ) } } End transaction Parallelize this loop
165 Implementing on a Real DB Using BerkeleyDB. "Table" == "Database". Give the database an arbitrary key, it will return arbitrary data (bytes). Use structs for keys and rows. Database provides ACID through: transactions, locking (page level), storage management. Provides indexing using b-trees. A sketch follows.
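A sketch of the struct-key/struct-row pattern against the real BerkeleyDB C API (DB->put with DBT key/data pairs). The stock_key/stock_row field choices are illustrative, not the talk's exact schema; the NULL transaction handle is for brevity:

#include <string.h>
#include <db.h>

struct stock_key { int s_w_id; int s_i_id; };        /* illustrative fields */
struct stock_row { int s_quantity; /* ... other columns ... */ };

int stock_put(DB *dbp, struct stock_key *k, struct stock_row *r) {
    DBT key, data;
    memset(&key, 0, sizeof key);   /* BerkeleyDB requires zeroed DBTs */
    memset(&data, 0, sizeof data);
    key.data  = k;  key.size  = sizeof *k;   /* key: raw struct bytes */
    data.data = r;  data.size = sizeof *r;   /* row: raw struct bytes */
    return dbp->put(dbp, NULL, &key, &data, 0);
}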
166 Parallelizing a Transaction For each item in order { Get item info ( item ) Decrement count in stock ( stock ) Record order info ( order line ) }
167 Parallelizing a Transaction For each item in order { Get item info ( item ) Decrement count in stock ( stock ) Record order info ( order line ) } Each of these queries: get cursor from pool; use cursor to traverse b-tree; find row, lock page for row; release cursor to pool.
168–171 Maintaining Cursor Pool [Animation over four slides, repeating the earlier cursor-pool sequence: concurrent Get/Use/Release on the shared head causes a Violation!]
172 Parallelizing Cursor Pool 1 Use per-CPU pools: modify code so each CPU gets its own pool. No sharing == no violations! Requires cpuid() instruction. [Diagram: two per-CPU Get/Use/Release chains]
173 Parallelizing Cursor Pool 2 Dequeue and enqueue: atomic and unordered. Delay enqueue until end of thread. Forces separate pools. Avoids modification of the data struct. [Diagram: Get/Use with the Release deferred]
174 Parallelizing Cursor Pool 3 Atomic unordered dequeue & enqueue. Cursor struct is "TLS unordered": the struct is defined as a byte range in memory. [Diagram: several epochs sharing one pool head]
175 Parallelizing Cursor Pool 4 Mutex protect dequeue & enqueue; declare pointer to cursor struct to be “TLS unordered” Any access through pointer does not have TLS applied Pointer is tainted, any copies of it keep this property
176 Problems with 3 & 4 What exactly is the boundary of a structure? How do you express the concept of object in a loosely-typed language like C? A byte range or a pointer is only an approximation. Dynamically allocated sub-components?
177 Mutexes in a TLS world Two types of threads: “real” threads TLS threads Two types of mutexes: Inter-real-thread Inter-TLS-thread
178 Inter-real-thread Mutexes Acquire == get mutex for all TLS threads Release == release for current TLS thread May still be held by another TLS thread!
179 Inter-TLS-thread Mutexes Should never interact between two real threads. Implies no TLS ordering between TLS threads while the mutex is held. But what to do on a violation? Can't just throw away changes to memory; must undo operations performed in the critical section.
180 Parallelizing Databases using TLS Split transactions into threads Threads created are large 60k+ instructions 16kB of speculative state More dependences between threads How do we design a machine which can handle these large threads?
181 The “Old” Way [Diagram: CPUs with private L1 caches holding speculative state, over L2/L3 caches and the memory system holding committed state]
182 The “Old” Way Advantages Each epoch has its own L1 cache Epoch state does not intermix Disadvantages L1 cache is too small! Full cache == dead meat No shared speculative memory
183 The “New” Way L2 cache is huge! State of the art in caches, Power5: 1.92MB 10-way L2 32kB 4-way L1 Shared speculative memory “for free” Keeps TLS logic off of the critical path
184 TLS Shared L2 Design L1: write-through, write-no-allocate [Culler, Singh & Gupta 99]. Easy to understand and reason about; writes visible to L2 simplifies shared speculative memory. L2 cache: shared cache architecture with replication. Rest of memory: distributed TLS coherence.
185 TLS Shared L2 Design [Diagram: "real" speculative state lives in the shared L2 caches; L1s hold cached speculative state; L3 and the memory system sit below] Explain from the top down.
Part II: Dealing with dependences
188 Predictor Design How do you design a predictor that: Identifies violating loads Identifies the last store that causes them Only triggers when they cause a problem Has very very high accuracy ???
189 Sub-epoch design Like checkpointing Leave “holes” in epoch # space Every 5k instructions start a new epoch Uses more cache to buffer changes More strain on associativity/victim cache Uses more epoch contexts
190 Summary Supporting large epochs needs: Buffer state in L2 instead of L1 Shared speculative memory Replication Victim cache Sub-epochs
Any questions?