1 Optimistic Intra-Transaction Parallelism on Chip-Multiprocessors Chris Colohan 1, Anastassia Ailamaki 1, J. Gregory Steffan 2 and Todd C. Mowry 1,3 1 Carnegie Mellon University 2 University of Toronto 3 Intel Research Pittsburgh

2 Chip Multiprocessors are Here! 2 cores now, soon 4, 8, 16, or 32, with multiple threads per core. How do we best use them? (pictured: IBM Power 5, AMD Opteron, Intel Yonah)

3 Multi-Core Enhances Throughput (figure: users submit transactions to the DBMS running on a database server) Cores can run concurrent transactions and improve throughput.

4 Multi-Core Enhances Throughput (same figure) Can multiple cores improve transaction latency?

5 Parallelizing transactions (example transaction: SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; }) Intra-query parallelism: used for long-running queries (decision support); does not work for short queries, and short queries dominate commercial workloads.

6 Parallelizing transactions (same example transaction) Intra-transaction parallelism: each thread spans multiple queries. Hard to add to existing systems! Need to change the interface, add latches and locks, and worry about correctness of parallel execution…

7 Parallelizing transactions (same example transaction) Intra-transaction parallelism breaks the transaction into threads. Hard to add to existing systems! Need to change the interface, add latches and locks, and worry about correctness of parallel execution… Thread Level Speculation (TLS) makes parallelization easier.

8 Thread Level Speculation (TLS) (figure: a sequential code fragment, *p=, *q=, =*p, =*q, is split in time into two parallel epochs, Epoch 1 and Epoch 2)

9 Thread Level Speculation (TLS) (figure: Epoch 2 reads *p before Epoch 1 writes it, a violation, so Epoch 2 restarts) Use epochs, detect violations, restart to recover, buffer state. Worst case: sequential. Best case: fully parallel. Data dependences limit performance.
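
To make the mechanism concrete, here is a minimal toy model in C (mine, not the paper's hardware): each epoch logs the addresses it speculatively reads, and when an earlier epoch writes an address that a later epoch has already read, the later epoch is marked violated and must restart. All names are illustrative.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MAX_READS 64

    /* Toy model of one epoch's speculative read set. */
    typedef struct {
        int         id;                  /* epoch number (program order) */
        const void *reads[MAX_READS];    /* addresses speculatively loaded */
        size_t      n_reads;
        bool        violated;            /* must restart from the beginning */
    } epoch_t;

    /* Record a speculative load. */
    static void epoch_load(epoch_t *e, const void *addr) {
        if (e->n_reads < MAX_READS)
            e->reads[e->n_reads++] = addr;
    }

    /* A store by epoch `writer` violates any *later* epoch that already
     * read the same address: that read consumed stale data. */
    static void epoch_store(epoch_t *writer, const void *addr,
                            epoch_t **all, size_t n) {
        for (size_t i = 0; i < n; i++) {
            epoch_t *e = all[i];
            if (e->id <= writer->id)
                continue;                /* earlier or same epoch: unaffected */
            for (size_t r = 0; r < e->n_reads; r++)
                if (e->reads[r] == addr)
                    e->violated = true;  /* restart to recover */
        }
    }

    int main(void) {
        int p = 0, q = 0;
        epoch_t e1 = { .id = 1 }, e2 = { .id = 2 };
        epoch_t *all[] = { &e1, &e2 };

        epoch_load(&e2, &p);             /* epoch 2 reads *p early ...       */
        epoch_store(&e1, &p, all, 2);    /* ... then epoch 1 writes *p: violation */
        epoch_store(&e1, &q, all, 2);    /* epoch 2 never read *q: harmless  */

        printf("epoch 2 violated: %s\n", e2.violated ? "yes" : "no");
        return 0;
    }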

10 A Coordinated Effort (figure: three layers; Transactions: TPC-C; DBMS: BerkeleyDB; Hardware: simulated machine)

11 A Coordinated Effort Transaction programmer: choose epoch boundaries. DBMS programmer: remove performance bottlenecks. Hardware developer: add TLS support to the architecture.

12 So what's new? Intra-transaction parallelism: without changing the transactions, with minor changes to the DBMS, without having to worry about locking, without introducing concurrency bugs, and with good performance: transaction latency is halved on four cores.

13 Related Work Optimistic Concurrency Control (Kung 82), Sagas (Molina & Salem 87), Transaction chopping (Shasha 95)

14 Outline: Introduction, Related work, Dividing transactions into epochs, Removing bottlenecks in the DBMS, Results, Conclusions

15 Case Study: New Order (TPC-C) The only dependence is on the quantity field, and it is very unlikely to occur (about 1 in 100,000). GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } (the item loop accounts for 78% of transaction execution time)

16 Case Study: New Order (TPC-C) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } becomes: GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; TLS_foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; }
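
The slides show the change at the SQL-like level only. Here is a minimal C-level sketch of the same idea, assuming a hypothetical TLS_foreach macro and a process_order_line() stand-in for the loop body (neither is BerkeleyDB or paper code):

    #include <stdio.h>

    /* Stand-in for the loop body (GET / UPDATE / INSERT for one item). */
    static void process_order_line(int item) {
        printf("processing order line for item %d\n", item);
    }

    /* Hypothetical macro: on TLS hardware it would also mark epoch
     * boundaries; without TLS it is just a plain sequential loop, so the
     * same source runs either way. */
    #define TLS_foreach(i, n) for (int i = 0; i < (n); i++)

    void new_order_items(int n_items) {
        TLS_foreach(i, n_items) {
            process_order_line(i);   /* one epoch per item */
        }
    }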

17 Outline: Introduction, Related work, Dividing transactions into epochs, Removing bottlenecks in the DBMS, Results, Conclusions

18 Dependences in DBMS (figure: dependences between epochs over time)

19 Dependences in DBMS Dependences serialize execution! Performance tuning: profile execution, remove the bottleneck dependence, repeat.

20 Buffer Pool Management (figure: a CPU calls get_page(5), which pins page 5 in the buffer pool, ref: 1; put_page(5) unpins it, ref: 0)

21 Buffer Pool Management (figure: two epochs each call get_page(5) and put_page(5); the shared reference count creates a dependence between them) TLS ensures the first epoch gets the page first. Who cares?

22 Buffer Pool Management (figure: the get_page(5) calls are marked as escaping speculation) For get_page: escape speculation, invoke the operation, store an undo function, resume speculation.

23 Buffer Pool Management (same figure) But put_page(5) is not undoable!

24 Buffer Pool Management Solution: delay put_page until the end of the epoch, avoiding the dependence.
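
A sketch of what "delay put_page until end of epoch" could look like in the wrapper style shown later (slide 64). The tls_on_commit() hook is hypothetical; the slides only show escape-speculation and a tls_on_violation() hook.

    /* Sketch only: defer the buffer-pool release so speculative epochs never
     * touch the shared reference count and thus never depend on each other
     * through it. tls_on_commit() is a hypothetical hook that runs the queued
     * call once this epoch becomes non-speculative (homefree). */
    typedef struct page page_t;

    void put_page(page_t *p);                          /* real buffer-pool release */
    void tls_on_commit(void (*fn)(void *), void *arg); /* hypothetical hook */

    static void put_page_at_commit(void *arg) {
        put_page((page_t *)arg);
    }

    void put_page_wrapper(page_t *p) {
        /* No shared-state update happens now; the reference is dropped only
         * when the epoch commits, so no inter-epoch dependence is created. */
        tls_on_commit(put_page_at_commit, p);
    }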

25 Removing Bottleneck Dependences We introduce three techniques. Delay operations until non-speculative: mutex and lock acquire and release; buffer pool, memory, and cursor release; log sequence number assignment. Escape speculation: buffer pool, memory, and cursor allocation. Traditional parallelization: memory allocation, cursor pool, error checks, false sharing.

26 Outline: Introduction, Related work, Dividing transactions into epochs, Removing bottlenecks in the DBMS, Results, Conclusions

27 Experimental Setup Detailed simulation: superscalar, out-of-order cores with a 128-entry reorder buffer; memory hierarchy modeled in detail (four CPUs, each with a 32KB 4-way L1 cache, sharing a 2MB 4-way L2 cache). TPC-C transactions on BerkeleyDB: in-core database, single user, single warehouse; measure an interval of 100 transactions; measuring latency, not throughput.

28 Optimizing the DBMS: New Order (chart: time normalized to Sequential, broken into Busy / Cache Miss / Violated / Idle CPU, as optimizations are applied in order: No Optimizations, Latches, Locks, Malloc/Free, Buffer Pool, Cursor Queue, Error Checks, False Sharing, B-Tree, Logging) Observations: cache misses increase; the other CPUs are not helping; can't optimize much more; 26% improvement.

29 Optimizing the DBMS: New Order (same chart) This process took me 30 days and fewer than 1200 lines of code.

30 Other TPC-C Transactions (chart: time, normalized, for New Order, Delivery, Stock Level, Payment, and Order Status, broken into Busy / Cache Miss / Failed / Idle CPU) 3 of the 5 transactions speed up by 46-66%.

31 Conclusions A new form of parallelism for databases and a tool for attacking transaction latency: intra-transaction parallelism without major changes to the DBMS. TLS can be applied to more than transactions. Transaction latency is halved using 4 CPUs.

32 Any questions? For more information, see: www.colohan.com

33 Backup Slides Follow…

34 TPC-C Transactions on 2 CPUs (chart: same format as the 4-CPU chart) 3 of the 5 transactions speed up by 30-44%.

35 LATCHES

36 Latches Latches provide mutual exclusion between transactions, but they cause violations between epochs: the read-test-write cycle is a read-after-write (RAW) dependence. They are not needed between epochs, since TLS already provides mutual exclusion!

37 Latches: Aggressive Acquire (timeline: Acquire; latch_cnt++, …work…, latch_cnt--; then for each later epoch, once homefree: latch_cnt++, …work…, (enqueue release), commit work, latch_cnt--; Release at the end) Result: a large critical section.

38 Latches: Lazy Acquire (timeline: the non-speculative epoch does Acquire, …work…, Release; each speculative epoch enqueues the acquire, does its work, enqueues the release, and then, once homefree: Acquire, commit work, Release) Result: small critical sections.
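
A sketch of how lazy acquire might look at the latch interface. The tls_is_speculative(), tls_before_commit(), and tls_after_commit() hooks are hypothetical names; the slides describe the behavior (enqueue acquire and release, then Acquire / commit work / Release once homefree) but not an API.

    #include <stdbool.h>

    typedef struct latch latch_t;
    void latch_acquire(latch_t *l);   /* real DBMS latch operations */
    void latch_release(latch_t *l);

    /* Hypothetical TLS runtime hooks, named here for illustration only:
     * - tls_is_speculative(): are we inside a speculative epoch?
     * - tls_before_commit(): run fn(arg) just before the epoch's buffered
     *   writes are made visible (so: Acquire, then commit work);
     * - tls_after_commit(): run fn(arg) just after (then Release). */
    bool tls_is_speculative(void);
    void tls_before_commit(void (*fn)(void *), void *arg);
    void tls_after_commit(void (*fn)(void *), void *arg);

    static void acquire_cb(void *arg) { latch_acquire((latch_t *)arg); }
    static void release_cb(void *arg) { latch_release((latch_t *)arg); }

    /* Lazy acquire: TLS already isolates epochs from each other, so the latch
     * is only needed for the brief window in which the buffered work commits. */
    void db_latch_acquire(latch_t *l) {
        if (tls_is_speculative())
            tls_before_commit(acquire_cb, l);   /* enqueue the acquire */
        else
            latch_acquire(l);
    }

    void db_latch_release(latch_t *l) {
        if (tls_is_speculative())
            tls_after_commit(release_cb, l);    /* enqueue the release */
        else
            latch_release(l);
    }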

39 HARDWARE

40 TLS in Database Systems (figure: non-database TLS has small epochs; database TLS has large epochs and concurrent transactions) Large epochs mean more dependences, which must be tolerated, and more state, which needs bigger buffers.

41 Feedback Loop I know this is parallel! for() { do_work(); } becomes par_for() { do_work(); } Must… make… faster. (figure: the programmer thinks, tries, and gets feedback, repeatedly)

42 Violations == Feedback (figure: the parallel execution from slide 9; each violation reports the addresses involved, e.g. 0x0FD8 → 0xFD20 and 0x0FC0 → 0xFC18) Must… make… faster.

43 Eliminating Violations (figure: eliminating the *p dependence still leaves a violation on *q) Might the optimization even make things slower?

44 Tolerating Violations: Sub-epochs (figure: with sub-epochs, a violation on *q restarts only the enclosing sub-epoch rather than the whole epoch)

45 Sub-epochs Started periodically by hardware. How many? When to start? Hardware implementation: just like epochs; use more epoch contexts; no need to check violations between sub-epochs within an epoch.

46 Old TLS Design (figure: four CPUs, each with an L1 cache, sharing an L2 cache) Buffer speculative state in the write-back L1 cache; detect violations through invalidations; the rest of the system only sees committed data; restart by invalidating speculative lines. Problems: the L1 cache is not large enough, and later epochs only get values on commit.

47 New Cache Design (figure: same four-CPU hierarchy) Buffer speculative and non-speculative state for all epochs in the L2; speculative writes are immediately visible to the L2 (and to later epochs); detect violations at lookup time; invalidation coherence between L2 caches; restart by invalidating speculative lines.

48 New Features Speculative state in the L1 and L2 caches; cache line replication (versions); data dependence tracking within the cache; a speculative victim cache.

49 Scaling (charts: time, normalized, for Sequential, 2 CPUs, 4 CPUs, and 8 CPUs; one chart for the standard benchmark and one modified with 50-150 items/transaction; time is broken into Instruction Execution / Cache Miss / IUO Mutex Stall / Failed Speculation / Idle CPU)

50 Evaluating a 4-CPU system (chart: time, normalized, for five configurations) Sequential: the original benchmark on 1 CPU. TLS Seq: the parallelized benchmark on 1 CPU. No Sub-epoch: without sub-epoch support. Baseline: parallel execution. No Speculation: ignore violations (the Amdahl's Law limit).

51 Sub-epochs: How many? How big? Supporting more sub-epochs is better; spacing depends on the location of violations; even spacing is good enough.

52 Query Execution Actions taken by a query: bring pages into the buffer pool; acquire and release latches and locks; allocate/free memory; allocate/free and use cursors; use B-trees; generate log entries. These are what generate violations.

53 Applying TLS 1. Parallelize the loop. 2. Run the benchmark. 3. Remove a bottleneck. 4. Go to 2.

54 Outline: Hardware Developer, Transaction Programmer, DBMS Programmer

55-59 TLS Execution (figures: step-by-step animation of cache state as the four CPUs execute their epochs; each cache line for *p, *q, *s, and *t carries valid, speculatively-modified (SM), and speculatively-loaded (SL) bits, and a read of *p by a later epoch followed by a write of *p by an earlier epoch raises a violation)

60-62 Replication (figures: a cache line written by two epochs is replicated into per-epoch versions) A line can't simply be invalidated if it contains two epochs' changes; replication makes epochs independent and enables sub-epochs.

63 Sub-epochs (figure: one epoch split across sub-epoch contexts 1a-1d) Sub-epochs use more epoch contexts, so detection/buffering/rewind is "free"; the extra replication calls for a speculative victim cache.

64 get_page() wrapper (this wraps get_page()):

    page_t *get_page_wrapper(pageid_t id) {
        static tls_mutex mut;
        page_t *ret;

        tls_escape_speculation();
        check_get_arguments(id);
        tls_acquire_mutex(&mut);
        ret = get_page(id);
        tls_release_mutex(&mut);
        tls_on_violation(put, ret);
        tls_resume_speculation();
        return ret;
    }

65 get_page() wrapper (same code as above): no violations can occur while calling get_page().

66 get_page() wrapper (same code as above): we may get bad input data from a speculative thread, hence check_get_arguments(id).

67 get_page() wrapper (same code as above): the tls_mutex ensures only one epoch per transaction is inside get_page() at a time.

68 get_page() wrapper (same code as above): tls_on_violation(put, ret) records how to undo get_page() if the epoch is violated.

69 get_page() wrapper (same code as above). The operation is isolated: undoing it does not cause cascading aborts. It is undoable: there is an easy way to return the system to its initial state. The same pattern can also be used for cursor management and malloc().
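
Since the slide says the same pattern also works for malloc(), here is a sketch of a malloc wrapper in that style, reusing the tls_* calls from the get_page() wrapper; the wrapper itself is illustrative, not from the slides.

    #include <stdlib.h>

    /* TLS runtime calls as used on slide 64; treated as given here. */
    void tls_escape_speculation(void);
    void tls_resume_speculation(void);
    void tls_on_violation(void (*undo)(void *), void *arg);

    /* Sketch: speculative allocation. malloc() itself cannot run speculatively
     * (its heap-metadata updates are not buffered or undoable by the TLS
     * hardware), so we escape, allocate for real, and register free() as the
     * undo action in case this epoch is later violated. */
    void *malloc_wrapper(size_t size) {
        void *ptr;

        tls_escape_speculation();        /* run the allocator non-speculatively */
        ptr = malloc(size);
        if (ptr != NULL)
            tls_on_violation(free, ptr); /* undo: give the memory back */
        tls_resume_speculation();

        return ptr;
    }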

70 Sequential Btree Inserts (figure: B-tree node occupancy, items versus free slots, as sequential inserts proceed)

