
1 Optimistic Intra-Transaction Parallelism using Thread Level Speculation. Chris Colohan (Carnegie Mellon University), Anastassia Ailamaki (Carnegie Mellon University), J. Gregory Steffan (University of Toronto), and Todd C. Mowry (Carnegie Mellon University and Intel Research Pittsburgh)

2 2 Chip Multiprocessors are Here! 2 cores now, soon will have 4, 8, 16, or 32 Multiple threads per core How do we best use them? IBM Power 5 AMD Opteron Intel Yonah

3 3 Multi-Core Enhances Throughput (diagram: users submit transactions to the DBMS and database on a database server) Cores can run concurrent transactions and improve throughput

4 4 Using Multiple Cores (same diagram as the previous slide) Can multiple cores improve transaction latency?

5 5 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-query parallelism Used for long-running queries (decision support) Does not work for short queries Short queries dominate in commercial workloads

6 6 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-transaction parallelism Each thread spans multiple queries Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution…

7 7 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-transaction parallelism Breaks transaction into threads Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution… Thread Level Speculation (TLS) makes parallelization easier.

8 8 Thread Level Speculation (TLS) (diagram: a sequential sequence of stores and loads through *p and *q is divided into Epoch 1 and Epoch 2 and run in parallel)

9 9 Thread Level Speculation (TLS) (diagram: in the parallel execution, the load of *p in Epoch 2 runs before the store to *p in Epoch 1, causing a violation and a restart) Use epochs Detect violations Restart to recover Buffer state Oldest epoch: never restarts, no buffering Worst case: sequential Best case: fully parallel Data dependences limit performance.

10 10 A Coordinated Effort The transaction programmer chooses epoch boundaries; the DBMS programmer removes performance bottlenecks; the hardware developer adds TLS support to the architecture.

11 11 So what’s new? Intra-transaction parallelism Without changing the transactions With minor changes to the DBMS Without having to worry about locking Without introducing concurrency bugs With good performance Halve transaction latency on four cores

12 12 Related Work Optimistic Concurrency Control (Kung82) Sagas (Molina&Salem87) Transaction chopping (Shasha95)

13 13 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

14 14 Case Study: New Order (TPC-C) Only dependence is the quantity field Very unlikely to occur (1/100,000) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } 78% of transaction execution time

15 15 Case Study: New Order (TPC-C) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } becomes GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; TLS_foreach (item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; }

16 16 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

17 17 Dependences in DBMS Time

18 18 Dependences in DBMS Time Dependences serialize execution! Example: statistics gathering pages_pinned++ TLS maintains serial ordering of increments To remove, use per-CPU counters Performance tuning: Profile execution Remove bottleneck dependence Repeat
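
To make the per-CPU counter idea concrete, here is a minimal C sketch (not from the talk; cpu_id() and NCPUS are assumed helpers): each epoch increments only its own CPU's counter, so TLS no longer has to serialize the increments, and the totals are aggregated only when the statistic is read.

#define NCPUS 4

/* One counter per CPU, padded to its own cache line to avoid false sharing. */
struct padded_counter {
    unsigned long count;
    char pad[64 - sizeof(unsigned long)];
};

static struct padded_counter pages_pinned_ctr[NCPUS];

extern int cpu_id(void);   /* assumed helper: index of the CPU running this epoch */

static void inc_pages_pinned(void)
{
    pages_pinned_ctr[cpu_id()].count++;   /* touches only this CPU's slot: no cross-epoch dependence */
}

static unsigned long read_pages_pinned(void)
{
    unsigned long total = 0;
    for (int i = 0; i < NCPUS; i++)
        total += pages_pinned_ctr[i].count;   /* aggregate only when the statistic is read */
    return total;
}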

19 19 Buffer Pool Management CPU Buffer Pool get_page(5) ref: 1 put_page(5) ref: 0

20 20 Buffer Pool Management (diagram: two epochs each call get_page(5) and put_page(5) against the shared buffer pool, whose reference count goes to 0) TLS ensures the first epoch gets the page first. Who cares? TLS maintains the original load/store order, but sometimes this is not needed.

21 21 Buffer Pool Management (diagram: the speculative epoch escapes speculation to call get_page(5), recording put_page(5) as its undo operation) get_page is isolated: undoing it will not affect other transactions. It is undoable: there is an operation (put_page) which returns the system to its initial state. The recipe: escape speculation, invoke the operation, store the undo function, resume speculation.

22 22 Buffer Pool Management (diagram: the epoch also calls put_page(5) while speculative) put_page is not undoable, so it cannot simply be escaped.

23 23 Buffer Pool Management (diagram continued) Instead, delay put_page until the end of the epoch to avoid the dependence.
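
A minimal sketch of the delayed put_page idea, under stated assumptions: page_t and put_page() stand in for the DBMS's own declarations, and MAX_DEFERRED is an arbitrary bound. The epoch records the pages it would release and performs the releases only once it is non-speculative.

typedef struct page page_t;        /* assumed DBMS page type    */
extern void put_page(page_t *p);   /* assumed DBMS release call */

#define MAX_DEFERRED 64

struct deferred_puts {
    page_t *pages[MAX_DEFERRED];
    int     n;
};

/* Called in place of put_page() while the epoch is still speculative. */
static void defer_put_page(struct deferred_puts *d, page_t *p)
{
    d->pages[d->n++] = p;          /* just remember the page */
}

/* Called when the epoch commits (is homefree). */
static void flush_deferred_puts(struct deferred_puts *d)
{
    for (int i = 0; i < d->n; i++)
        put_page(d->pages[i]);     /* now safe: no speculation left to undo */
    d->n = 0;
}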

24 24 Removing Bottleneck Dependences We introduce three techniques: (1) delay operations until non-speculative (mutex and lock acquire and release; buffer pool, memory, and cursor release; log sequence number assignment); (2) escape speculation (buffer pool, memory, and cursor allocation); (3) traditional parallelization (memory allocation, cursor pool, error checks, false sharing).

25 25 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

26 26 Experimental Setup Detailed simulation Superscalar, out-of-order, 128 entry reorder buffer Memory hierarchy modeled in detail TPC-C transactions on BerkeleyDB In-core database Single user Single warehouse Measure interval of 100 transactions Measuring latency not throughput (diagram: four CPUs, each with a 32KB 4-way L1 cache, sharing a 2MB 4-way L2 cache and the rest of the memory system)

27 27 Optimizing the DBMS: New Order (chart: normalized execution time as optimizations are applied in sequence: Sequential, No Optimizations, Latches, Locks, Malloc/Free, Buffer Pool, Cursor Queue, Error Checks, False Sharing, B-Tree, Logging; each bar is broken into Idle CPU, Violated, Cache Miss, Busy) Callouts: cache misses increase; other CPUs not helping; can't optimize much more; 26% improvement.

28 28 Optimizing the DBMS: New Order (same chart as the previous slide) This process took me 30 days and <1200 lines of code.

29 29 Other TPC-C Transactions (chart: normalized time for New Order, Delivery, Stock Level, Payment, and Order Status, broken into Idle CPU, Failed, Cache Miss, Busy) 3 of the 5 transactions speed up by 46-66%.

30 30 Conclusions TLS makes intra-transaction parallelism practical Reasonable changes to transaction, DBMS, and hardware Halve transaction latency

31 31 Needed backup slides (not done yet) 2 proc. Results Shared caches may change how you want to extract parallelism! Just have lots of transactions: no sharing TLS may have more sharing

32 Any questions? For more information, see: www.colohan.com

33 Backup Slides Follow…

34 LATCHES

35 35 Latches Mutual exclusion between transactions Cause violations between epochs Read-test-write cycle  RAW Not needed between epochs TLS already provides mutual exclusion!

36 36 Latches: Aggressive Acquire (timeline: Acquire; latch_cnt++ …work… latch_cnt--; then for each later epoch, once homefree: latch_cnt++ …work… (enqueue release), commit work, latch_cnt--; finally Release) The result is one large critical section.

37 37 Latches: Lazy Acquire (timeline: the first epoch does Acquire …work… Release; each speculative epoch does (enqueue acquire) …work… (enqueue release), then once homefree: Acquire, commit work, Release) The result is small critical sections.
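
The following C sketch illustrates the lazy-acquire timeline above; it is not the paper's interface. latch_t, latch_acquire(), latch_release(), and tls_commit_epoch() are assumed primitives: while speculative, an epoch only records the latches it logically holds, and at commit it briefly acquires them around publishing its buffered work.

typedef struct latch latch_t;
extern void latch_acquire(latch_t *l);
extern void latch_release(latch_t *l);
extern void tls_commit_epoch(void);    /* assumed: publish this epoch's buffered writes */

#define MAX_LAZY 32

struct lazy_latches {
    latch_t *held[MAX_LAZY];
    int      n;
};

/* Speculative path: record the acquire instead of performing it. */
static void lazy_acquire(struct lazy_latches *q, latch_t *l)
{
    q->held[q->n++] = l;               /* TLS already isolates this epoch's writes */
}

/* Homefree path: short critical section around the commit itself. */
static void lazy_commit(struct lazy_latches *q)
{
    for (int i = 0; i < q->n; i++)
        latch_acquire(q->held[i]);     /* take all deferred latches */

    tls_commit_epoch();                /* make the buffered work visible */

    for (int i = q->n - 1; i >= 0; i--)
        latch_release(q->held[i]);     /* release in reverse order */
    q->n = 0;
}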

38 HARDWARE

39 39 TLS in Database Systems (diagram: non-database TLS epochs are small; database epochs spanning concurrent transactions are much larger) Large epochs mean more dependences, which must be tolerated, and more state, which requires bigger buffers.

40 40 Feedback Loop (cartoon: "I know this is parallel!" as for() { do_work(); } becomes par_for() { do_work(); }, followed by a repeated think / feed back cycle: "Must… Make… Faster")

41 41 Violations == Feedback (diagram: the sequential and parallel executions from before; the violation report names the conflicting addresses, e.g. 0x0FD8 → 0xFD20 and 0x0FC0 → 0xFC18) Must… Make… Faster

42 42 Eliminating Violations (diagram: after eliminating the *p dependence, a violation on *q remains; addresses 0x0FD8 → 0xFD20 and 0x0FC0 → 0xFC18) Optimization may make it slower?

43 43 Tolerating Violations: Sub-epochs (diagram: with sub-epochs, the violation on *q restarts only from the most recent sub-epoch instead of the whole epoch)

44 44 Sub-epochs Started periodically by hardware How many? When to start? Hardware implementation: just like epochs, use more epoch contexts No need to check violations between sub-epochs within an epoch (diagram repeated from the previous slide)

45 45 Old TLS Design (diagram: four CPUs with private write-back L1 caches over a shared L2 and the rest of the memory system, with invalidation-based coherence) Buffer speculative state in the write-back L1 cache Detect violations through invalidations Rest of system only sees committed data Restart by invalidating speculative lines Problems: the L1 cache is not large enough, and later epochs only get values on commit.

46 46 New Cache Design (diagram: the same four CPUs, now with TLS state kept in the shared L2) Buffer speculative and non-speculative state for all epochs in L2 Speculative writes immediately visible to L2 (and later epochs) Detect violations at lookup time Invalidation coherence between L2 caches Restart by invalidating speculative lines

47 47 New Features (diagram as before) Speculative state in L1 and L2 cache Cache line replication (versions) Data dependence tracking within cache Speculative victim cache (new!)

48 48 Scaling (charts: normalized time for Sequential, 2 CPUs, 4 CPUs, and 8 CPUs; the second chart is modified with 50-150 items/transaction; bars broken into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, Instruction Execution)

49 49 Evaluating a 4-CPU system (chart: normalized time, broken into Idle CPU, Failed Speculation, IUO Mutex Stall, Cache Miss, Instruction Execution, for five configurations) Sequential: original benchmark run on 1 CPU. TLS Seq: parallelized benchmark run on 1 CPU. No Sub-epoch: without sub-epoch support. Baseline: parallel execution. No Speculation: ignore violations (the Amdahl's Law limit).

50 50 Sub-epochs: How many/How big? Supporting more sub-epochs is better Spacing depends on location of violations Even spacing is good enough

51 51 Query Execution Actions taken by a query: Bring pages into buffer pool Acquire and release latches & locks Allocate/free memory Allocate/free and use cursors Use B-trees Generate log entries These generate violations.

52 52 Applying TLS 1. Parallelize loop 2. Run benchmark 3. Remove bottleneck 4. Go to 2 Time

53 53 Outline Hardware Developer Transaction Programmer DBMS Programmer

54 54 Violation Prediction (diagram: a predictor learns the *q dependence, so the dependent load waits until the earlier store is done instead of causing a violation, eliminating the R1/W2 dependence)

55 55 Violation Prediction (diagram as before) Predictor problems: large epochs → many predictions; failed prediction → violation; incorrect prediction → large stall. Two predictors required: last store and dependent load.

56 56 TLS Execution (diagram: the running example, stores to *p and *q, loads of *p and *q, and the resulting violation, mapped onto four CPUs, each with a private L1 cache over a shared L2 and the rest of the memory system)

57 57 TLS Execution (animation, slides 57-61: as the example executes, the shared L2 tracks per-CPU SM (speculatively modified) and SL (speculatively loaded) bits for lines *p, *q, *s, and *t; the bits fill in as each epoch performs its loads and stores, exposing the *p dependence that causes the violation)

62 62 Replication (animation, slides 62-64: when a line holds two epochs' changes it cannot simply be invalidated, so the line is replicated into per-epoch versions) Replication makes epochs independent and enables sub-epochs.

65 65 Sub-epochs (diagram: one epoch split into sub-epochs 1a, 1b, 1c, 1d, each with its own SM/SL context for lines *p and *q) Uses more epoch contexts Detection/buffering/rewind is "free" More replication: speculative victim cache

66 66 get_page() wrapper page_t *get_page_wrapper(pageid_t id) { static tls_mutex mut; page_t *ret; tls_escape_speculation(); check_get_arguments(id); tls_acquire_mutex(&mut); ret = get_page(id); tls_release_mutex(&mut); tls_on_violation(put, ret); tls_resume_speculation(); return ret; } → Wraps get_page()

67 67 get_page() wrapper (same code as above) → No violations while calling get_page()

68 68 get_page() wrapper (same code as above) → May get bad input data from speculative thread!

69 69 get_page() wrapper (same code as above) → Only one epoch per transaction at a time

70 70 get_page() wrapper (same code as above) → How to undo get_page()

71 71 get_page() wrapper (same code as above) The operation is isolated: undoing it does not cause cascading aborts. It is undoable: there is an easy way to return the system to its initial state. The same pattern can also be used for cursor management and malloc().
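
As slide 71 notes, the same wrapper idiom can be applied to malloc(). The sketch below reuses the tls_* calls from the get_page() wrapper above; the wrapper body itself is illustrative rather than code from the talk.

#include <stdlib.h>

void *malloc_wrapper(size_t size)
{
    static tls_mutex mut;
    void *ret;

    tls_escape_speculation();      /* allocator metadata must not be buffered or undone */
    tls_acquire_mutex(&mut);       /* one epoch in the allocator at a time */
    ret = malloc(size);
    tls_release_mutex(&mut);

    tls_on_violation(free, ret);   /* undo: free the block if this epoch is violated and restarts */
    tls_resume_speculation();
    return ret;
}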

72 72 TPC-C Benchmark (diagram: the Company contains Warehouse 1 through Warehouse W; each warehouse contains District 1 through District 10; each district contains Cust 1 through Cust 3k)

73 73 TPC-C Benchmark (schema diagram with table sizes: Warehouse W, District W*10, Customer W*30k, Order W*30k+, Order Line W*300k+, New Order W*9k+, History W*30k+, Stock W*100k, Item 100k; the remaining numbers, 10, 3k, 1+, 0-1, 5-15, 3+W, 100k, are the relationship cardinalities between them)

74 74 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... = hash[19];... hash[21] =... = hash[33];... hash[30] =... = hash[10];... hash[25] =...

75 75 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... = hash[19];... hash[21] =... = hash[33];... hash[30] =... = hash[10];... hash[25] =... Processor A, Processor B, Processor C, Processor D Thread 1, Thread 2, Thread 3, Thread 4

76 76 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... = hash[19];... hash[21] =... = hash[33];... hash[30] =... = hash[10];... hash[25] =... Processor A, Processor B, Processor C, Processor D Thread 1, Thread 2, Thread 3, Thread 4 Violation!

77 77 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... attempt_commit() = hash[19];... hash[21] =... attempt_commit() = hash[33];... hash[30] =... attempt_commit() Processor A, Processor B, Processor C, Processor D Thread 1, Thread 2, Thread 3, Thread 4 Violation! (thread 4's commit fails) = hash[10];... hash[25] =... attempt_commit()

78 78 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... attempt_commit() = hash[19];... hash[21] =... attempt_commit() = hash[33];... hash[30] =... attempt_commit() = hash[10];... hash[25] =... attempt_commit() Processor A, Processor B, Processor C, Processor D Thread 1, Thread 2, Thread 3, Thread 4 Violation! (fails) Redo = hash[10];... hash[25] =... attempt_commit() Thread 4

79 79 TLS Hardware Design What’s new? Large threads Epochs will communicate Complex control flow Huge legacy code base How does hardware change? Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences Aggressive update propagation (implicit forwarding) Sub-epochs

80 80 L1 Cache Line SL bit: "L2 cache knows this line has been speculatively loaded"; cleared on violation or commit. SM bit: "This line contains speculative changes"; cleared on commit; on violation, SM → Invalid. Otherwise, just like a normal cache. (fields: SL, SM, Valid, LRU, Tag, Data)

81 81 Escaping Speculation A speculative epoch wants to make a system-visible change! Ignore SM lines while escaped. Stale bit: "This line may be outdated by speculative work."; cleared on violation or commit. (fields: SL, SM, Valid, LRU, Tag, Data, Stale)

82 82 L1 to L2 communication L2 sees all stores (write through) L2 sees first load of an epoch NotifySL message → L2 can track data dependences!

83 83 L1 Changes Summary Add three bits to each line: SL, SM, Stale. Modify tag match to recognize the bits. Add a queue of NotifySL requests. (fields: SL, SM, Valid, LRU, Tag, Data, Stale)
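
As an illustration only, here is one way a simulator might lay out the per-line L1 state described on slides 80-83; the field names and widths are assumptions, and only the SL/SM/Stale semantics come from the slides.

#include <stdint.h>

#define LINE_SIZE 64

struct l1_line {
    uint64_t tag;
    uint8_t  valid : 1;
    uint8_t  sl    : 1;    /* speculatively loaded (L2 has been notified) */
    uint8_t  sm    : 1;    /* contains speculative modifications          */
    uint8_t  stale : 1;    /* may be outdated by speculative work         */
    uint8_t  lru   : 2;    /* replacement state for a 4-way set           */
    uint8_t  data[LINE_SIZE];
};

/* On commit: speculative state becomes architectural. */
static void line_commit(struct l1_line *l)
{
    l->sl = 0;
    l->sm = 0;
    l->stale = 0;
}

/* On violation: speculatively modified lines are discarded (SM -> Invalid). */
static void line_violate(struct l1_line *l)
{
    if (l->sm)
        l->valid = 0;
    l->sl = 0;
    l->sm = 0;
    l->stale = 0;
}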

84 84 L2 Cache Line (fields: per-CPU SL and SM bits for CPU1 and CPU2, fine-grained SM, Valid, Dirty, Exclusive, LRU, Tag, Data) A cache line can be modified by one CPU and loaded by multiple CPUs.

85 85 Cache Line Conflicts Three classes of conflict: epoch 2 stores and epoch 1 loads (need the "old" version to load); epoch 1 stores and epoch 2 stores (need to keep the changes separate); epoch 1 loads and epoch 2 stores (need to be able to discard the line on violation). We need a way of storing multiple conflicting versions in the cache.

86 86 Cache line replication On conflict, replicate line Split line into two copies Divide SM and SL bits at split point Divide directory bits at split point

87 87 Replication Problems Complicates line lookup Need to find all replicas and select “best” Best == most recent replica Change management On write, update all later copies Also need to find all more speculative replicas to check for violations On commit must get rid of stale lines Invalidation Required Buffer (IRB)

88 88 Victim Cache How do you deal with a full cache set? Use a victim cache that holds evicted lines without losing the SM & SL bits. It must be fast → every cache lookup needs to know: Do I have the "best" replica of this line? (critical path) Do I cause a violation? (not on critical path)

89 89 Summary of Hardware Support Sub-epochs Violations hurt less! Shared cache TLS support Faster communication More room to store state RAOs Don’t speculate on known operations Reduces amount of speculative state

90 90 Summary of Hardware Changes Sub-epochs Checkpoint register state Needs replicas in cache Shared cache TLS support Speculative L1 Replication in L1 and L2 Speculative victim cache Invalidation Required Buffer RAOs Suspend/resume speculation Mutexes “Undo list”

91 91 TLS Execution (diagram: the same example on four CPUs with private L1 caches, showing the *p and *q lines moving between the L1 caches via invalidations)

95 95 Problems with Old Cache Design Database epochs are large L1 cache not large enough Sub-epochs add more state L1 cache not associative enough Database epochs communicate L1 cache only communicates committed data

96 96 Intro Summary TLS makes intra-transaction parallelism easy Divide transaction into epochs Hardware support: Detect violations Restart to recover Sub-epochs mitigate penalty Buffer state New process: modify software → avoid violations → improve performance

97 97 The Many Faces of Ogg (cartoon: many panels captioned "Money! pwhile() {}" alongside while(too_slow) make_faster(); loops) Must tune reorder buffer…

98 98 The Many Faces of Ogg (cartoon: the same panels, now captioned "Duh… pwhile() {}") Must tune reorder buffer…

99 99 Removing Bottlenecks Three general techniques: partition data structures (malloc); postpone operations until non-speculative (latches and locks, log entries); handle speculation manually (buffer pool).

100 100 Bottlenecks Encountered Buffer pool Latches & Locks Malloc/free Cursor queues Error checks False sharing B-tree performance optimization Log entries

101 101 The Many Faces of Ogg (cartoon repeated from slide 98) Must tune reorder buffer…

102 102 Performance on 4 CPUs (charts: unmodified benchmark and modified benchmark)

103 103 Incremental Parallelization (chart: execution time on 4 CPUs as parallelization proceeds incrementally)

104 104 Scaling (charts: unmodified benchmark and modified benchmark)

105 105 Parallelization is Hard (chart: performance improvement vs. programmer effort, with points for Parallelizing Compiler, Hand Parallelization, Tuning, and "What we want")

106 106 Case Study: New Order (TPC-C) (slides 106-110 build up this transaction step by step) Begin transaction { Read customer info Read & increment order # Create new order For each item in order { Get item info Decrement count in stock Record order info } } End transaction The loop accounts for 80% of transaction execution time.

111 111 The Many Faces of Ogg (cartoon repeated from slide 98) Must tune reorder buffer…

112 Step 2: Changing the Software

113 113 No problem! Loop is easy to parallelize using TLS! Not really Calls into DBMS invoke complex operations Ogg needs to do some work Many operations in DBMS are parallel Not written with TLS in mind!

114 114 Resource Management Mutexes acquired and released Locks locked and unlocked Cursors pushed and popped from free stack Memory allocated and freed Buffer pool entries Acquired and released

115 115 Mutexes: Deadlock? Problem: re-ordered acquire/release operations → possibly introduced deadlock? Solutions: avoidance (static acquire order) or recovery (detect deadlock and violate).
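
A small C sketch of the static-acquire-order avoidance mentioned above: if every epoch takes its mutexes in one global order (here, by address), the reordered acquires cannot form a cycle and therefore cannot deadlock. pthread mutexes stand in for the DBMS's own mutexes; this is an illustration, not the system's actual code.

#include <stdint.h>
#include <pthread.h>

/* Acquire two mutexes in a canonical (address) order. */
static void acquire_two_ordered(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if ((uintptr_t)a > (uintptr_t)b) {   /* canonicalize the order */
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

/* Unlock order does not matter for deadlock freedom. */
static void release_two(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(b);
    pthread_mutex_unlock(a);
}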

116 116 Locks Like mutexes, but: Allows multiple readers No memory overhead when not held Often held for much longer Treat similarly to mutexes

117 117 Cursors Used for traversing B-trees Pre-allocated, kept in pools

118 118 Maintaining Cursor Pool (animation, slides 118-121: two epochs each Get, Use, and Release a cursor from a shared free list; the second epoch's Get of the list head conflicts with the first epoch's Release, causing a violation)

122 122 Parallelizing Cursor Pool Use per-CPU pools: modify the code so each CPU gets its own pool. No sharing == no violations! Requires cpuid() instruction. (diagram: two per-CPU free lists, each with its own head)
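
A minimal sketch of per-CPU cursor free lists, assuming the cpuid() primitive named on the slide; cursor_t, NCPUS, and the list layout are illustrative rather than BerkeleyDB's actual cursor pool.

#include <stddef.h>

#define NCPUS 4

typedef struct cursor cursor_t;
struct cursor {
    cursor_t *next;
    /* ... DBMS cursor state ... */
};

extern int cpuid(void);               /* assumed: index of the executing CPU */

static cursor_t *cursor_pool[NCPUS];  /* one free-list head per CPU */

static cursor_t *get_cursor(void)
{
    cursor_t **head = &cursor_pool[cpuid()];
    cursor_t *c = *head;
    if (c != NULL)
        *head = c->next;              /* pop from this CPU's list only */
    return c;                         /* NULL: caller must allocate a new one */
}

static void release_cursor(cursor_t *c)
{
    cursor_t **head = &cursor_pool[cpuid()];
    c->next = *head;                  /* push back onto this CPU's list */
    *head = c;
}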

123 123 Memory Allocation Problem: malloc() metadata causes dependences Solutions: Per-cpu memory pools Parallelized free list

124 124 The Log Append records to global log Appending causes dependence Can’t parallelize: Global log sequence number (LSN) Generate log records in buffers Assign LSNs when homefree
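
A sketch of the buffer-now, number-later idea for the log, under stated assumptions: log_rec_t, next_lsn, and log_append() are invented stand-ins. While speculative, an epoch appends records to a private buffer with no LSN; once it is homefree it assigns LSNs and publishes the records in epoch order.

#include <string.h>

#define LOG_BUF_MAX 128

typedef struct { unsigned long lsn; char payload[64]; } log_rec_t;

static unsigned long next_lsn;                 /* global LSN counter */
extern void log_append(const log_rec_t *r);    /* assumed: real log write */

struct epoch_log {
    log_rec_t recs[LOG_BUF_MAX];
    int       n;
};

/* Speculative path: touches only the epoch's private buffer, so no
 * cross-epoch dependence on the global LSN. */
static void epoch_log_record(struct epoch_log *el, const char *payload)
{
    log_rec_t *r = &el->recs[el->n++];
    r->lsn = 0;                                /* filled in at commit */
    strncpy(r->payload, payload, sizeof r->payload - 1);
    r->payload[sizeof r->payload - 1] = '\0';
}

/* Homefree path: assign LSNs and publish, in epoch order. */
static void epoch_log_commit(struct epoch_log *el)
{
    for (int i = 0; i < el->n; i++) {
        el->recs[i].lsn = next_lsn++;          /* safe: this epoch is oldest now */
        log_append(&el->recs[i]);
    }
    el->n = 0;
}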

125 125 B-Trees Leaf pages contain free space counts Inserts of random records – o.k. Inserting adjacent records Dependence on decrementing count Page splits Infrequent

126 126 Other Dependences Statistics gathering Error checks False sharing

127 127 Related Work Lots of work in TLS: Multiscalar (Wisconsin) Hydra (Stanford) IACOMA (Illinois) RAW (MIT) Hand parallelizing using TLS: Manohar Prabhu and Kunle Olukotun (PPoPP’03)

128 Any questions?

129 129 Why is this a problem? B-tree insertion into ORDERLINE table Key is ol_n DBMS does not know that keys will be sequential Each insert usually updates the same btree page for(ol_n=0; ol_n<15; ol_n++) { INSERT into ORDERLINE (ol_n, ol_item, ol_cnt) VALUES (:ol_n, :ol_item, :ol_cnt); }

130 130 Sequential Btree Inserts (diagram: successive inserts land in the same leaf page, whose free-space count drops from 4 free toward 1 free as items are added)

131 131 Improvement: SM Versioning Blow away all SM lines on a violation? May be bad! Instead: On primary violation: Only invalidate locally modified SM lines On secondary violation: Invalidate all SM lines Needs one more bit: LocalSM May decrease number of misses to L2 on violation

132 132 Outline Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences Aggressive update propagation (implicit forwarding) Sub-epochs Results and analysis

133 133 Outline Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences Aggressive update propagation (implicit forwarding) Sub-epochs Results and analysis

134 134 Tolerating dependences Aggressive update propagation Get for free! Sub-epochs Periodically checkpoint epochs Every N instructions? Picking N may be interesting Perhaps checkpoints could be set before the location of previous violations?

135 135 Outline Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences Aggressive update propagation (implicit forwarding) Sub-epochs Results and analysis

136 136 Why not faster? Possible reasons: Idle cpus RAO mutexes Violations Cache effects Data dependences

137 137 Why not faster? Possible reasons: Idle cpus 9 epochs/region average Two bundles of four and one of one ¼ of cpu cycles wasted! RAO mutexes Violations Cache effects Data dependences

138 138 Why not faster? Possible reasons: Idle cpus RAO mutexes Not implemented yet Oops! Violations Cache effects Data dependences

139 139 Why not faster? Possible reasons: Idle cpus RAO mutexes Violations 21/969 epochs violated Distance 1 “magic synchronized” 2.2Mcycles (over 4 cpus) About 1.5% Cache effects Data dependences

140 140 Why not faster? Possible reasons: Idle cpus RAO mutexes Violations Cache effects Deserves its own slide. Data dependences

141 141 Cache effects of speculation Only 20% of references are speculative! Speculative references have small impact on non-speculative hit rate (<1%) Speculative refs miss a lot in L1 9-15% for reads, 2-6% for writes L2 saw HUGE increase in traffic 152k refs to 3474k refs Spec/non spec lines are thrashing from L1s

142 142 Why not faster? Possible reasons: Idle cpus RAO mutexes Violations Cache effects Data dependences Oh yeah! Btree item count Split up btree insert? alloc and write Do alloc as RAO Needs more thought

143 143 L2 Cache Line (diagram: the L2 line fields from slide 84, per-CPU SL/SM, fine-grained SM, Valid, Dirty, Exclusive, LRU, Tag, Data, shown across Set 1 and Set 2)

144 144 Why are you here? Want faster database systems Have funky new hardware – Thread Level Speculation (TLS) How can we apply TLS to database systems? Side question: Is this a VLDB or an ASPLOS talk?

145 145 How? Divide transaction into TLS-threads Run TLS-threads in parallel; maintain sequential semantics Profit!

146 146 Why parallelize transactions? Decrease transaction latency Increase concurrency while avoiding concurrency control bottleneck A.k.a.: use more CPUs, same # of xactions The obvious: Database performance matters

147 147 Shopping List What do we need? (research scope) Cheap hardware Thread Level Speculation (TLS) Minor changes allowed. Important database application TPC-C Almost no changes allowed! Modular database system BerkeleyDB Some changes allowed.

148 148 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

149 149 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

150 150 What’s new? Database operations are: Large Complex Large TLS-threads Lots of dependences Difficult to analyze Want: Programmer optimization effort = faster program

151 151 Hardware changes summary Must tolerate dependences Prediction? Implicit forwarding? May need larger caches May need larger associativity

152 152 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

153 153 Parallelization Strategy 1. Pick a benchmark 2. Parallelize a loop 3. Analyze dependences 4. Optimize away dependences 5. Evaluate performance 6. If not satisfied, goto 3

154 154 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Resource management The log B-trees False sharing Results Conclusions

155 155 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

156 156 Results Viola simulator: Single CPI Perfect violation prediction No memory system 4 cpus Exhaustive dependence tracking Currently working on an out-of-order superscalar simulation (cello) 10 transaction warm-up Measure 100 transactions

157 157 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

158 158 Conclusions TLS can improve transaction latency Violation predictors important Iff dependences must be tolerated TLS makes hand parallelizing easier

159 159 Improving Database Performance How to improve performance: Parallelize transaction Increase number of concurrent transactions Both of these require independence of database operations!

160 160 Case Study: New Order (TPC-C) (slides 160-164 build up the transaction again, annotated with the tables it touches) Begin transaction { Read customer info (customer, warehouse) Read & increment order # (district) Create new order (orders, neworder) For each item in order { Get item info (item) Decrement count in stock (stock) Record order info (orderline) } } End transaction Parallelize this loop.

165 165 Implementing on a Real DB Using BerkeleyDB "Table" == "Database" Give the database any arbitrary key → it will return arbitrary data (bytes) Use structs for keys and rows Database provides ACID through: Transactions Locking (page level) Storage management Provides indexing using b-trees
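
To show the struct-as-key/row idiom concretely, here is a small sketch against the BerkeleyDB C API; the stock_key/stock_row layouts are invented for illustration, the DB/DBT calls are standard BerkeleyDB, and error handling is omitted.

#include <string.h>
#include <db.h>

struct stock_key { int w_id; int i_id; };
struct stock_row { int quantity; char dist_info[24]; };

int get_stock_quantity(DB *dbp, DB_TXN *txn, int w_id, int i_id, int *qty)
{
    struct stock_key k = { w_id, i_id };
    struct stock_row row;
    DBT key, data;
    int ret;

    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data = &k;                     /* the struct's bytes are the key */
    key.size = sizeof k;
    data.data = &row;                  /* copy the row into our struct   */
    data.ulen = sizeof row;
    data.flags = DB_DBT_USERMEM;

    ret = dbp->get(dbp, txn, &key, &data, 0);   /* b-tree lookup */
    if (ret == 0)
        *qty = row.quantity;
    return ret;
}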

166 166 Parallelizing a Transaction For each item in order { Get item info ( item ) Decrement count in stock ( stock ) Record order info ( order line ) }

167 167 Parallelizing a Transaction For each item in order { Get item info ( item ) Decrement count in stock ( stock ) Record order info ( order line ) } Get cursor from pool Use cursor to traverse b-tree Find row, lock page for row Release cursor to pool

168 168 Maintaining Cursor Pool (animation, slides 168-171, repeated from slides 118-121: two epochs each Get, Use, and Release a cursor from a shared free list; the second epoch's Get of the list head conflicts with the first epoch's Release, causing a violation)

172 172 Parallelizing Cursor Pool 1 Use per-CPU pools: modify the code so each CPU gets its own pool. No sharing == no violations! Requires cpuid() instruction. (diagram: two per-CPU free lists, each with its own head)

173 173 Parallelizing Cursor Pool 2 Dequeue and enqueue: atomic and unordered Delay enqueue until end of thread Forces separate pools Avoids modification of data struct Get Use Release head Get Use Release

174 174 Parallelizing Cursor Pool 3 Atomic unordered dequeue & enqueue. The cursor struct is "TLS unordered", defined as a byte range in memory. (diagram: several epochs each Get, Use, and Release against the shared head without ordering)

175 175 Parallelizing Cursor Pool 4 Mutex protect dequeue & enqueue; declare pointer to cursor struct to be “TLS unordered” Any access through pointer does not have TLS applied Pointer is tainted, any copies of it keep this property

176 176 Problems with 3 & 4 What exactly is the boundary of a structure? How do you express the concept of object in a loosely-typed language like C? A byte range or a pointer is only an approximation. Dynamically allocated sub-components?

177 177 Mutexes in a TLS world Two types of threads: “real” threads TLS threads Two types of mutexes: Inter-real-thread Inter-TLS-thread

178 178 Inter-real-thread Mutexes Acquire == get mutex for all TLS threads Release == release for current TLS thread May still be held by another TLS thread!

179 179 Inter-TLS-thread Mutexes Should never interact between two real threads Implies no TLS ordering between TLS threads while mutex is held But what do to on a violation? Can’t just throw away changes to memory Must undo operations performed in critical section

180 180 Parallelizing Databases using TLS Split transactions into threads Threads created are large 60k+ instructions 16kB of speculative state More dependences between threads How do we design a machine which can handle these large threads?

181 181 The “Old” Way P L1 $ L2 $ P L1 $ P P Committed state L3 $ P L1 $ L2 $ P L1 $ P P Memory System … Speculative state

182 182 The “Old” Way Advantages Each epoch has its own L1 cache Epoch state does not intermix Disadvantages L1 cache is too small! Full cache == dead meat No shared speculative memory

183 183 The “New” Way L2 cache is huge! State of the art in caches, Power5: 1.92MB 10-way L2 32kB 4-way L1 Shared speculative memory “for free” Keeps TLS logic off of the critical path

184 184 TLS Shared L2 Design L1 Write-through write-no-allocate [CullerSinghGupta99] Easy to understand and reason about Writes visible to L2 – simplifies shared speculative memory L2 cache: shared cache architecture with replication Rest of memory: distributed TLS coherence

185 185 TLS Shared L2 Design (diagram, slides 185-186: the shared L2 caches hold the "real" speculative state while the L1s hold cached speculative state; explained from the top down)

187 Part II: Dealing with dependences

188 188 Predictor Design How do you design a predictor that: Identifies violating loads Identifies the last store that causes them Only triggers when they cause a problem Has very very high accuracy ???

189 189 Sub-epoch design Like checkpointing Leave “holes” in epoch # space Every 5k instructions start a new epoch Uses more cache to buffer changes More strain on associativity/victim cache Uses more epoch contexts

190 190 Summary Supporting large epochs needs: Buffer state in L2 instead of L1 Shared speculative memory Replication Victim cache Sub-epochs

191 Any questions?

