Optimistic Intra-Transaction Parallelism using Thread Level Speculation Chris Colohan 1, Anastassia Ailamaki 1, J. Gregory Steffan 2 and Todd C. Mowry.




Similar presentations
Optimistic Methods for Concurrency Control By : H.T. Kung & John T. Robinson Presenters: Munawer Saeed.

Faculty of Electrical Engineering, Eindhoven University of Technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren.
Anshul Kumar, CSE IITD CSL718 : VLIW - Software Driven ILP Hardware Support for Exposing ILP at Compile Time 3rd Apr, 2006.
CS492B Analysis of Concurrent Programs Lock Basics Jaehyuk Huh Computer Science, KAIST.
Multi-Level Caches Vittorio Zaccaria. Preview What you have seen: Data organization, Associativity, Cache size Policies -- how to manage the data once.
A KTEC Center of Excellence 1 Cooperative Caching for Chip Multiprocessors Jichuan Chang and Gurindar S. Sohi University of Wisconsin-Madison.
Lecture 12 Reduce Miss Penalty and Hit Time
Performance of Cache Memory
ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto
CS4432: Database Systems II Buffer Manager 1. 2 Covered in week 1.
Transaction.
Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.
More on Thread Level Speculation Anthony Gitter Dafna Shahaf Or Sheffet.
PARALLEL PROGRAMMING with TRANSACTIONAL MEMORY Pratibha Kona.
Optimistic Intra-Transaction Parallelism on Chip-Multiprocessors Chris Colohan 1, Anastassia Ailamaki 1, J. Gregory Steffan 2 and Todd C. Mowry 1,3 1 Carnegie.
1 Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.
1 MetaTM/TxLinux: Transactional Memory For An Operating System Hany E. Ramadan, Christopher J. Rossbach, Donald E. Porter and Owen S. Hofmann Presenter:
Transaction Management and Concurrency Control
1 Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, “lazy” implementation.
1 Lecture 23: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
Memory Management 2010.
CS 300 – Lecture 20 Intro to Computer Architecture / Assembly Language Caches.
Tolerating Dependences Between Large Speculative Threads Via Sub-Threads Chris Colohan 1,2, Anastassia Ailamaki 2, J. Gregory Steffan 3 and Todd C. Mowry.
Multiscalar processors
1  Caches load multiple bytes per block to take advantage of spatial locality  If cache block size = 2 n bytes, conceptually split memory into 2 n -byte.
Computer Organization and Architecture
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Nov 9, 2005 Topic: Caches (contd.)
Reducing Cache Misses 5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main.
Highly Available ACID Memory Vijayshankar Raman. Introduction §Why ACID memory? l non-database apps: want updates to critical data to be atomic and persistent.
Optimistic Intra-Transaction Parallelism on Chip-Multiprocessors Chris Colohan 1, Anastassia Ailamaki 1, J. Gregory Steffan 2 and Todd C. Mowry 1,3 1 Carnegie.
10/18: Lecture topics Memory Hierarchy –Why it works: Locality –Levels in the hierarchy Cache access –Mapping strategies Cache performance Replacement.
1 Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5)
Introduction to Database Systems1. 2 Basic Definitions Mini-world Some part of the real world about which data is stored in a database. Data Known facts.
Transactional Coherence and Consistency Presenters: Muhammad Mohsin Butt. (g ) Coe-502 paper presentation 2.
Caches Where is a block placed in a cache? –Three possible answers  three different types AnywhereFully associativeOnly into one block Direct mappedInto.
Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.
Transaction Management Transparencies. ©Pearson Education 2009 Chapter 14 - Objectives Function and importance of transactions. Properties of transactions.
Implementing Precise Interrupts in Pipelined Processors James E. Smith Andrew R.Pleszkun Presented By: Shrikant G.
The Standford Hydra CMP  Lance Hammond  Benedict A. Hubbert  Michael Siu  Manohar K. Prabhu  Michael Chen  Kunle Olukotun Presented by Jason Davis.
1 Appendix C. Review of Memory Hierarchy Introduction Cache ABCs Cache Performance Write policy Virtual Memory and TLB.
Memory Design Principles Principle of locality dominates design Smaller = faster Hierarchy goal: total memory system almost as cheap as the cheapest component,
Memory Hierarchy— Five Ways to Reduce Miss Penalty.
CS161 – Design and Architecture of Computer
Lecture 20: Consistency Models, TM
CS161 – Design and Architecture of Computer
Speculative Lock Elision
Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun
Multiscalar Processors
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
Chapter 8: Main Memory.
The University of Adelaide, School of Computer Science
Predictive Performance
Lecture 6: Transactions
Lecture: Cache Innovations, Virtual Memory
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Lecture 22: Consistency Models, TM
* From AMD 1996 Publication #18522 Revision E
CS 3410, Spring 2014 Computer Science Cornell University
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
CSE 153 Design of Operating Systems Winter 19
Lecture 17 Multiprocessors and Thread-Level Parallelism
Programming with Shared Memory Specifying parallelism
Lecture 23: Transactional Memory
Lecture: Consistency Models, TM
The University of Adelaide, School of Computer Science
10/18: Lecture Topics Using spatial locality
Lecture 17 Multiprocessors and Thread-Level Parallelism
Presentation transcript:

Optimistic Intra-Transaction Parallelism using Thread Level Speculation Chris Colohan 1, Anastassia Ailamaki 1, J. Gregory Steffan 2 and Todd C. Mowry 1,3 1 Carnegie Mellon University 2 University of Toronto 3 Intel Research Pittsburgh

2 Chip Multiprocessors are Here! 2 cores now, soon will have 4, 8, 16, or 32 Multiple threads per core How do we best use them? IBM Power 5 AMD Opteron Intel Yonah

3 Multi-Core Enhances Throughput (diagram: users submit transactions to the DBMS and database on a multi-core database server) Cores can run concurrent transactions and improve throughput

4 Using Multiple Cores (same diagram) Can multiple cores improve transaction latency?

5 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-query parallelism Used for long-running queries (decision support) Does not work for short queries Short queries dominate in commercial workloads

6 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-transaction parallelism Each thread spans multiple queries Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution…

7 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-transaction parallelism Breaks transaction into threads Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution… Thread Level Speculation (TLS) makes parallelization easier.

8 Thread Level Speculation (TLS) *p= *q= =*p =*q Sequential Time Parallel *p= *q= =*p =*q =*p =*q Epoch 1Epoch 2

9 Thread Level Speculation (TLS) *p= *q= =*p =*q Sequential Time *p= *q= =*p R2 Violation! =*p =*q Parallel Use epochs Detect violations Restart to recover Buffer state Oldest epoch: Never restarts No buffering Worst case: Sequential Best case: Fully parallel Data dependences limit performance. Epoch 1Epoch 2
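The violation detection sketched on this slide can be illustrated in software; this is a minimal sketch with hypothetical structures and names (real TLS does this tracking in cache hardware). Each epoch records its speculative loads, and a store by an older epoch to an address a younger epoch has already read flags that younger epoch for restart:

```c
/* Illustrative-only sketch of TLS violation detection. Epoch 0 is the
   oldest (non-speculative) epoch, so it is never flagged, matching the
   slide's "oldest epoch never restarts". */
#include <stdbool.h>
#include <stddef.h>

#define MAX_EPOCHS 4
#define SET_SIZE   64

typedef struct {
    const void *read_addrs[SET_SIZE]; /* speculative load addresses */
    size_t      nreads;
    bool        violated;             /* must restart? */
} Epoch;

static Epoch epochs[MAX_EPOCHS];

/* Record a speculative load by epoch e. */
void tls_load(int e, const void *addr) {
    epochs[e].read_addrs[epochs[e].nreads++] = addr;
}

/* A store by epoch e violates any LATER epoch that already read addr. */
void tls_store(int e, const void *addr) {
    for (int later = e + 1; later < MAX_EPOCHS; later++)
        for (size_t i = 0; i < epochs[later].nreads; i++)
            if (epochs[later].read_addrs[i] == addr)
                epochs[later].violated = true;  /* restart to recover */
}
```

A violated epoch would discard its buffered state and re-execute; forwarding of the stored value is omitted here for brevity.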

10 Transaction Programmer DBMS Programmer Hardware Developer A Coordinated Effort Choose epoch boundaries Remove performance bottlenecks Add TLS support to architecture

11 So what’s new? Intra-transaction parallelism Without changing the transactions With minor changes to the DBMS Without having to worry about locking Without introducing concurrency bugs With good performance Halve transaction latency on four cores

12 Related Work Optimistic Concurrency Control (Kung82) Sagas (Molina&Salem87) Transaction chopping (Shasha95)

13 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

14 Case Study: New Order (TPC-C) Only dependence is the quantity field Very unlikely to occur (1/100,000) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } 78% of transaction execution time

15 Case Study: New Order (TPC-C) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } becomes: GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; TLS_foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; }

16 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

17 Dependences in DBMS Time

18 Dependences in DBMS Time Dependences serialize execution! Example: statistics gathering pages_pinned++ TLS maintains serial ordering of increments To remove, use per-CPU counters Performance tuning: Profile execution Remove bottleneck dependence Repeat
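The per-CPU counter fix above can be sketched as follows; the names are illustrative, not the DBMS's actual code. Giving each CPU a private slot removes the read-modify-write dependence that serialized epochs, and the slots are summed only when the statistic is actually read:

```c
/* Sketch of the per-CPU counter technique: a single shared counter makes
   every increment a cross-epoch dependence; one slot per CPU never
   conflicts. Names (pages_pinned, NCPUS) are illustrative. */
#define NCPUS 4

static long pages_pinned[NCPUS];   /* one private slot per CPU */

/* Each CPU increments only its own slot: no sharing, no violations. */
void count_pin(int cpu) { pages_pinned[cpu]++; }

/* Reading the statistic is rare, so summing on demand is cheap. */
long total_pages_pinned(void) {
    long sum = 0;
    for (int c = 0; c < NCPUS; c++)
        sum += pages_pinned[c];
    return sum;
}
```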

19 Buffer Pool Management CPU Buffer Pool get_page(5) ref: 1 put_page(5) ref: 0

20 get_page(5) put_page(5) Buffer Pool Management CPU Buffer Pool get_page(5) ref: 0 put_page(5) Time get_page(5) put_page(5) TLS ensures first epoch gets page first. Who cares? TLS maintains original load/store order Sometimes this is not needed get_page(5) put_page(5)

21 Buffer Pool Management CPU Buffer Pool get_page(5) ref: 0 put_page(5) Time get_page(5) put_page(5) = Escape Speculation Isolated: undoing get_page will not affect other transactions Undoable: have an operation ( put_page ) which returns the system to its initial state Escape speculation Invoke operation Store undo function Resume speculation get_page(5) put_page(5) get_page(5)

22 Buffer Pool Management CPU Buffer Pool get_page(5) ref: 0 put_page(5) Time get_page(5) put_page(5) get_page(5) put_page(5) Not undoable! get_page(5) put_page(5) = Escape Speculation

23 Buffer Pool Management CPU Buffer Pool get_page(5) ref: 0 put_page(5) Time get_page(5) put_page(5) get_page(5) put_page(5) Delay put_page until end of epoch Avoid dependence = Escape Speculation

24 Removing Bottleneck Dependences We introduce three techniques: Delay operations until non-speculative Mutex and lock acquire and release Buffer pool, memory, and cursor release Log sequence number assignment Escape speculation Buffer pool, memory, and cursor allocation Traditional parallelization Memory allocation, cursor pool, error checks, false sharing

25 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

26 Experimental Setup Detailed simulation Superscalar, out-of-order, 128 entry reorder buffer Memory hierarchy modeled in detail TPC-C transactions on BerkeleyDB In-core database Single user Single warehouse Measure interval of 100 transactions Measuring latency not throughput CPU 32KB 4-way L1 $ Rest of memory system CPU 32KB 4-way L1 $ CPU 32KB 4-way L1 $ CPU 32KB 4-way L1 $ 2MB 4-way L2 $

27 Optimizing the DBMS: New Order Sequential No Optimizations Latches Locks Malloc/Free Buffer Pool Cursor Queue Error Checks False Sharing B-Tree Logging Time (normalized) Idle CPU Violated Cache Miss Busy Cache misses increase Other CPUs not helping Can’t optimize much more 26% improvement

28 Optimizing the DBMS: New Order Sequential No Optimizations Latches Locks Malloc/Free Buffer Pool Cursor Queue Error Checks False Sharing B-Tree Logging Time (normalized) Idle CPU Violated Cache Miss Busy This process took me 30 days and <1200 lines of code.

29 Other TPC-C Transactions New Order, Delivery, Stock Level, Payment, Order Status Time (normalized) Idle CPU Failed Cache Miss Busy 3/5 transactions speed up by 46-66%

30 Conclusions TLS makes intra-transaction parallelism practical Reasonable changes to transaction, DBMS, and hardware Halve transaction latency

31 Needed backup slides (not done yet) 2 proc. Results Shared caches may change how you want to extract parallelism! Just have lots of transactions: no sharing TLS may have more sharing

Any questions? For more information, see:

Backup Slides Follow…

LATCHES

35 Latches Mutual exclusion between transactions Cause violations between epochs Read-test-write cycle → RAW Not needed between epochs TLS already provides mutual exclusion!

36 Latches: Aggressive Acquire Acquire latch_cnt++ …work… latch_cnt-- Homefree latch_cnt++ …work… (enqueue release) Commit work latch_cnt-- Release Large critical section

37 Latches: Lazy Acquire Acquire …work… Release Homefree (enqueue acquire) …work… (enqueue release) Acquire Commit work Release Small critical sections
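The lazy-acquire idea (and the delayed put_page and log operations elsewhere in the talk) amounts to queueing work while speculative and replaying it at commit, which shrinks the real critical section to the commit itself; TLS's own conflict detection provides mutual exclusion in the meantime. A minimal sketch with hypothetical names:

```c
/* Sketch of deferred (lazy) latch operations: a speculative epoch only
   records the operation; the queue is replayed when the epoch commits.
   All names are illustrative. */
#include <stddef.h>

typedef void (*latch_op)(void *latch);

typedef struct { latch_op fn; void *latch; } Deferred;

#define MAX_DEFERRED 32
static Deferred queue[MAX_DEFERRED];
static size_t   nqueued;

/* Speculative epoch: record the operation instead of performing it. */
void lazy_latch(latch_op fn, void *latch) {
    queue[nqueued].fn    = fn;
    queue[nqueued].latch = latch;
    nqueued++;
}

/* At commit (homefree), replay the queued operations for real. */
void epoch_commit(void) {
    for (size_t i = 0; i < nqueued; i++)
        queue[i].fn(queue[i].latch);
    nqueued = 0;
}
```

A violated epoch would simply discard its queue instead of replaying it.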

HARDWARE

39 TLS in Database Systems Non-Database TLS Time TLS in Database Systems Large epochs: More dependences Must tolerate More state Bigger buffers Concurrent transactions

40 Feedback Loop I know this is parallel! for() { do_work(); } par_for() { do_work(); } Must… Make… Faster think feed back feed back feed back …

41 Violations == Feedback *p= *q= =*p =*q Sequential Time *p= *q= =*p R2 Violation! =*p =*q Parallel 0x0FD8 → 0xFD20 0x0FC0 → 0xFC18 Must… Make… Faster

42 Eliminating Violations *p= *q= =*p R2 Violation! =*p =*q Parallel *q= =*q Violation! Eliminate *p Dep. Time 0x0FD8 → 0xFD20 0x0FC0 → 0xFC18 Optimization may make it slower?

43 Tolerating Violations: Sub-epochs Time *q= Violation! Sub-epochs =*q *q= =*q Violation! Eliminate *p Dep.

44 Sub-epochs Started periodically by hardware How many? When to start? Hardware implementation Just like epochs Use more epoch contexts No need to check violations between sub-epochs within an epoch *q= Violation! Sub-epochs =*q

45 Old TLS Design CPU L1 $ L2 $ Rest of memory system CPU L1 $ CPU L1 $ CPU L1 $ Buffer speculative state in write back L1 cache Invalidation Detect violations through invalidations Rest of system only sees committed data Restart by invalidating speculative lines Problems: L1 cache not large enough Later epochs only get values on commit

46 New Cache Design CPU L1 $ L2 $ Rest of memory system CPU L1 $ CPU L1 $ CPU L1 $ Buffer speculative and non-speculative state for all epochs in L2 Speculative writes immediately visible to L2 (and later epochs) Detect violations at lookup time Invalidation coherence between L2 caches L1 $ L2 $ Invalidation Restart by invalidating speculative lines

47 New Features CPU L1 $ L2 $ Rest of memory system CPU L1 $ CPU L1 $ CPU L1 $ L2 $ Speculative state in L1 and L2 cache Cache line replication (versions) Data dependence tracking within cache Speculative victim cache New!

48 Scaling Sequential, 2 CPUs, 4 CPUs, 8 CPUs Time (normalized) Modified with items/transaction Idle CPU Failed Speculation IUO Mutex Stall Cache Miss Instruction Execution

49 Evaluating a 4-CPU system Sequential TLS Seq No Sub-epoch Baseline No Speculation Time (normalized) Idle CPU Failed Speculation IUO Mutex Stall Cache Miss Instruction Execution Original benchmark run on 1 CPU Parallelized benchmark run on 1 CPU Without sub-epoch support Parallel execution Ignore violations (Amdahl’s Law limit)

50 Sub-epochs: How many/How big? Supporting more sub-epochs is better Spacing depends on location of violations Even spacing is good enough

51 Query Execution Actions taken by a query: Bring pages into buffer pool Acquire and release latches & locks Allocate/free memory Allocate/free and use cursors Use B-trees Generate log entries These generate violations.

52 Applying TLS 1. Parallelize loop 2. Run benchmark 3. Remove bottleneck 4. Go to 2 Time

53 Outline Hardware Developer Transaction Programmer DBMS Programmer

54 Violation Prediction *q= =*q Done Predict Dependences Predictor *q= =*q Violation! Eliminate R1/W2 Time

55 Violation Prediction *q= =*q Done Predict Dependences Predictor Time Predictor problems: Large epochs → many predictions Failed prediction → violation Incorrect prediction → large stall Two predictors required: Last store Dependent load

56 TLS Execution *p= *q= =*p R2 Violation! =*p =*q CPU 1 L1 $ L2 $ Rest of memory system CPU 2 L1 $ CPU 3 L1 $ CPU 4 L1 $

57 TLS Execution *p= *q= =*p R2 Violation! =*p =*q *p 11 *s 1 *t valid CPU 2CPU 3 SM SL SM SL CPU 4 SM SL CPU 1 SM SL

58 TLS Execution *p= *q= =*p R2 Violation! =*p =*q *p 11 *s 1 *t valid CPU 2CPU 3 SM SL SM SL CPU 4 SM SL CPU 1 SM SL

59 TLS Execution *p= *q= =*p R2 Violation! =*p =*q *p 111 valid CPU 2CPU 3 SM SL SM SL CPU 4 SM SL CPU 1 SM SL

60 TLS Execution *p= *q= =*p R2 Violation! =*p =*q *q 1 *p 111 valid CPU 2CPU 3 SM SL SM SL CPU 4 SM SL CPU 1 SM SL 1

61 TLS Execution *p= *q= =*p R2 Violation! =*p =*q 1 *q 1 *p 111 valid CPU 2CPU 3 SM SL SM SL CPU 4 SM SL CPU 1 SM SL 1

62 Replication *p= *q= =*p R2 Violation! =*p =*q *q= *p *q Can’t invalidate line if it contains two epochs’ changes valid CPU 2CPU 3 SM SL SM SL CPU 4 SM SL CPU 1 SM SL

63 Replication *p= *q= =*p R2 Violation! =*p =*q *q= 1 *p *q 1 11 valid CPU 2CPU 3 SM SL SM SL CPU 4 SM SL CPU 1 SM SL

64 Replication *p= *q= =*p R2 Violation! =*p =*q *q= Makes epochs independent Enables sub-epochs 1 *p *q 1 11 valid CPU 2CPU 3 SM SL SM SL CPU 4 SM SL CPU 1 SM SL

65 Sub-epochs *p= *q= =*p =*q *p 1 *q 1 1 valid CPU 1bCPU 1c SM SL SM SL CPU 1d SM SL CPU 1a SM SL 1 1 *q= 1 1 *p= *p a 1b 1c 1d … … … … Uses more epoch contexts Detection/buffering/rewind is “free” More replication: Speculative victim cache

66–71 get_page() wrapper

page_t *get_page_wrapper(pageid_t id) {
    static tls_mutex mut;
    page_t *ret;
    tls_escape_speculation();     /* no violations while calling get_page() */
    check_get_arguments(id);      /* may get bad input data from speculative thread! */
    tls_acquire_mutex(&mut);      /* only one epoch per transaction at a time */
    ret = get_page(id);
    tls_release_mutex(&mut);
    tls_on_violation(put, ret);   /* how to undo get_page() */
    tls_resume_speculation();
    return ret;
}

Wraps get_page(). Isolated: undoing this operation does not cause cascading aborts. Undoable: easy way to return system to initial state. Can also be used for: Cursor management malloc()

72 TPC-C Benchmark Company Warehouse 1Warehouse W District 1District 2District 10 Cust 1Cust 2Cust 3k

73 TPC-C Benchmark Warehouse W District W*10 Customer W*30k Order W*30k+ Order Line W*300k+ New Order W*9k+ History W*30k+ Stock W*100k Item 100k 10 3k W 100k

74 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... = hash[19];... hash[21] =... = hash[33];... hash[30] =... = hash[10];... hash[25] =...

75 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... = hash[19];... hash[21] =... = hash[33];... hash[30] =... = hash[10];... hash[25] =... Processor AProcessor BProcessor CProcessor D Thread 1Thread 2Thread 3Thread 4

76 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... = hash[19];... hash[21] =... = hash[33];... hash[30] =... = hash[10];... hash[25] =... Processor AProcessor BProcessor CProcessor D Thread 1Thread 2Thread 3Thread 4 Violation!

77 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... attempt_commit() = hash[19];... hash[21] =... attempt_commit() = hash[33];... hash[30] =... attempt_commit() Processor AProcessor BProcessor CProcessor D Thread 1Thread 2Thread 3Thread 4 Violation!  = hash[10];... hash[25] =... attempt_commit()

78 What is TLS? while(cond) { x = hash[i];... hash[j] = y;... } Time = hash[3];... hash[10] =... attempt_commit() = hash[19];... hash[21] =... attempt_commit() = hash[33];... hash[30] =... attempt_commit() = hash[10];... hash[25] =... attempt_commit() Processor AProcessor BProcessor CProcessor D Thread 1Thread 2Thread 3Thread 4 Violation!  Redo = hash[10];... hash[25] =... attempt_commit() Thread 4

79 TLS Hardware Design What’s new? Large threads Epochs will communicate Complex control flow Huge legacy code base How does hardware change? Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences Aggressive update propagation (implicit forwarding) Sub-epochs

80 L1 Cache Line SL bit: “L2 cache knows this line has been speculatively loaded.” On violation or commit: clear. SM bit: “This line contains speculative changes.” On commit: clear. On violation: SM → Invalid. Otherwise, just like a normal cache. SL SM Valid LRU Tag Data

81 Escaping Speculation Speculative epoch wants to make a system-visible change! Ignore SM lines while escaped Stale bit: “This line may be outdated by speculative work.” On violation or commit: clear SL SM Valid LRU Tag Data Stale

82 L1 to L2 communication L2 sees all stores (write through) L2 sees first load of an epoch NotifySL message → L2 can track data dependences!

83 L1 Changes Summary Add three bits to each line SL SM Stale Modify tag match to recognize bits Add queue of NotifySL requests SL SM Valid LRU Tag Data Stale
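The three added bits can be pictured as a per-line metadata struct; the field widths and the commit/violation handlers below are an illustrative sketch, not a real cache design:

```c
/* Sketch of the per-line TLS metadata the slides describe: the usual
   valid/LRU/tag fields plus three TLS bits (SL = speculatively loaded,
   SM = speculatively modified, Stale). Widths are illustrative. */
typedef struct {
    unsigned tag   : 20;  /* illustrative tag width */
    unsigned lru   : 3;
    unsigned valid : 1;
    unsigned sl    : 1;   /* line was speculatively loaded */
    unsigned sm    : 1;   /* line contains speculative changes */
    unsigned stale : 1;   /* may be outdated by speculative work */
} L1Line;

/* On commit: speculative marks clear, data remains valid. */
void line_commit(L1Line *l) { l->sl = l->sm = l->stale = 0; }

/* On violation: SM → Invalid; all TLS bits clear. */
void line_violate(L1Line *l) {
    if (l->sm) l->valid = 0;   /* speculative changes are discarded */
    l->sl = l->sm = l->stale = 0;
}
```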

84 L2 Cache Line SL SM (fine-grained SM) Valid Dirty Exclusive LRU Tag Data SL SM per CPU (CPU1, CPU2) Cache line can be: Modified by one CPU Loaded by multiple CPUs

85 Cache Line Conflicts Three classes of conflict: Epoch 2 stores, epoch 1 loads Need “old” version to load Epoch 1 stores, epoch 2 stores Need to keep changes separate Epoch 1 loads, epoch 2 stores Need to be able to discard line on violation → Need a way of storing multiple conflicting versions in the cache

86 Cache line replication On conflict, replicate line Split line into two copies Divide SM and SL bits at split point Divide directory bits at split point

87 Replication Problems Complicates line lookup Need to find all replicas and select “best” Best == most recent replica Change management On write, update all later copies Also need to find all more speculative replicas to check for violations On commit must get rid of stale lines Invalidation Required Buffer (IRB)

88 Victim Cache How do you deal with a full cache set? Use a victim cache Holds evicted lines without losing SM & SL bits Must be fast: every cache lookup needs to know: Do I have the “best” replica of this line? Critical path Do I cause a violation? Not on critical path

89 Summary of Hardware Support Sub-epochs Violations hurt less! Shared cache TLS support Faster communication More room to store state RAOs Don’t speculate on known operations Reduces amount of speculative state

90 Summary of Hardware Changes Sub-epochs Checkpoint register state Needs replicas in cache Shared cache TLS support Speculative L1 Replication in L1 and L2 Speculative victim cache Invalidation Required Buffer RAOs Suspend/resume speculation Mutexes “Undo list”

91 TLS Execution *p= *q= =*p R2 Violation! =*p =*q CPU L1 $ L2 $ Rest of memory system CPU L1 $ CPU L1 $ CPU L1 $ *p Invalidation *p *q *p *q


95 Problems with Old Cache Design Database epochs are large L1 cache not large enough Sub-epochs add more state L1 cache not associative enough Database epochs communicate L1 cache only communicates committed data

96 Intro Summary TLS makes intra-transaction parallelism easy Divide transaction into epochs Hardware support: Detect violations Restart to recover Sub-epochs mitigate penalty Buffer state New process: Modify software → avoid violations → improve performance

97 The Many Faces of Ogg Money! pwhile() {} while(too_slow) make_faster(); Must tune reorder buffer…

98 Must tune reorder buffer… The Many Faces of Ogg Duh… pwhile() {} Money! pwhile() {} while(too_slow) make_faster();

99 Removing Bottlenecks Three general techniques: Partition data structures malloc Postpone operations until non-speculative Latches and locks, log entries Handle speculation manually Buffer pool

100 Bottlenecks Encountered Buffer pool Latches & Locks Malloc/free Cursor queues Error checks False sharing B-tree performance optimization Log entries


102 Performance on 4 CPUs Unmodified benchmark: Modified benchmark:

103 Incremental Parallelization Time 4 CPUs

104 Scaling Unmodified benchmark: Modified benchmark:

105 Parallelization is Hard (chart: performance improvement vs. programmer effort, comparing a parallelizing compiler, tuning, hand parallelization, and “what we want”)

106 Case Study: New Order (TPC-C) Begin transaction { } End transaction

107 Case Study: New Order (TPC-C) Begin transaction { Read customer info Read & increment order # Create new order } End transaction

108 Case Study: New Order (TPC-C) Begin transaction { Read customer info Read & increment order # Create new order For each item in order { Get item info Decrement count in stock Record order info } } End transaction

109 Case Study: New Order (TPC-C) Begin transaction { Read customer info Read & increment order # Create new order For each item in order { Get item info Decrement count in stock Record order info } } End transaction 80% of transaction execution time


111 The Many Faces of Ogg Duh… pwhile() {} while(too_slow) make_faster(); Money! pwhile() {} Must tune reorder buffer…

Step 2: Changing the Software

113 No problem! Loop is easy to parallelize using TLS! Not really: calls into the DBMS invoke complex operations Ogg needs to do some work Many operations in DBMS are parallel Not written with TLS in mind!

114 Resource Management Mutexes acquired and released Locks locked and unlocked Cursors pushed and popped from free stack Memory allocated and freed Buffer pool entries Acquired and released

115 Mutexes: Deadlock? Problem: Re-ordered acquire/release operations! → Possibly introduced deadlock? Solutions: Avoidance: Static acquire order Recovery: Detect deadlock and violate
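The avoidance option can be sketched as a canonical acquire order; the latch type and the order-recording trace below are illustrative. Acquiring in a fixed global order (here, by address) guarantees that epochs whose latch operations were re-ordered by speculation can never form a wait cycle:

```c
/* Sketch of deadlock avoidance via a static acquire order. The trace
   array exists only so the ordering can be checked; a real system would
   block on the latch instead. */
#include <stddef.h>

typedef struct { int held; } Latch;

#define MAX_TRACE 8
static const Latch *trace[MAX_TRACE];  /* acquisition order, for checking */
static size_t ntrace;

static void acquire(Latch *l) {
    l->held = 1;
    trace[ntrace++] = l;
}

/* Acquire two latches in canonical (address) order, regardless of the
   order the caller names them. */
void acquire_pair(Latch *a, Latch *b) {
    if (a > b) { Latch *t = a; a = b; b = t; }
    acquire(a);
    acquire(b);
}
```

Because every code path ends up acquiring in the same global order, no two epochs can each hold a latch the other is waiting for.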

116 Locks Like mutexes, but: Allows multiple readers No memory overhead when not held Often held for much longer Treat similarly to mutexes

117 Cursors Used for traversing B-trees Pre-allocated, kept in pools

118 Maintaining Cursor Pool Get Use Release head



121 Maintaining Cursor Pool Get Use Release head Get Use Release Violation!

122 Parallelizing Cursor Pool Use per-CPU pools: Modify code: each CPU gets its own pool No sharing == no violations! Requires cpuid() instruction Get Use Release head Get Use Release head
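The per-CPU pool idea can be sketched as follows (illustrative structures, not BerkeleyDB's): each CPU allocates and frees cursors only from its own free list, so the shared head pointer, and the violations it caused, disappear.

```c
/* Sketch of per-CPU cursor pools: no CPU ever touches another CPU's
   pool, so speculative epochs cannot conflict on pool state. */
#include <stddef.h>

#define NCPUS     4
#define POOL_SIZE 8

typedef struct { int in_use; } Cursor;

static Cursor pools[NCPUS][POOL_SIZE];   /* one private pool per CPU */

/* Grab a free cursor from this CPU's own pool (cpu from cpuid()). */
Cursor *cursor_get(int cpu) {
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pools[cpu][i].in_use) {
            pools[cpu][i].in_use = 1;
            return &pools[cpu][i];
        }
    }
    return NULL;   /* pool exhausted */
}

void cursor_release(Cursor *c) { c->in_use = 0; }
```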

123 Memory Allocation Problem: malloc() metadata causes dependences Solutions: Per-cpu memory pools Parallelized free list

124 The Log Append records to global log Appending causes dependence Can’t parallelize: Global log sequence number (LSN) Generate log records in buffers Assign LSNs when homefree
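The log fix can be sketched like this, with hypothetical names: speculative epochs append records to a private buffer with no LSN, and sequence numbers are drawn from the global counter only once the epoch is homefree, so the LSN counter never becomes a cross-epoch dependence.

```c
/* Sketch of deferred LSN assignment: the global counter is touched only
   at commit time, never while speculative. */
#include <stddef.h>

typedef struct { long lsn; const char *payload; } LogRec;

#define BUF_SIZE 16
static LogRec buffer[BUF_SIZE];
static size_t nbuffered;
static long   global_lsn;   /* shared counter, touched only when homefree */

/* Speculative epoch: buffer the record with no LSN yet. */
void log_append(const char *payload) {
    buffer[nbuffered].lsn     = -1;
    buffer[nbuffered].payload = payload;
    nbuffered++;
}

/* Homefree: assign sequence numbers and hand records to the real log. */
void log_commit(void) {
    for (size_t i = 0; i < nbuffered; i++)
        buffer[i].lsn = global_lsn++;
    nbuffered = 0;
}
```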

125 B-Trees Leaf pages contain free space counts Inserts of random records – o.k. Inserting adjacent records Dependence on decrementing count Page splits Infrequent

126 Other Dependences Statistics gathering Error checks False sharing

127 Related Work Lots of work in TLS: Multiscalar (Wisconsin) Hydra (Stanford) IACOMA (Illinois) RAW (MIT) Hand parallelizing using TLS: Manohar Prabhu and Kunle Olukotun (PPoPP’03)

Any questions?

129 Why is this a problem? B-tree insertion into the ORDERLINE table; the key is ol_n. The DBMS does not know that keys will be sequential, so each insert usually updates the same b-tree page. for (ol_n = 0; ol_n < 15; ol_n++) { INSERT INTO ORDERLINE (ol_n, ol_item, ol_cnt) VALUES (:ol_n, :ol_item, :ol_cnt); }

130 Sequential Btree Inserts [figure: animation of a leaf page filling; the free-space count decrements as each sequential item is inserted into the same page]

131 Improvement: SM Versioning Blow away all SM lines on a violation? May be bad! Instead: On primary violation: Only invalidate locally modified SM lines On secondary violation: Invalidate all SM lines Needs one more bit: LocalSM May decrease number of misses to L2 on violation

132 Outline Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences Aggressive update propagation (implicit forwarding) Sub-epochs Results and analysis


134 Tolerating dependences Aggressive update propagation Get for free! Sub-epochs Periodically checkpoint epochs Every N instructions? Picking N may be interesting Perhaps checkpoints could be set before the location of previous violations?

135 Outline Store state in L2 instead of L1 Reversible atomic operations Tolerate dependences Aggressive update propagation (implicit forwarding) Sub-epochs Results and analysis

136 Why not faster? Possible reasons: Idle CPUs RAO mutexes Violations Cache effects Data dependences

137 Why not faster? Possible reasons: Idle CPUs 9 epochs/region average Two bundles of four and one of one ¼ of CPU cycles wasted! RAO mutexes Violations Cache effects Data dependences

138 Why not faster? Possible reasons: Idle CPUs RAO mutexes Not implemented yet Oops! Violations Cache effects Data dependences

139 Why not faster? Possible reasons: Idle CPUs RAO mutexes Violations 21/969 epochs violated Distance 1 “magic synchronized” 2.2M cycles (over 4 CPUs) About 1.5% Cache effects Data dependences

140 Why not faster? Possible reasons: Idle CPUs RAO mutexes Violations Cache effects Deserves its own slide. Data dependences

141 Cache effects of speculation Only 20% of references are speculative! Speculative references have small impact on non-speculative hit rate (<1%) Speculative refs miss a lot in L1: 9-15% for reads, 2-6% for writes L2 saw a HUGE increase in traffic: 152k refs to 3474k refs Spec/non-spec lines are thrashing between L1s

142 Why not faster? Possible reasons: Idle CPUs RAO mutexes Violations Cache effects Data dependences Oh yeah! B-tree item count Split up b-tree insert? alloc and write Do alloc as RAO Needs more thought

143 L2 Cache Line [diagram: an L2 cache line with Valid/Dirty/Exclusive/LRU state, Tag, and Data, extended with per-CPU SL and SM bits; fine-grained SM is shown for CPU1 and CPU2 across two sets]

144 Why are you here? Want faster database systems Have funky new hardware – Thread Level Speculation (TLS) How can we apply TLS to database systems? Side question: Is this a VLDB or an ASPLOS talk?

145 How? Divide transaction into TLS-threads Run TLS-threads in parallel; maintain sequential semantics Profit!

146 Why parallelize transactions? Decrease transaction latency Increase concurrency while avoiding concurrency control bottleneck A.k.a.: use more CPUs, same # of xactions The obvious: Database performance matters

147 Shopping List What do we need? (research scope) Cheap hardware Thread Level Speculation (TLS) Minor changes allowed. Important database application TPC-C Almost no changes allowed! Modular database system BerkeleyDB Some changes allowed.

148 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions


150 What’s new? Database operations are: Large Complex Large TLS-threads Lots of dependences Difficult to analyze Want: Programmer optimization effort = faster program

151 Hardware changes summary Must tolerate dependences Prediction? Implicit forwarding? May need larger caches May need larger associativity

152 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

153 Parallelization Strategy 1. Pick a benchmark 2. Parallelize a loop 3. Analyze dependences 4. Optimize away dependences 5. Evaluate performance 6. If not satisfied, goto 3

154 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Resource management The log B-trees False sharing Results Conclusions

155 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

156 Results Viola simulator: Single CPI Perfect violation prediction No memory system 4 CPUs Exhaustive dependence tracking Currently working on an out-of-order superscalar simulation (cello) 10 transaction warm-up Measure 100 transactions

157 Outline TLS Hardware The Benchmark (TPC-C) Changing the database system Results Conclusions

158 Conclusions TLS can improve transaction latency Violation predictors important Iff dependences must be tolerated TLS makes hand parallelizing easier

159 Improving Database Performance How to improve performance: Parallelize transaction Increase number of concurrent transactions Both of these require independence of database operations!

160 Case Study: New Order (TPC-C) Begin transaction { } End transaction

161 Case Study: New Order (TPC-C) Begin transaction { Read customer info ( customer, warehouse ) Read & increment order # ( district ) Create new order ( orders, neworder ) } End transaction

162 Case Study: New Order (TPC-C) Begin transaction { Read customer info ( customer, warehouse ) Read & increment order # ( district ) Create new order ( orders, neworder ) For each item in order { Get item info ( item ) Decrement count in stock ( stock ) Record order info ( orderline ) } } End transaction

163 Case Study: New Order (TPC-C) Begin transaction { Read customer info ( customer, warehouse ) Read & increment order # ( district ) Create new order ( orders, neworder ) For each item in order { Get item info ( item ) Decrement count in stock ( stock ) Record order info ( orderline ) } } End transaction Parallelize this loop


165 Implementing on a Real DB Using BerkeleyDB “Table” == “Database” Give the database an arbitrary key and it will return arbitrary data (bytes) Use structs for keys and rows The database provides ACID through: Transactions Locking (page level) Storage management Provides indexing using b-trees

166 Parallelizing a Transaction For each item in order { Get item info ( item ) Decrement count in stock ( stock ) Record order info ( order line ) }

167 Parallelizing a Transaction For each item in order { Get item info ( item ) Decrement count in stock ( stock ) Record order info ( order line ) } Get cursor from pool Use cursor to traverse b-tree Find row, lock page for row Release cursor to pool Get cursor from pool Use cursor to traverse b-tree Find row, lock page for row Release cursor to pool

168 Maintaining Cursor Pool Get Use Release head

171 Maintaining Cursor Pool Get Use Release head Get Use Release Violation!

172 Parallelizing Cursor Pool 1 Use per-CPU pools: Modify code: each CPU gets its own pool No sharing == no violations! Requires cpuid() instruction Get Use Release head Get Use Release head

173 Parallelizing Cursor Pool 2 Dequeue and enqueue: atomic and unordered Delay enqueue until end of thread Forces separate pools Avoids modification of data struct Get Use Release head Get Use Release
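A sketch of the delayed-enqueue idea, assuming an illustrative per-epoch context: released cursors collect on a private side list, and the shared pool head is written only once, at epoch end, where it can be an atomic, unordered operation.

```c
#include <stddef.h>

/* Sketch of "delay enqueue until end of thread": a released cursor
 * goes onto a per-epoch side list instead of the shared pool. */
typedef struct cursor { struct cursor *next; } cursor_t;

typedef struct {
    cursor_t *deferred;            /* releases buffered in this epoch */
} epoch_ctx_t;

void release_cursor(epoch_ctx_t *e, cursor_t *c)
{
    c->next = e->deferred;         /* shared pool head is not written */
    e->deferred = c;
}

/* At epoch end: one splice back into the shared pool. */
void epoch_end(epoch_ctx_t *e, cursor_t **pool_head)
{
    while (e->deferred != NULL) {
        cursor_t *c = e->deferred;
        e->deferred = c->next;
        c->next = *pool_head;
        *pool_head = c;
    }
}
```

This forces each epoch onto effectively separate pools during speculation without modifying the cursor data structure itself, which is the trade-off the slide notes.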

174 Parallelizing Cursor Pool 3 Atomic unordered dequeue & enqueue Cursor struct is “TLS unordered” Struct defined as a byte range in memory Get Use Release head Get Use Release Get Use Release Get Use Release

175 Parallelizing Cursor Pool 4 Mutex protect dequeue & enqueue; declare pointer to cursor struct to be “TLS unordered” Any access through pointer does not have TLS applied Pointer is tainted, any copies of it keep this property

176 Problems with 3 & 4 What exactly is the boundary of a structure? How do you express the concept of object in a loosely-typed language like C? A byte range or a pointer is only an approximation. Dynamically allocated sub-components?

177 Mutexes in a TLS world Two types of threads: “real” threads TLS threads Two types of mutexes: Inter-real-thread Inter-TLS-thread

178 Inter-real-thread Mutexes Acquire == get mutex for all TLS threads Release == release for current TLS thread May still be held by another TLS thread!
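The acquire/release semantics above can be sketched with a holder count; in a real design the counter updates would themselves be coordinated by the TLS mechanism, so this single-threaded illustration uses hypothetical names.

```c
#include <pthread.h>

/* Sketch of an inter-real-thread mutex under TLS: the first TLS thread
 * of a transaction to acquire takes the real mutex on behalf of all of
 * them; each TLS thread's "release" just drops a count, and the real
 * release happens only when the last TLS thread is done. */
typedef struct {
    pthread_mutex_t m;      /* the real, inter-transaction mutex */
    int holders;            /* TLS threads currently "holding" it */
} tls_mutex_t;

void tls_acquire(tls_mutex_t *t)
{
    if (t->holders++ == 0)          /* first TLS thread: lock for real */
        pthread_mutex_lock(&t->m);
}

void tls_release(tls_mutex_t *t)
{
    if (--t->holders == 0)          /* last one out: release for real */
        pthread_mutex_unlock(&t->m);
}
```

This captures the slide's point that after one TLS thread releases, the mutex may still be held on behalf of another TLS thread of the same transaction.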

179 Inter-TLS-thread Mutexes Should never interact between two real threads Implies no TLS ordering between TLS threads while the mutex is held But what to do on a violation? Can’t just throw away changes to memory Must undo operations performed in the critical section

180 Parallelizing Databases using TLS Split transactions into threads Threads created are large 60k+ instructions 16kB of speculative state More dependences between threads How do we design a machine which can handle these large threads?

181 The “Old” Way [diagram: each CPU buffers speculative state in its own L1 cache, while the shared L2, L3, and memory system hold only committed state]

182 The “Old” Way Advantages Each epoch has its own L1 cache Epoch state does not intermix Disadvantages L1 cache is too small! Full cache == dead meat No shared speculative memory

183 The “New” Way L2 cache is huge! State of the art in caches, Power5: 1.92MB 10-way L2 32kB 4-way L1 Shared speculative memory “for free” Keeps TLS logic off the critical path

184 TLS Shared L2 Design L1 Write-through write-no-allocate [CullerSinghGupta99] Easy to understand and reason about Writes visible to L2 – simplifies shared speculative memory L2 cache: shared cache architecture with replication Rest of memory: distributed TLS coherence

185 TLS Shared L2 Design [diagram: the shared L2 caches hold the “real” speculative state; each CPU’s L1 holds cached speculative state, with the L3 and memory system below]


Part II: Dealing with dependences

188 Predictor Design How do you design a predictor that: Identifies violating loads Identifies the last store that causes them Only triggers when they cause a problem Has very, very high accuracy?

189 Sub-epoch design Like checkpointing Leave “holes” in epoch # space Every 5k instructions start a new epoch Uses more cache to buffer changes More strain on associativity/victim cache Uses more epoch contexts

190 Summary Supporting large epochs needs: Buffer state in L2 instead of L1 Shared speculative memory Replication Victim cache Sub-epochs

Any questions?