Tolerating Dependences Between Large Speculative Threads Via Sub-Threads
Chris Colohan (Google, Inc. / Carnegie Mellon University), Anastassia Ailamaki (Carnegie Mellon University), J. Gregory Steffan (University of Toronto) and Todd C. Mowry (Carnegie Mellon University / Intel Research Pittsburgh)
Copyright 2006 Chris Colohan
Thread Level Speculation (TLS)
[Figure: a sequential execution of stores (*p=, *q=) and loads (=*p, =*q) is divided between Thread 1 and Thread 2, which execute in parallel.]
Thread Level Speculation (TLS)
Use threads. Detect violations. Restart to recover. Buffer state.
Worst case: sequential. Best case: fully parallel.
Data dependences limit performance.
[Figure: in the parallel execution, Thread 2 loads *p before Thread 1's store to *p, triggering a violation and a restart of Thread 2.]
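The detect-violate-restart cycle above can be sketched in a toy software model (all names here are invented for illustration; real TLS does this in hardware): an older thread's store violates any younger thread that has already loaded the same address.

```python
class Epoch:
    """One speculative thread (epoch); invented model, not the real hardware."""
    def __init__(self, eid):
        self.eid = eid
        self.reads = set()     # addresses speculatively loaded
        self.writes = {}       # buffered speculative stores
        self.violated = False  # set when this epoch must restart

def store(epoch, addr, value, younger_epochs):
    epoch.writes[addr] = value
    # A store by an older epoch violates any younger epoch that already
    # loaded the same address: that load read a stale value.
    for e in younger_epochs:
        if addr in e.reads:
            e.violated = True

def load(epoch, addr, memory, older_epochs):
    epoch.reads.add(addr)
    # Forward the most recent older speculative value if one exists.
    for e in reversed(older_epochs):
        if addr in e.writes:
            return e.writes[addr]
    return memory.get(addr, 0)
```

In this sketch, a restart would clear the violated epoch's read and write sets and re-run it, which is exactly what makes the worst case sequential.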
Violations as a Feedback Signal
[Figure: each violation reports the offending addresses (0x0FD8, 0xFD20, 0x0FC0, 0xFC18) back to the programmer: must… make… faster.]
Eliminating Violations
Eliminating the *p dependence removes one violation, but a violation on *q remains; the optimization may even make execution slower.
All-or-nothing execution makes optimization harder.
Tolerating Violations: Sub-threads
[Figure: with sub-threads, a violation on *q rolls execution back only to the most recent sub-thread boundary rather than restarting the whole thread.]
Sub-threads
Sub-threads are periodic checkpoints of a speculative thread. They make TLS work well with:
- large speculative threads
- unpredictable, frequent dependences
They speed up database transaction response time by a factor of 1.9 to 2.9.
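As a rough sketch of the checkpointing idea (invented names, not the paper's hardware): if each sub-thread records its own read set, a violation only needs to rewind to the first sub-thread that loaded the offending address, preserving all earlier work.

```python
def first_violated_subthread(read_sets, addr):
    """read_sets[i] is the set of addresses loaded by sub-thread i.

    Returns the index of the earliest sub-thread that read `addr`
    (execution rewinds to its checkpoint), or None if no sub-thread
    read it and nothing needs to roll back.
    """
    for i, rs in enumerate(read_sets):
        if addr in rs:
            return i
    return None
```

Without sub-threads this function would effectively always answer 0: any violation discards the entire thread.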
Overview
TLS and database transactions
Buffering large speculative threads
Hardware support for sub-threads
Results
Case Study: New Order (TPC-C)
The only dependence is on the quantity field, and it is very unlikely to occur (about 1 in 100,000). The code below accounts for 78% of transaction execution time.

GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach (item) {
    GET quantity FROM stock WHERE i_id=item;
    UPDATE stock WITH quantity-1 WHERE i_id=item;
    INSERT item INTO order_line;
}
Case Study: New Order (TPC-C)
To parallelize the transaction, the loop becomes a TLS_foreach:

GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
TLS_foreach (item) {
    GET quantity FROM stock WHERE i_id=item;
    UPDATE stock WITH quantity-1 WHERE i_id=item;
    INSERT item INTO order_line;
}
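A toy software analogue of TLS_foreach (names invented; hardware TLS performs this retry loop transparently): each iteration runs speculatively, and an iteration that was violated simply re-executes.

```python
def tls_foreach(items, body, max_retries=10):
    """Run `body` over `items`, retrying any iteration that reports a
    violation. `body(item)` returns (ok, value); ok is False when the
    speculative attempt was violated and must re-execute.
    """
    results = []
    for item in items:
        for _attempt in range(max_retries):
            ok, value = body(item)
            if ok:
                results.append(value)
                break
        # An iteration that exhausts its retries is dropped in this toy
        # model; real TLS would fall back to sequential execution.
    return results
```

The key property mirrored here is that retries are local: one violated iteration does not undo the committed results of earlier iterations.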
Optimizing the DBMS: New Order
[Figure: normalized execution time, starting from sequential and unoptimized parallel versions, as bottlenecks are removed one by one: latches, locks, malloc/free, buffer pool, cursor queue, error checks, false sharing, B-tree, logging. Each bar is broken into idle CPU, failed speculation, cache miss, and busy time.]
This process took me 30 days and <1200 lines of code.
Results from Colohan, Ailamaki, Steffan and Mowry, VLDB 2005.
Overview
TLS and database transactions
Buffering large speculative threads
Hardware support for sub-threads
Results
Threads from Transactions

Thread           Size (dyn. instrs.)   Dependent loads
New Order        62k                   75
New Order 150    61k                   75
Delivery         33k                   20
Delivery Outer   490k                  34
Stock Level      17k                   29

Challenge: buffering large threads.
Buffering Large Threads
Prior work:
- Cintra et al. [ISCA’00]: the oldest thread in each chip can store its state in the L2
- Prvulovic et al. [ISCA’01]: speculative state can overflow into RAM
What we need:
- fast
- deals well with many forward dependences
- easy to add sub-thread support
Our approach: buffer speculative state in the shared L2.
L1 cache changes
Add a Speculatively Modified (SM) bit per line, indicating the line was modified by the current thread or an older thread.
On a violation, invalidate all SM lines.
[Figure: L1 line layout: SM bit plus data.]
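A minimal sketch of the one-bit-per-line scheme (modeling the L1 as a Python dict; the representation is invented): on a violation, every line whose SM bit is set is dropped in one pass, with no undo log to walk.

```python
def invalidate_on_violation(l1):
    """l1 maps address -> (data, sm_bit).

    Keep only non-speculative lines; all speculatively modified state
    is discarded in a single sweep, as on a TLS violation.
    """
    return {addr: line for addr, line in l1.items() if not line[1]}
```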
L2 cache changes
Add a Speculatively Modified (SM) and a Speculatively Loaded (SL) bit per line, one pair per speculative thread.
If two threads modify a line, replicate it within the associative set.
Add a small speculative victim cache to catch over-replicated lines.
[Figure: L2 line layout: an SM/SL pair for each of T1 and T2, plus data.]
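The replicate-on-conflict rule might be sketched like this (a software model with an invented structure, not the actual cache design): a store looks for a version of the line it is allowed to write, and otherwise creates a replica so each thread keeps its own version.

```python
class L2Line:
    """One version of a cache line, with per-thread speculative bits."""
    def __init__(self, data=None):
        self.data = data
        self.sm = set()  # ids of threads that speculatively modified it
        self.sl = set()  # ids of threads that speculatively loaded it

def spec_store(lines, addr, tid, value):
    """lines maps address -> list of line versions (the 'set')."""
    versions = lines.setdefault(addr, [L2Line()])
    for line in versions:
        # A thread may write a clean line or a line it already modified.
        if tid in line.sm or not line.sm:
            line.data = value
            line.sm.add(tid)
            return line
    # Another thread already modified every version: replicate the line.
    replica = L2Line(value)
    replica.sm.add(tid)
    versions.append(replica)
    return replica
```

In the real design the replicas live within the associative set, and the small victim cache absorbs the over-replicated lines that no longer fit.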
Overview
TLS and database transactions
Buffering large speculative threads
Hardware support for sub-threads
Results
Sub-thread support
Add one thread context per sub-thread.
No dependence tracking is needed between sub-threads.
[Figure: each L2 line now carries an SM/SL pair for every sub-thread (T1a, T1b, T2a, T2b).]
When to start new sub-threads?
Right before “high-risk” loads?
Results show that starting sub-threads periodically works well.
Secondary Violations
Sub-thread start table: how far should execution rewind on a secondary violation?
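One plausible reading of the start table, sketched in software (this is a guess at the mechanism; the names and semantics are invented, not taken from the paper): record each sub-thread's start time, and on a violation rewind to the sub-thread that issued the offending load.

```python
def rewind_target(start_table, violating_load_time):
    """start_table[i] is the start timestamp of sub-thread i, in order.

    Return the index of the sub-thread containing the violating load:
    the last sub-thread whose start time is at or before that load.
    All later sub-threads are discarded along with it.
    """
    target = 0
    for sub_id, start_time in enumerate(start_table):
        if start_time <= violating_load_time:
            target = sub_id
    return target
```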
Overview
TLS and database transactions
Buffering large speculative threads
Hardware support for sub-threads
Results
Experimental Setup
Detailed simulation: superscalar, out-of-order, 128-entry reorder buffer; memory hierarchy modeled in detail.
Workload: TPC-C transactions on BerkeleyDB, an in-core database, single user, single warehouse.
We measure an interval of 100 transactions: latency, not throughput.
[Figure: four CPUs, each with a 32KB 4-way L1 cache, share a 2MB 4-way L2 cache and the rest of the memory system.]
TPC-C on 4 CPUs
[Figure: normalized execution time for New Order, New Order 150, Delivery, Delivery Outer, Stock Level, Payment, and Order Status. Each transaction has three bars: N = no sub-thread support, S = with sub-threads, L = limit when ignoring violations (Amdahl's Law limit). Bars are broken into idle CPU, failed speculation, cache miss, and busy time.]
TPC-C on 4 CPUs
Sub-threads improve performance by limiting the impact of failed speculation.
TPC-C on 4 CPUs
Sub-threads have minimal impact on cache misses.
Victim Cache Usage
Victim cache usage by L2 cache associativity:

                 4-way   8-way   16-way
New Order        54      4       0
New Order 150    64      39      0
Delivery         14      0       0
Delivery Outer   62      4       0
Stock Level      40      0       0

A small victim cache is sufficient.
Sub-thread size
[Figure: normalized execution time for New Order with 2, 4, and 8 sub-threads, as sub-thread size varies from 250 to 25,000 instructions. Bars are broken into idle CPU, failed speculation, cache miss, and busy time.]
Sub-thread size
Periodically starting sub-threads works surprisingly well.
Related Work
Checkpointing: using the cache to simulate a larger reorder buffer [Martínez02].
Tolerating dependences: selective re-execution [Sarangi05]; predicting and synchronizing dependences [many papers].
Using speculation for manual parallelization: as applied to SPEC [Prabhu03]; TCC [Hammond04].
TLS and Transactional Memory: Multiscalar, IACOMA, Hydra, RAW.
Conclusion
Sub-threads let TLS tolerate unpredictable dependences. This:
- makes incremental feedback-directed parallelization possible
- makes TLS with large threads practical
- lets us parallelize database transactions
The hardware is a simple extension of previous TLS schemes.
Sub-threads speed up 3 of 5 TPC-C transactions by a factor of 1.9 to 2.9.
Any questions?
BACKUP SLIDES FOLLOW
Why Parallelize Transactions?
Do not use this if you have no idle CPUs; database people only care about throughput!
But some transactions are latency-sensitive, e.g. financial transactions.
And in lock-bound workloads, freeing up locks faster means more throughput!
Buffering Large Threads
Each L2 line keeps a speculative store bit and load bit per thread.
[Figure: thread 1 executes "store X, 0x00"; the L2 line for 0x00 now holds X with thread 1's store bit (S1) set.]
[Figure, continued: thread 1 executes "store A, 0x01", so line 0x01 holds A with S1 set. Thread 2 executes "load 0x00", setting its load bit (L2) on the line holding X. When thread 2 executes "store Y, 0x00", the line is replicated so each thread keeps its own version: X with S1, and Y with S2 and L2. Thread 2's "load 0x01" sets its load bit on the line holding A, and its "store B, 0x01" replicates that line as well.]
Sub-thread Support
Divide each thread into sub-threads (a and b) and roll back only the violated sub-thread. Store and load bits are kept per sub-thread.
[Figure: the same sequence of stores and loads, with line versions now tagged by sub-thread (S1a, S1b, S2a, L2a, L2b) according to which sub-thread accessed them.]
Buffer Pool Management
[Figure: a CPU calls get_page(5) on the buffer pool, raising the page's reference count to 1; put_page(5) drops it back to 0.]
Buffer Pool Management
[Figure: two speculative threads each call get_page(5) and put_page(5) on the same page.]
TLS ensures the first thread gets the page first. Who cares?
Buffer Pool Management
For each get_page call:
1. escape speculation
2. invoke the operation
3. store an undo function
4. resume speculation
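The escape-and-undo protocol on this slide can be sketched as follows (the class and names are invented for illustration): the escaped operation runs immediately, outside speculation, and its undo function is logged in case the thread is later violated.

```python
class EscapeLog:
    """Toy model of escaping speculation with an undo log."""
    def __init__(self):
        self.undos = []

    def escape(self, do_fn, undo_fn):
        # Step 1-2: escape speculation and invoke the operation for real.
        result = do_fn()
        # Step 3: store the undo function; step 4: resume speculation.
        self.undos.append(undo_fn)
        return result

    def rollback(self):
        # On a violation, reverse the escaped operations in LIFO order.
        for undo in reversed(self.undos):
            undo()
        self.undos.clear()
```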
Buffer Pool Management
But put_page is not undoable!
Buffer Pool Management
Solution: delay put_page until the end of the thread, avoiding the dependence entirely.
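Deferring non-undoable operations to commit might look like this in miniature (invented names; a sketch, not the paper's implementation): deferred calls run only when the thread commits, so an abort never has to reverse them.

```python
class SpeculativeThread:
    """Toy model: buffer non-undoable operations until commit."""
    def __init__(self):
        self.deferred = []  # (fn, args) pairs to run at commit

    def defer(self, fn, *args):
        # e.g. thread.defer(put_page, 5): record the call, don't run it.
        self.deferred.append((fn, args))

    def commit(self):
        # The thread is now non-speculative: run every deferred call.
        for fn, args in self.deferred:
            fn(*args)
        self.deferred.clear()

    def abort(self):
        # Nothing to undo: the deferred operations never executed.
        self.deferred.clear()
```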
TLS in Database Systems
[Figure: non-database TLS uses many small speculative threads; TLS in database systems runs concurrent transactions as large speculative threads.]
Large threads mean more dependences, which must be tolerated, and more state, which requires bigger buffers.
Feedback Loop
I know this is parallel!
for() { do_work(); }  becomes  par_for() { do_work(); }
[Figure: the programmer iterates in a feedback loop (think, feed back, repeat): must… make… faster.]
LATCHES
Latches
Latches provide mutual exclusion between transactions, but they cause violations between speculative threads: the read-test-write cycle is a RAW dependence.
They are not needed between threads, since TLS already provides mutual exclusion!
Latches: Aggressive Acquire
On acquire, each thread increments latch_cnt and does its work; a speculative thread enqueues its release, performing the latch_cnt-- only when it commits. The latch is released once the count returns to zero.
Result: one large critical section.
Latches: Lazy Acquire
A speculative thread enqueues both the acquire and the release; once it is homefree, it actually acquires the latch, commits its work, and releases.
Result: small critical sections.
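A toy model of lazy acquire (an invented class; real latches involve actual synchronization hardware): latch operations are enqueued during speculation and replayed only at commit, so the latch is truly held only during the short commit window.

```python
class LazyLatch:
    """Toy model of lazy latch acquire under TLS."""
    def __init__(self):
        self.queue = []   # deferred latch operations
        self.held = False
        self.trace = []   # (operation, held-after) pairs, for inspection

    def speculative_acquire(self):
        self.queue.append("acquire")  # enqueue; latch itself untouched

    def speculative_release(self):
        self.queue.append("release")

    def homefree_commit(self):
        # The thread is homefree: replay the latch operations for real.
        for op in self.queue:
            self.held = (op == "acquire")
            self.trace.append((op, self.held))
        self.queue.clear()
```

The point mirrored here is that during the whole speculative execution `held` stays False, which is what shrinks the critical section relative to aggressive acquire.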
Applying TLS
1. Parallelize loop
2. Run benchmark
3. Remove bottleneck
4. Go to 2