1
More on Thread Level Speculation
Anthony Gitter, Dafna Shahaf, Or Sheffet
2
Thread Level Speculation (TLS)
A technique for automatic parallelization.
Run threads in parallel, but in a speculative state.
Check for violations.
Commit upon successful completion.
Squash when detecting a violation.
– Propagate the squash onwards.
– Re-run the thread.
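The run / check / commit / squash cycle can be illustrated with a small software model. This is only a sketch under stated assumptions, not how TLS hardware is built: `run_tls`, the `read`/`write` callbacks, and the dictionary standing in for memory are all hypothetical names, and every epoch is assumed to start from the same initial memory, as if all had been launched in parallel. Because speculative values are never forwarded between epochs in this model, a squash does not need to be propagated further; each later epoch is checked again when its predecessors commit.

```python
def run_tls(tasks, memory):
    """Run `tasks` (given in program order) speculatively; return the squash count.

    Each task is a function task(read, write). Hypothetical model: every
    epoch first executes against the same initial memory, as if in parallel.
    """
    def execute(task):
        reads, writes = set(), {}

        def read(addr):
            if addr in writes:          # own speculative value, not an exposed read
                return writes[addr]
            reads.add(addr)             # exposed read: remember it for checking
            return memory[addr]

        def write(addr, value):
            writes[addr] = value        # buffered until commit

        task(read, write)
        return reads, writes

    states = [execute(t) for t in tasks]        # speculative execution
    squashes = 0
    for i in range(len(tasks)):                 # commit in program order
        _, writes = states[i]
        memory.update(writes)                   # commit: make writes visible
        for j in range(i + 1, len(tasks)):
            if states[j][0] & set(writes):      # later epoch read stale data
                squashes += 1
                states[j] = execute(tasks[j])   # squash and re-run
    return squashes


def t0(read, write):
    write("a", read("a") + 1)

def t1(read, write):
    write("b", read("a") * 10)      # depends on t0's write to "a"

def t2(read, write):
    write("c", 5)                   # independent of the others

mem = {"a": 1, "b": 0, "c": 0}
squashed = run_tls([t0, t1, t2], mem)
```

In this example only `t1` is squashed, and the final memory matches what a sequential run would produce.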
3
Thread Level Speculation Example
4
Mechanism of TLS
1. Managing speculative state.
2. Disambiguation: checking addresses for violating dependences
– Eager vs. lazy
3. Upon commit
– Broadcast (to everybody? only to relevant threads?)
– Invalidate/update of other threads
– Leave speculative state
4. Upon squash
– Broadcast
– Invalidate changes for this thread
– Re-run
Done at the hardware level. Involves the cache. Simple. Fast.
5
Scenarios
Thread attributes:
– Length
– Memory accesses
– Dependences

Length | Accesses | Depend. | Result
?      | ?        | Many    | Serial
?      | ?        | 0       | Easily parallel
Short  | Many     | Few     | TLS costly
Short  | Few      | Few     | TLS works
Long   | Many     | Few     | TLS costly
Long   | Few      | Few     | TLS costly
6
When is TLS Too Costly?
“Too much data” scenario
– Thread touches too many addresses.
“Too much time” scenario
– Execution involves many instructions (e.g. database transactions).
Bulk Disambiguation of Speculative Threads in Multiprocessors. Ceze, Tuck, Cascaval, Torrellas.
Tolerating Dependences Between Large Speculative Threads Via Sub-Threads. Colohan, Ailamaki, Steffan, Mowry.
7
Too Many Addresses – Solution 1
Each thread maintains a bitwise mask of the cache.
Set a bit when touching an address.
Upon completion, check the addresses you and others touched (lazy).
Commit / squash: send the mask.
Invalidating/replacing/changing address state in the cache: use the mask.
All bitwise operations. Very simple!
Infeasible for size reasons (won’t scale).
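A minimal sketch of the mask idea, assuming one bit per line of a direct-mapped cache. The sizes and the `AccessMask` class are hypothetical and chosen only for illustration.

```python
LINE_BYTES = 64      # hypothetical cache line size
NUM_LINES = 1024     # hypothetical number of lines in a direct-mapped cache


class AccessMask:
    """One bit per cache line, set when the thread touches that line."""

    def __init__(self):
        self.bits = 0

    def touch(self, addr):
        line = (addr // LINE_BYTES) % NUM_LINES
        self.bits |= 1 << line

    def overlaps(self, other):
        # A single bitwise AND tells whether both threads touched a common line.
        return (self.bits & other.bits) != 0


a, b, c = AccessMask(), AccessMask(), AccessMask()
a.touch(0x1000)
b.touch(0x1010)      # same 64-byte line as 0x1000
c.touch(0x2000)      # a different line
```

The check is a single AND, but every thread needs `NUM_LINES` bits, and the whole mask is sent on each commit or squash, which is the scaling problem the slide points out.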
8
Solution: Hash!
Introducing BULK: hardware that hashes the address space into a signature (~2k in size).
[Figure: addresses from the address space are hashed to bit vectors (e.g. 010100001 and 001100100) and combined by bitwise OR into the signature (011100101).]
Upon completion, send the signature!
Upon receiving, pull it back to a superset of the possible addresses.
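The signature behaves much like a Bloom filter with a fixed width. The sketch below is an illustration under assumptions, not the hash used by the Bulk hardware: it reads "~2k" as 2048 bits, and `bit_for`, `Signature`, and `expand` are hypothetical names. The "pull back" step is modeled as filtering a set of candidate addresses (for example, those currently cached) against the signature.

```python
SIG_BITS = 2048      # assumption: reads the slide's "~2k" as 2048 bits
INDEX_BITS = 11      # 2**11 == SIG_BITS


def bit_for(addr):
    """Map an address to one signature bit (hypothetical multiplicative hash)."""
    index = ((addr * 2654435761) & 0xFFFFFFFF) >> (32 - INDEX_BITS)
    return 1 << index


class Signature:
    def __init__(self):
        self.bits = 0

    def insert(self, addr):
        self.bits |= bit_for(addr)          # accumulate with bitwise OR

    def may_contain(self, addr):
        # True for every inserted address, and possibly for aliases.
        return (self.bits & bit_for(addr)) != 0

    def expand(self, candidates):
        """Return the candidates that may have been inserted (a superset)."""
        return {a for a in candidates if self.may_contain(a)}


touched = {0x1000, 0x1004, 0x2340}
sig = Signature()
for addr in touched:
    sig.insert(addr)

cached = set(range(0x1000, 0x3000, 4))      # hypothetical cached word addresses
superset = sig.expand(cached)
```

Every touched address is guaranteed to appear in `superset`; addresses that hash to the same bit may appear as well, which is why the result is a superset rather than the exact set.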
9
Bulk
Features:
– Separate read and write signatures.
– Committing: sending the signature.
– Invalidating: pulling the signature back into a superset.
– Granularity is at the word level (not cache line), since we map addresses.
Caveat: we might see violations even if there weren't any!
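The caveat about spurious violations follows directly from aliasing in the hash. Here is a small sketch, assuming a deliberately tiny 8-bit signature and a modulo hash so the aliasing is easy to see. The rule in `must_squash` (intersect the committer's write signature with the receiver's read and write signatures) is one plausible formulation of the check, and all names are hypothetical.

```python
SIG_BITS = 8     # deliberately tiny so that aliasing is easy to see


def bit_for(addr):
    return 1 << (addr % SIG_BITS)       # hypothetical hash


def signature(addrs):
    sig = 0
    for a in addrs:
        sig |= bit_for(a)
    return sig


def must_squash(committed_w, my_r, my_w):
    """Squash if the committed write signature intersects our read or write signature."""
    return (committed_w & (my_r | my_w)) != 0


w_commit = signature({3})                                        # committer wrote address 3

true_conflict = must_squash(w_commit, signature({3}), signature(set()))
false_positive = must_squash(w_commit, signature({11}), signature(set()))   # 11 % 8 == 3
no_conflict = must_squash(w_commit, signature({4}), signature(set()))
```

The receiver that read address 11 never touched address 3, yet it is squashed because both addresses map to the same bit. Signatures never miss a real conflict, only report extra ones.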
10
Bulk Performance
11
Fraction of False Positives as a function of Signature Length
12
When is TLS Too Costly?
“Too much data” scenario
– Thread touches too many addresses.
“Too much time” scenario
– Execution involves many instructions (e.g. database transactions).
Bulk Disambiguation of Speculative Threads in Multiprocessors. Ceze, Tuck, Cascaval, Torrellas.
Tolerating Dependences Between Large Speculative Threads Via Sub-Threads. Colohan, Ailamaki, Steffan, Mowry.
13
Handling Long Threads (Attempt 1)
Q: Does eliminating a data dependence help?
[Figure: timeline of parallel threads with stores (*p=, *q=) and loads (=*p, =*q); a load of *p is marked "Violation!".]
Upon violation, we re-execute a long thread.
Image courtesy Chris Colohan
14
Handling Long Threads (Attempt 1)
[Figure: the same timeline next to a version with the *p dependence eliminated ("Eliminate *p Dep."); in that version the load of *q after the store *q= is marked "Violation!".]
Image courtesy Chris Colohan
15
Handling Long Threads (Attempt 2): Sub-Threads
Sub-threads are checkpoints during thread execution.
No longer “all or nothing”.
Must be lightweight.
Help with primary and secondary violations.
[Figure: timeline with the store *q= and the load =*q marked "Violation!".]
Image courtesy Chris Colohan
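The benefit of checkpoints can be put in numbers with a small sketch. It assumes a thread is a sequence of operations numbered from 0 and that a violation rewinds to the latest checkpoint at or before the violated load. The function names and the figures in the example are made up for illustration.

```python
def restart_point(violated_op, checkpoints):
    """Latest checkpoint at or before the violated operation.

    `checkpoints` must include 0, the start of the thread.
    """
    return max(c for c in checkpoints if c <= violated_op)


def reexecuted_ops(ops_done, violated_op, checkpoints):
    """Number of already-executed operations that must be run again."""
    return ops_done - restart_point(violated_op, checkpoints)


# A thread has executed 1000 operations when its load at operation 620 is violated.
no_sub = reexecuted_ops(1000, 620, [0])                      # all or nothing
with_sub = reexecuted_ops(1000, 620, [0, 250, 500, 750])     # sub-thread checkpoints
```

Without sub-threads all 1000 operations are repeated; with a checkpoint every 250 operations the thread restarts at operation 500 and repeats only 500.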
16
Sub-thread Implementation
Assume a CMP with a shared L2.
L1 is unaware of sub-threads
– Speculatively modified bit per cache line
L2 performs eager violation detection
– 2 additional bits per cache line per sub-thread
– Replication to track different sub-thread contexts
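One way to picture the two bits per cache line per sub-thread is as a load bit and a store bit, which is how the backup slides label them. The sketch below models a single L2 line for one speculative thread. The `SpecLine` class is hypothetical, and the rule that only exposed loads set the load bit is an assumption made for this illustration.

```python
class SpecLine:
    """Speculative state of one L2 line for one thread, with bits per sub-thread."""

    def __init__(self, num_subthreads):
        self.loaded = [False] * num_subthreads   # load bit per sub-thread
        self.stored = [False] * num_subthreads   # store bit per sub-thread

    def on_load(self, sub):
        # Assumption: a load that follows the thread's own store is not exposed,
        # so it cannot be violated and does not set the load bit.
        if not any(self.stored[: sub + 1]):
            self.loaded[sub] = True

    def on_store(self, sub):
        self.stored[sub] = True

    def rewind_target(self):
        """On a store from an earlier thread: earliest sub-thread to roll back, or None."""
        for sub, bit in enumerate(self.loaded):
            if bit:
                return sub
        return None


line = SpecLine(4)
line.on_load(2)
line.on_load(3)
target = line.rewind_target()        # rewinds to sub-thread 2, keeping 0 and 1

private = SpecLine(4)
private.on_store(0)
private.on_load(1)                   # reads the thread's own value
no_target = private.rewind_target()
```

Checking the bits at the time of the conflicting store is what makes the detection eager: the violated sub-thread is identified immediately instead of at commit.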
17
Sub-thread Evaluation
[Figure: stacked bar chart of normalized time (0 to 1.2), broken down into Idle CPU, Failed, Cache Miss, and Busy, for New Order, New Order 150, Delivery, Delivery Outer, Stock Level, Payment, and Order Status, with three bars (N, S, L) per benchmark.]
N = no sub-threads, S = with sub-threads, L = limit, ignoring violations
Image courtesy Chris Colohan
18
Summary
Thread attributes:
– Length
– Memory accesses
– Dependences

Length | Accesses | Depend. | Result
?      | ?        | Many    | Serial
?      | ?        | 0       | Easily parallel
Short  | Few      | Few     | TLS works
Short  | Many     | Few     | TLS costly: BULK
Long   | Few      | Few     | TLS costly: Sub-Threads
Long   | Many     | Few     | Hopeless??
19
Open Questions
Long threads that also touch many addresses.
– Bulk on top of sub-threads?
Combining lazy/eager evaluations
Thank you!
20
Backup Slides
21
Buffering Large Threads
Operations shown: store X, 0x00
[Figure: two L1 caches and the L2 cache, each with lines 0x00 and 0x01; value X is buffered and marked S1.]
Store and load bit per thread
Slide courtesy Chris Colohan
22
Buffering Large Threads
Operations shown: store X, 0x00; store A, 0x01
[Figure: two L1 caches and the L2 cache, each with lines 0x00 and 0x01; values X and A are buffered and marked S1.]
Slide courtesy Chris Colohan
23
Buffering Large Threads
Operations shown: store X, 0x00; store A, 0x01; load 0x00
[Figure: two L1 caches and the L2 cache, each with lines 0x00 and 0x01; values X and A, with bits S1 and L2.]
Slide courtesy Chris Colohan
24
Buffering Large Threads
Operations shown: store X, 0x00; store A, 0x01; load 0x00; store Y, 0x00
[Figure: two L1 caches and the L2 cache, each with lines 0x00 and 0x01; values X, Y, and A, with bits S1, S2, and L2.]
Replicate line: one version per thread
Slide courtesy Chris Colohan
25
Buffering Large Threads
Operations shown: store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00
[Figure: two L1 caches and the L2 cache, each with lines 0x00 and 0x01; values X, Y, and A, with bits S1, S2, and L2.]
Slide courtesy Chris Colohan
26
Buffering Large Threads
Operations shown: store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00; store B, 0x01
[Figure: two L1 caches and the L2 cache, each with lines 0x00 and 0x01; values X, Y, A, and B, with bits S1, S2, and L2.]
Slide courtesy Chris Colohan
27
Sub-thread Support
Operations shown: store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00; store B, 0x01
[Figure: the same cache contents, with the thread's operations bracketed into two sub-threads, a and b.]
Divide into two sub-threads
Only roll back violated sub-thread
Slide courtesy Chris Colohan
28
Sub-thread Support
Operations shown: store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00; store B, 0x01
[Figure: two L1 caches and the L2 cache, each with lines 0x00 and 0x01; values X, Y, A, and B, with bits S1a, S2a, L2a, and L2b; the operations are bracketed into sub-threads a and b.]
Store and load bit per sub-thread
Slide courtesy Chris Colohan
Copyright 2006 Chris Colohan
29
Sub-thread Support
Operations shown: store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00; store B, 0x01
[Figure: two L1 caches and the L2 cache, each with lines 0x00 and 0x01; values X, Y, A, and B, with bits S1a, S1b, S2a, L2a, and L2b; the operations are bracketed into sub-threads a and b.]
Slide courtesy Chris Colohan
30
Sub-thread Evaluation
Evaluate using large database transactions.
Parallelize the loops.
Can we place an upper bound on the possible speedup?