1
A Scalable Approach to Thread-Level Speculation
J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry
Carnegie Mellon University
2
Outline
Motivation
Thread-level speculation (TLS)
Coherence scheme
Optimizations
Methodology
Results
Conclusion
3
Motivation
Leading chip manufacturers are moving to multi-core architectures, which are usually used to increase throughput.
To exploit these parallel resources for higher single-program performance, we need to parallelize programs.
Integer programs are hard to parallelize.
Solution: use speculation – thread-level speculation (TLS)!
4
Thread level speculation (TLS)
5
Scalable Approach
The paper aims to design a scalable approach that applies to a wide variety of multiprocessor architectures.
The only requirement is that the architecture be shared-memory based.
TLS is implemented on top of an invalidation-based cache coherence protocol.
6
Example
Each cache line has two special bits:
SL – a speculative load has accessed the line
SM – the line has been speculatively modified
A thread is squashed when an incoming invalidation hits a line that is present, the line's SL bit is set, and the epoch number carried with the invalidation indicates a logically earlier thread – a true dependence violation (see the sketch below).
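To make the squash condition concrete, here is a minimal C++ sketch of how a cache controller might check an incoming invalidation against the speculative bits. The CacheLine structure and function names are hypothetical illustrations, not the paper's actual hardware interface.

#include <cstdint>

// Hypothetical per-line speculative state (illustration only).
struct CacheLine {
    bool valid = false;
    bool SL = false;   // a speculative load has read this line
    bool SM = false;   // this line has been speculatively modified
};

// Simplified stand-in: returns true if epoch 'a' is logically earlier than 'b'.
// (The real scheme precomputes this comparison; see the epoch-number slide.)
bool logicallyEarlier(uint64_t a, uint64_t b) { return a < b; }

// Called when an external invalidation arrives for 'line', carrying the
// epoch number of the writer that caused it.
bool mustSquash(const CacheLine& line, uint64_t writerEpoch, uint64_t myEpoch) {
    // True dependence violation: a logically earlier epoch wrote a location
    // that this epoch has already speculatively loaded.
    return line.valid && line.SL && logicallyEarlier(writerEpoch, myEpoch);
}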
7
Speculation level
We are concerned only with the speculation level – the level in the cache hierarchy at which the speculative coherence protocol operates.
All other levels of the hierarchy can be ignored.
8
Cache line states
In addition to the usual cache state bits, we need the SL and SM bits.
A cache line with speculative bits set cannot be replaced.
When such a replacement is attempted, either the thread is squashed or the operation is delayed.
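The replacement rule can be sketched the same way; whether to delay or squash is a design choice, and the fields and names below are illustrative assumptions.

// Hypothetical outcome of trying to evict a cache line during speculation.
enum class EvictAction { Replace, Delay, Squash };

struct CacheLine { bool SL = false; bool SM = false; };

// A speculative line cannot simply be dropped: its SL/SM state would be lost.
// Either the replacement is delayed (the epoch waits or is suspended) or the
// epoch is squashed and re-executed.
EvictAction onEviction(const CacheLine& victim, bool canSuspend) {
    if (!victim.SL && !victim.SM)
        return EvictAction::Replace;        // nothing speculative to lose
    return canSuspend ? EvictAction::Delay  // e.g. wait for the homefree token
                      : EvictAction::Squash;
}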
9
Basic cache coherence protocol
When a processor wants to load a value, it needs at least shared access to the line.
When it wants to write, it needs exclusive access.
The coherence mechanism issues invalidation messages when it receives a request for exclusive access.
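The sketch below illustrates, in simplified C++, the underlying invalidation-based protocol that TLS extends: loads need at least shared access, stores need exclusive access, and a request for exclusive access makes the coherence mechanism invalidate other copies. The MESI-style states and function names are assumptions for illustration only.

enum class State { Invalid, Shared, Exclusive, Modified };

struct Line { State state = State::Invalid; };

// Hypothetical hook: ask the directory/bus to invalidate all other copies.
void sendInvalidations(int lineAddr) { (void)lineAddr; /* coherence request */ }

void onLoad(Line& l) {
    if (l.state == State::Invalid)
        l.state = State::Shared;            // read miss: gain shared access
}

void onStore(Line& l, int lineAddr) {
    if (l.state != State::Exclusive && l.state != State::Modified) {
        sendInvalidations(lineAddr);        // a write needs exclusive access
        l.state = State::Exclusive;
    }
    l.state = State::Modified;              // the local copy is now dirty
}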
10
Coherence mechanism
11
Commit
When the homefree token arrives, there is no possibility of further squashes.
SpE is changed to E and SpS to S.
Lines with the SM bit set must have the dirty (D) bit set.
If a line was speculatively modified while shared, we must obtain exclusive access for that line.
The ownership required buffer (ORB) is used to track such lines.
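A rough C++ sketch of the commit step, assuming speculative states SpS/SpE and an ORB holding the addresses of lines that were speculatively modified while shared; all types and names are hypothetical illustrations.

#include <vector>

enum class State { Invalid, Shared, Exclusive, SpS, SpE };

struct Line {
    State state = State::Invalid;
    bool SL = false, SM = false;
    bool dirty = false;
};

void requestExclusive(int lineAddr) { (void)lineAddr; /* ownership upgrade */ }

// Called once the homefree token arrives: no further squashes are possible.
void commitEpoch(std::vector<Line>& cache, const std::vector<int>& orb) {
    for (Line& l : cache) {
        if (l.state == State::SpE) l.state = State::Exclusive;
        if (l.state == State::SpS) l.state = State::Shared;
        if (l.SM) l.dirty = true;   // speculative writes become ordinary dirty data
        l.SL = l.SM = false;
    }
    // Lines that were speculatively modified while only shared still need
    // exclusive ownership; the ORB remembers them.
    for (int lineAddr : orb) requestExclusive(lineAddr);
}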
12
Squash
All speculatively modified lines must be invalidated.
For the remaining speculative lines, SpE is changed to E and SpS to S.
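For contrast, a matching sketch of the squash path (re-declaring the same hypothetical types so it stands alone); the exact transitions here are an assumption based on the slide text.

#include <vector>

enum class State { Invalid, Shared, Exclusive, SpS, SpE };
struct Line { State state = State::Invalid; bool SL = false, SM = false; };

// Undo an epoch that violated a dependence.
void squashEpoch(std::vector<Line>& cache) {
    for (Line& l : cache) {
        if (l.SM) {
            l.state = State::Invalid;     // speculative modifications are discarded
        } else if (l.state == State::SpE) {
            l.state = State::Exclusive;   // only speculatively loaded: data is still valid
        } else if (l.state == State::SpS) {
            l.state = State::Shared;
        }
        l.SL = l.SM = false;
    }
}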
13
Performance Optimizations
Forwarding data between epochs: predictable data dependences are synchronized, and the value is forwarded from the earlier epoch to the later one (a software sketch follows below).
Dirty and speculatively loaded state: normally, when a dirty line is speculatively loaded it must first be flushed; a combined state avoids this.
Suspending violations: when we have to evict a speculative line, we don't need to squash – the epoch can be suspended instead.
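To illustrate the first optimization, here is a minimal software sketch of forwarding a predictable loop-carried value from one epoch to the next with explicit synchronization instead of speculation; the ForwardSlot structure and its functions are assumptions, not the paper's mechanism.

#include <atomic>
#include <thread>

// Hypothetical forwarding slot for one loop-carried scalar.
struct ForwardSlot {
    std::atomic<bool> ready{false};
    int value = 0;
};

// Producing (earlier) epoch: compute the value, then signal the next epoch.
void signalValue(ForwardSlot& slot, int v) {
    slot.value = v;
    slot.ready.store(true, std::memory_order_release);
}

// Consuming (later) epoch: wait until the earlier epoch has produced the value.
int waitForValue(ForwardSlot& slot) {
    while (!slot.ready.load(std::memory_order_acquire))
        std::this_thread::yield();   // spin; real hardware would stall or forward
    return slot.value;
}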
14
Multiple writers
If two epochs write to the same cache line, one would normally have to be squashed to avoid the multiple-writer problem.
This can be avoided by maintaining fine-grained disambiguation bits (see the sketch below).
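A sketch of what fine-grained disambiguation might look like: per-word SM bits record exactly which words an epoch wrote, so on commit only those words are merged and another epoch's writes to other words of the same line survive. The word granularity and all names are illustrative assumptions.

#include <cstdint>
#include <cstddef>

constexpr std::size_t kWordsPerLine = 8;

struct FineGrainLine {
    uint32_t words[kWordsPerLine] = {};
    bool smWord[kWordsPerLine] = {};   // per-word "speculatively modified" bits
};

void speculativeStore(FineGrainLine& line, std::size_t word, uint32_t value) {
    line.words[word] = value;
    line.smWord[word] = true;          // remember exactly which word was written
}

// On commit, merge only the words this epoch wrote into the committed copy.
void mergeOnCommit(const FineGrainLine& spec, FineGrainLine& committed) {
    for (std::size_t w = 0; w < kWordsPerLine; ++w)
        if (spec.smWord[w]) committed.words[w] = spec.words[w];
}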
15
Implementation
16
Epoch numbers
An epoch number has two parts: a thread ID (TID) and a sequence number.
To avoid a costly comparison on every access, the comparison result is precomputed and a logically-later mask is formed (see the sketch below).
Epoch numbers are maintained in one place per chip.
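A sketch of the logically-later mask idea: rather than comparing full epoch numbers (TID plus sequence number) on every access, the relationship to every other in-flight epoch is precomputed into a bitmask that a single AND can test. The encoding and names below are assumptions for illustration.

#include <cstdint>

constexpr int kMaxEpochs = 32;         // in-flight epochs tracked per chip

struct EpochNumber {
    uint32_t tid;        // which speculative thread
    uint32_t sequence;   // position in the original sequential order
};

// Full comparison, done rarely (e.g. when epochs are created).
bool logicallyLater(const EpochNumber& a, const EpochNumber& b) {
    return a.sequence > b.sequence;
}

// Precompute, for epoch 'me', one bit per slot saying "that epoch is
// logically later than me"; per-access checks then reduce to a single AND.
uint32_t buildLaterMask(const EpochNumber& me,
                        const EpochNumber (&inFlight)[kMaxEpochs]) {
    uint32_t mask = 0;
    for (int i = 0; i < kMaxEpochs; ++i)
        if (logicallyLater(inFlight[i], me)) mask |= (1u << i);
    return mask;
}

// Fast check used on every coherence action.
bool isLater(uint32_t laterMask, int epochSlot) {
    return (laterMask >> epochSlot) & 1u;
}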
17
Speculative state implementation
18
Multiple writers – implementation
False violations (apparent dependences caused only by sharing a cache line, not the same word) are also handled in the same way.
19
Correctness considerations
Speculation fails if the speculative state is lost.
Exceptions are handled only once the homefree token has been received.
System calls are also postponed until then.
20
Methodology
A detailed out-of-order simulation modeled on the MIPS R10000 is performed.
Fork and other synchronization overheads are 10 cycles.
21
Results Normalized execution cycles
22
Results
For buk and equake, memory performance is the bottleneck.
When scaled beyond 4 processors, ijpeg performance degrades: the number of available threads is small, and there are some conflicts in the cache.
23
Overheads
Violations are a significant source of overhead, so cache locality is important.
The ORB size can be further reduced through early release of ORB entries.
24
Communication overhead
Buk is insensitive to communication latency.
25
Multiprocessor performance
Advantage: more cache storage.
Disadvantage: increased communication latency.
26
Conclusion
By using TLS, even integer programs can be parallelized to obtain speedup.
The approach is scalable and can be applied to various other architectures that support multiple threads.
Some applications are insensitive to communication latency, so large-scale parallel architectures using TLS are possible.
27
Thanks!