1 A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University

2 Outline • Motivation • Thread-level speculation (TLS) • Coherence scheme • Optimizations • Methodology • Results • Conclusion

3 Motivation • Leading chip manufacturers are moving to multi-core architectures • These extra cores are usually used to increase throughput • To exploit these parallel resources for single-program performance, programs must be parallelized • Integer programs are hard to parallelize • Solution: use speculation, i.e., thread-level speculation (TLS)!

4 Thread level speculation (TLS)

5 Scalable Approach • The paper aims for a scalable design that applies to a wide variety of multiprocessor-like architectures • The only requirement is that the architecture be shared-memory based • TLS is implemented on top of an invalidation-based cache coherence protocol

6 Example • Each cache line has special bits: SL means a speculative load has accessed the line; SM means the line has been speculatively modified • A thread is squashed when an invalidation arrives for a line that is present with SL set and the invalidation comes from a logically earlier epoch
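
A minimal sketch (not from the paper) of this squash condition in C; the struct, field names, and the simplified epoch comparison are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-line speculative state; names are assumptions. */
typedef struct {
    bool valid;   /* the line is present in the cache            */
    bool sl;      /* a speculative load has accessed the line    */
    bool sm;      /* the line has been speculatively modified    */
} spec_line_t;

/* Simplified "logically earlier" test on sequence numbers only;
 * the real scheme uses epoch numbers with a TID part (slide 16). */
static bool is_logically_earlier(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* A thread must be squashed when an invalidation from epoch
 * inv_epoch hits a line it has speculatively loaded and the
 * invalidation comes from a logically earlier epoch. */
bool must_squash(const spec_line_t *line,
                 uint32_t inv_epoch, uint32_t my_epoch)
{
    return line->valid && line->sl
        && is_logically_earlier(inv_epoch, my_epoch);
}
```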

7 Speculation level • We are concerned only with the speculation level: the level in the cache hierarchy where the speculative coherence protocol begins • All other levels of the hierarchy can be ignored

8 Cache line states • In addition to the usual cache state bits, each line needs SL and SM bits • A cache line with speculative bits set cannot be replaced • Instead, the thread is squashed or the conflicting operation is delayed
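
A rough sketch of this replacement rule; the policy names are hypothetical, not the paper's interface:

```c
#include <stdbool.h>

/* Hypothetical replacement policy sketch; names are illustrative. */
typedef enum { REPLACE_OK, DELAY_ACCESS, SQUASH_THREAD } repl_action_t;

/* A line whose SL or SM bit is set holds speculative state and
 * cannot simply be evicted: either the access that needs the frame
 * is delayed, or the speculative thread is squashed so the state
 * may be discarded. */
repl_action_t on_replacement(bool sl, bool sm, bool can_delay)
{
    if (!sl && !sm)
        return REPLACE_OK;          /* no speculative state: evict normally */
    return can_delay ? DELAY_ACCESS /* stall the conflicting operation      */
                     : SQUASH_THREAD;
}
```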

9 Basic cache coherence protocol • When a processor wants to load a value, it needs at least shared access to the line • When it wants to write, it needs exclusive access • The coherence mechanism issues invalidation messages when it receives a request for exclusive access
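
The underlying access rule, sketched in C for illustration; the enum and function are assumptions, not the protocol's actual interface:

```c
#include <stdbool.h>

/* Plain invalidation-based access rule, before speculation is added. */
typedef enum { INVALID, SHARED, EXCLUSIVE } access_t;

/* A load needs at least shared access; a store needs exclusive
 * access, which causes the coherence mechanism to invalidate all
 * other cached copies of the line. */
bool needs_coherence_request(access_t have, bool is_store)
{
    if (is_store)
        return have != EXCLUSIVE;   /* upgrade request triggers invalidations */
    return have == INVALID;         /* shared access suffices for a load      */
}
```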

10 Coherence mechanism

11 Commit • Once the homefree token arrives, there is no possibility of further squashes • SpE is changed to E and SpS to S • Lines with the SM bit set must have the dirty (D) bit set • If a line was speculatively modified while shared, exclusive access must be obtained for it; the ownership-required buffer (ORB) is used to track such lines
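
A sketch of what commit might look like, assuming illustrative state names (SpE, SpS, E, S) and a simple array for the ORB; this is not the paper's implementation:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative line layout; names are assumptions. */
typedef enum { STATE_SpE, STATE_SpS, STATE_E, STATE_S } line_state_t;

typedef struct {
    line_state_t state;
    bool sl, sm;    /* speculative load / speculative modification bits */
    bool dirty;     /* the usual dirty (D) bit                          */
} line_t;

/* Once the homefree token arrives, no further squash is possible,
 * so speculative state is promoted to architectural state. */
void commit_line(line_t *ln)
{
    if (ln->state == STATE_SpE) ln->state = STATE_E;  /* SpE -> E */
    if (ln->state == STATE_SpS) ln->state = STATE_S;  /* SpS -> S */
    if (ln->sm)
        ln->dirty = true;      /* speculative writes become dirty data */
    ln->sl = ln->sm = false;   /* clear the speculative bits           */
}

/* Lines that were speculatively modified while only shared still
 * need exclusive access; the ownership-required buffer (ORB) records
 * them so upgrade requests can be issued at commit time. */
void drain_orb(line_t **orb, size_t n, void (*request_exclusive)(line_t *))
{
    for (size_t i = 0; i < n; i++)
        request_exclusive(orb[i]);
}
```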

12 Squash • All speculatively modified lines have to be invalidated • SpE is changed to E and SpS to S
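
The corresponding squash action, under the same illustrative assumptions as the commit sketch above:

```c
#include <stdbool.h>

/* Same illustrative layout as the commit sketch; repeated here so
 * the snippet stands alone. */
typedef enum { STATE_INVALID, STATE_SpE, STATE_SpS, STATE_E, STATE_S } line_state_t;

typedef struct {
    line_state_t state;
    bool sl, sm;
} line_t;

/* On a violation, speculative work is discarded: speculatively
 * modified lines are invalidated, speculatively loaded (but clean)
 * lines drop back to E or S, and the bits are cleared. */
void squash_line(line_t *ln)
{
    if (ln->sm) {
        ln->state = STATE_INVALID;   /* discard speculative data */
    } else {
        if (ln->state == STATE_SpE) ln->state = STATE_E;
        if (ln->state == STATE_SpS) ln->state = STATE_S;
    }
    ln->sl = ln->sm = false;
}
```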

13 Performance Optimizations • Forwarding data between epochs: predictable data dependences are synchronized explicitly • Dirty and speculatively loaded state: normally a dirty line that is speculatively loaded must be flushed first; this can be avoided • Suspending violations: when a speculative line must be evicted, the epoch can be suspended instead of squashed
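
As an example of the first optimization, forwarding a predictable value between epochs can be expressed as an explicit wait/signal pair; the sketch below uses C11 atomics and hypothetical names, and is not the paper's exact mechanism:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One slot per forwarded value; names are illustrative. */
typedef struct {
    atomic_bool ready;
    long        value;
} forward_slot_t;

/* Producer (logically earlier epoch): publish the value, then signal. */
void signal_value(forward_slot_t *slot, long v)
{
    slot->value = v;
    atomic_store_explicit(&slot->ready, true, memory_order_release);
}

/* Consumer (logically later epoch): wait until the value has been
 * forwarded instead of speculating on it and risking a squash. */
long wait_value(forward_slot_t *slot)
{
    while (!atomic_load_explicit(&slot->ready, memory_order_acquire))
        ;  /* busy-wait; a real implementation might yield or block */
    return slot->value;
}
```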

14 Multiple writers • If two epochs write to the same line, one would have to be squashed to avoid the multiple-writer problem • This can be avoided by maintaining fine-grained disambiguation bits (e.g., per word rather than per line)
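
A sketch of fine-grained disambiguation, assuming one SM bit per word of an 8-word line; the sizes and names are illustrative:

```c
#include <stdint.h>

#define WORDS_PER_LINE 8

/* Per-word speculative-modification tracking instead of one SM bit
 * for the whole line. */
typedef struct {
    uint32_t data[WORDS_PER_LINE];
    uint8_t  sm_mask;   /* bit i set => word i was speculatively written */
} fg_line_t;

/* Record a speculative write to one word of the line. */
void spec_write(fg_line_t *ln, unsigned word, uint32_t value)
{
    ln->data[word] = value;
    ln->sm_mask |= (uint8_t)(1u << word);
}

/* At commit, merge only the words this epoch actually wrote into the
 * architectural copy, so two epochs writing different words of the
 * same line need not squash each other. */
void commit_merge(const fg_line_t *ln, uint32_t *arch_line)
{
    for (unsigned w = 0; w < WORDS_PER_LINE; w++)
        if ((ln->sm_mask >> w) & 1u)
            arch_line[w] = ln->data[w];
}
```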

15 Implementation

16 Epoch numbers • An epoch number has two parts: a TID and a sequence number • To avoid a costly comparison on every access, the comparison result is precomputed into a logically-later mask • Epoch numbers are maintained in one place per chip
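
A sketch of how the logically-later mask might be built and used; the encoding and names are assumptions, not the paper's exact hardware:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_CONTEXTS 32

/* An epoch number: which thread context, and its position in the
 * sequential order of epochs. */
typedef struct {
    uint32_t tid;
    uint32_t seq;
} epoch_t;

/* Full comparison, used only when the mask is (re)built. */
static bool logically_later(epoch_t a, epoch_t b)
{
    return (int32_t)(a.seq - b.seq) > 0;   /* a comes after b */
}

/* Rebuild the mask: bit t is set if context t's current epoch is
 * logically later than ours. Done when epoch numbers are assigned,
 * not on every memory access. */
uint32_t build_later_mask(epoch_t mine, const epoch_t ctx[], unsigned n)
{
    uint32_t mask = 0;
    for (unsigned t = 0; t < n && t < MAX_CONTEXTS; t++)
        if (logically_later(ctx[t], mine))
            mask |= 1u << t;
    return mask;
}

/* On an access, deciding whether the requester is logically later is
 * then a single bit test. */
static inline bool requester_is_later(uint32_t mask, unsigned tid)
{
    return (mask >> tid) & 1u;
}
```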

17 Speculative state implementation

18 Multiple writers - implementation • False violations (caused by accesses to different words of the same line) are also handled in the same way

19 Correctness considerations • Speculation fails if speculative state is lost • Exceptions are handled only after the homefree token is received • System calls are likewise postponed
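
A minimal sketch of gating irreversible operations on the homefree token; the context struct and policy are illustrative assumptions:

```c
#include <stdbool.h>

/* Illustrative per-epoch context. */
typedef struct {
    bool homefree;   /* token received: no further squash possible */
} epoch_ctx_t;

/* Exceptions and system calls have side effects that cannot be
 * undone by a squash, so a simple policy is to stall the epoch
 * until the homefree token arrives before performing them. */
bool may_perform_irreversible_op(const epoch_ctx_t *e)
{
    return e->homefree;   /* otherwise: stall or postpone the operation */
}
```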

20 Methodology • A detailed out-of-order simulator based on the MIPS R10000 is used • Fork and other synchronization overheads are 10 cycles

21 Results • Normalized execution cycles

22 Results • For buk and equake, memory performance is the bottleneck • Beyond 4 processors, ijpeg performance degrades: fewer threads are available and there are some cache conflicts

23 Overheads • Violations • Cache locality is important • ORB size can be further reduced by releasing ORB entries early

24 Communication overhead • Buk is insensitive to communication latency

25 Multiprocessor performance • Advantage: more cache storage • Disadvantage: increased communication latency

26 Conclusion • Using TLS, even integer programs can be parallelized to obtain speedup • The approach is scalable and can be applied to various other architectures that support multiple threads • Some applications are insensitive to communication latency, so large-scale parallel architectures using TLS are possible

27 Thanks!

