Published by Garry Goodwin. Modified over 9 years ago.
1
MULTIPLEX: UNIFYING CONVENTIONAL AND SPECULATIVE THREAD-LEVEL PARALLELISM ON A CHIP MULTIPROCESSOR
Presented by: Ashok Venkatesan
Chong-Liang Ooi, Seon Wook Kim, Il Park, Rudolf Eigenmann, Babak Falsafi and T. N. Vijaykumar
2
Outline
- Background
- Thread Level Parallelism (TLP)
- Explicit & Implicit TLP
- An Example
- Multiplex
- Threading Model
- MUCS protocol
- Key Performance Factors
- Performance Analysis
- Conclusion
3
Thread Level Parallelism
- ILP Wall
  - Increasing CPI with increasing clock rates
  - Limited ILP in applications
  - Insufficient memory locality
- Using TLP
  - Increased granularity of parallelism
  - Exploitation of multi-cores
- Thread: a logical sub-process that carries its own state
  - State: instructions, data, PC, register file, stack, etc.
4
Explicit & Implicit TLP
- Explicit TLP
  - The programmer explicitly partitions the program into threads, and an API is used to dispatch and execute them on multiple cores.
  - Static: defined in the program
  - Main overhead: thread dispatch
- Implicit or Speculative TLP
  - Threads are peeled off from the program's sequential execution stream by hardware prediction.
  - Dynamic: runtime prediction
  - Main overhead: speculative state overflow
5
Example – Executing Explicit Threads
- Data dependence is resolved here using a barrier.
- Threads are dispatched using a fork (system API) call.
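The slide's figure did not survive extraction. As a rough illustration of the pattern it describes, the sketch below uses Python's `threading` module as a stand-in for the fork API and the CMP's CPUs. The names `worker`, `partial` and `result`, the two-phase workload and the thread count are all assumptions made for this example, not part of the original slides.

```python
import threading

N_THREADS = 4                      # assumed: one software thread per CPU
data = list(range(16))
partial = [0] * N_THREADS          # phase-1 results, one slot per thread
result = [0] * N_THREADS           # phase-2 results
barrier = threading.Barrier(N_THREADS)

def worker(tid: int) -> None:
    # Phase 1: each thread sums its own statically assigned slice.
    partial[tid] = sum(data[tid::N_THREADS])
    # The barrier resolves the cross-thread data dependence: phase 2
    # reads every thread's phase-1 result, so all must have finished.
    barrier.wait()
    # Phase 2: consume the values produced by all threads.
    result[tid] = partial[tid] + sum(partial)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for t in threads:                  # explicit dispatch (the "fork")
    t.start()
for t in threads:                  # wait for completion
    t.join()
```

The partitioning is fixed in the program text, which is what makes the threading explicit; the cost paid at runtime is the dispatch and the barrier wait.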
6
Example – Executing Implicit Threads
- Both data dependence and dispatch are handled by a hardware predictor.
7
Multiplex
- Unifies explicit and implicit threading on a CMP
- Obviates the need to serialize unanalyzable program segments by using speculative TLP
- Avoids implicit threading's speculation overhead and performance loss in compiler-analyzable program segments by using explicit threading
- Implements a single snoopy bus protocol to unify cache coherence with memory renaming and disambiguation
8
Anatomy of a Multiplex CMP
9
Threading Model
- Thread selection: partitioning code into distinct instruction sequences
- Thread dispatch: assigning threads to execute on different CPUs
- Data communication and speculation: propagating data between independent threads
10
Thread Selection in Multiplex
- Methodology
  - The compiler chooses between threading models
  - Explicit threading is prioritized over implicit threading
  - Implicit threads are selected at runtime by hardware speculation
  - However, software specifies the implicit thread boundaries
- Pros: minimizes explicit and implicit overheads
- Scenarios
  - Executing loops with small bodies implicitly
  - Executing tail ends of unevenly partitioned segments implicitly
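The selection policy above can be sketched as a small decision function. This is only an illustration of the stated priorities, not the compiler's actual algorithm: the `Segment` fields, the `choose_mode` name and the `SMALL_BODY` threshold are hypothetical and do not appear in the slides.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    analyzable: bool              # compiler can prove the threads independent
    body_size: int                # assumed metric: instructions per loop body
    is_uneven_tail: bool = False  # tail end of an unevenly partitioned segment

SMALL_BODY = 50  # hypothetical threshold, chosen only for this example

def choose_mode(seg: Segment) -> str:
    """Prefer explicit threading; fall back to implicit where it costs less."""
    if not seg.analyzable:
        # Unanalyzable code would otherwise be serialized: speculate instead.
        return "implicit"
    if seg.body_size < SMALL_BODY or seg.is_uneven_tail:
        # Explicit dispatch overhead would dominate such short work.
        return "implicit"
    return "explicit"

modes = [
    choose_mode(Segment(analyzable=True, body_size=400)),
    choose_mode(Segment(analyzable=True, body_size=10)),
    choose_mode(Segment(analyzable=False, body_size=400)),
    choose_mode(Segment(analyzable=True, body_size=400, is_uneven_tail=True)),
]
```

Only the first segment, which is both analyzable and large, is threaded explicitly; the other three fall into the implicit scenarios listed on the slide.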
11
Thread Dispatch – An Overview
- Dispatching conventional threads involves
  - Setting each CPU's PC to the address of the thread's first instruction
  - Assigning a private SP to each CPU
  - Copying stacks and register values prior to dispatch
- Thread Descriptor: holds thread information
  - Stores the addresses of possible subsequent dispatch target threads
  - Holds register dependency information
12
Thread Dispatch in Multiplex
- Methodology
  - Predict subsequent threads based on current threads
  - Dispatch, execute and commit sequentially
  - Re-dispatch on squashing
  - Suspend dispatch upon a mode switch to allow thread commits to complete
- Instruction set changes: fork, stop and setsp
- A Thread Predictor unit is added to handle speculative prediction
- A mode bit is added to the Thread Descriptor
- A TD Cache caches recently referenced descriptors
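To make the predict, dispatch, squash and re-dispatch cycle concrete, here is a toy software model. It assumes a simple last-outcome predictor, which the slides do not specify; the `ThreadDescriptor` fields beyond the mode bit and successor addresses, the `ThreadPredictor` policy and the `run` helper are all invented for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class ThreadDescriptor:
    start_pc: int
    implicit: bool                                   # the mode bit
    successors: list = field(default_factory=list)   # possible next start PCs

class ThreadPredictor:
    """Assumed policy: predict the successor that was taken last time."""
    def __init__(self):
        self.last = {}

    def predict(self, td):
        default = td.successors[0] if td.successors else None
        return self.last.get(td.start_pc, default)

    def update(self, td, actual_next):
        self.last[td.start_pc] = actual_next

def run(descriptors, actual_path):
    """Walk the true thread sequence, counting mispredicted dispatches."""
    predictor = ThreadPredictor()
    committed, squashes = [], 0
    for cur_pc, next_pc in zip(actual_path, actual_path[1:]):
        td = descriptors[cur_pc]
        if predictor.predict(td) != next_pc:
            squashes += 1          # wrong successor squashed, correct one re-dispatched
        predictor.update(td, next_pc)
        committed.append(cur_pc)   # threads commit in program order
    committed.append(actual_path[-1])
    return committed, squashes

descriptors = {
    0x100: ThreadDescriptor(0x100, implicit=True, successors=[0x100, 0x200]),
    0x200: ThreadDescriptor(0x200, implicit=True),
}
committed, squashes = run(descriptors, [0x100, 0x100, 0x100, 0x200])
```

In this run the loop-back successor is predicted correctly twice and the loop exit is mispredicted once, so one squash and re-dispatch occurs while commits still happen in program order.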
13
MUCS Protocol
- Multiplex Unified Coherence and Speculation (MUCS)
- Offers data coherence as well as versioning support
- Key design objectives: minimize speculation overheads in two respects
  - Dependence resolution in the common case should be handled within the cache, thereby minimizing bus transactions
  - Thread commits and squashes should be done en masse, not one cache block at a time
14
MUCS Protocol
15
State | Action | State bits affected | Mode
Speculative | 1. Load/read miss; 2. Fill cache with latest version of cache block as per program order; 3. Set use bit if load is executed before a store; 4. Clear commit bit; 5. Clear squash bit | use, commit, squash | implicit
Speculative | 1. Store/write miss; 2. Fill cache with latest version from L2, write and store; 3. Do not invalidate other caches; 4. Set dirty bit; 5. Set preceding cache stale bit; 6. Clear commit bit | dirty, stale, commit | implicit
Committed | 1. Commit thread; 2. Set commit bit en masse; 3. Clear use bit | commit, use | implicit
Squashed | 1. Squash thread; 2. Set squash bit en masse; 3. Clear use bit en masse | squash | implicit
16
MUCS Protocol
Six bits are used to monitor the state of each cache block:
- Use: set per speculative load executed before a store
- Dirty: set per speculative store in both modes
- Commit: set en masse on commit of speculative blocks
- Stale: set on a cache block when a newer version of the data is available in another CPU
- Squash: set en masse on cache blocks touched by a squashed thread
- Valid: set per cache fill upon misses in both modes to determine validity of the tag (not the data)
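The bit manipulations can be sketched in software for a single private cache. This is a simplified model under stated assumptions, not the hardware protocol: it covers one cache only, so the stale bit and bus transactions are not modelled; the "load before store" test is approximated by checking the dirty bit; and the `Cache` class with its method names is hypothetical.

```python
from enum import Flag, auto

class Bits(Flag):
    USE = auto()
    DIRTY = auto()
    COMMIT = auto()
    STALE = auto()    # declared for completeness; not modelled below
    SQUASH = auto()
    VALID = auto()

class Cache:
    def __init__(self):
        self.blocks = {}                   # block address -> Bits

    def spec_load(self, addr):
        bits = self.blocks.get(addr, Bits(0))
        bits |= Bits.VALID                 # fill on a miss
        if not bits & Bits.DIRTY:
            bits |= Bits.USE               # load executed before any store
        bits &= ~(Bits.COMMIT | Bits.SQUASH)
        self.blocks[addr] = bits

    def spec_store(self, addr):
        bits = self.blocks.get(addr, Bits(0))
        bits |= Bits.VALID | Bits.DIRTY    # fill on a miss, then write
        bits &= ~Bits.COMMIT
        self.blocks[addr] = bits

    def _speculative(self):
        # Blocks touched by the current thread and not yet committed.
        return [a for a, b in self.blocks.items()
                if b & (Bits.USE | Bits.DIRTY) and not b & Bits.COMMIT]

    def commit(self):
        for a in self._speculative():      # en masse, no per-block bus traffic
            self.blocks[a] = (self.blocks[a] | Bits.COMMIT) & ~Bits.USE

    def squash(self):
        for a in self._speculative():      # en masse
            self.blocks[a] = (self.blocks[a] | Bits.SQUASH) & ~Bits.USE

cache = Cache()
cache.spec_load(0x40)
cache.spec_store(0x80)
cache.commit()          # first thread commits
cache.spec_load(0x40)   # next thread reads the same block
cache.squash()          # ...and is then squashed
```

After the sequence above, block `0x40` carries valid and squash, while block `0x80` keeps valid, dirty and commit, because the squash only touches uncommitted speculative blocks.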
17
Key Performance Factors
- Thread size
- Load imbalance
- Data dependence
- Thread dispatch/completion overhead
- Speculative state overflow
18
Performance Analysis – System Info
19
Performance Analysis – Best Case
- Class 1 applications favor implicit-only CMPs
- Class 2 applications favor explicit-only CMPs
- Average speedup of a CMP with four dual-issue CPUs over one dual-issue CPU: implicit-only = 1.14, explicit-only = 2.17, Multiplex = 2.3
20
Performance Analysis – Overheads
- I: implicit-only, M: Multiplex
- fpppp: provably parallel code = 0%, low squash buffer hits
- wave5, tomcatv and swim have control-flow irregularities in the inner loop, i.e., I/O stalls
21
Performance Analysis – Cache Size
- Effect of increasing cache size: performance increases
- Multiplex incurs less overflow than the implicit-only CMP
- Effect of increasing data rates: performance decreases
22
Conclusion
- Letting implicit and explicit multithreading coexist yields a better speedup: 2.63 in simulation.
- The MUCS protocol makes this possible by mapping the coherence protocol needed for explicit threading onto a subset of the states required for implicit threading, which eliminates the need for extra hardware.
- The dominant overheads for implicit and explicit threading are speculative state overflow and thread dispatch, respectively.
23
Questions?
24
Thank you