Multiscalar processors
Gurindar S. Sohi, Scott E. Breach, T. N. Vijaykumar (University of Wisconsin-Madison)
Outline: Motivation, Multiscalar paradigm, Multiscalar architecture, Software and hardware support, Distribution of cycles, Results, Conclusion
Motivation: Current architectural techniques are reaching their limits.
The amount of ILP that can be extracted by a superscalar processor is limited (Kunle Olukotun, Stanford University).
Limits of ILP: The parallelism that can be extracted from a single program is very limited, about 4 or 5 in integer programs ("Limits of Instruction-Level Parallelism", David W. Wall, 1990).
Limitations of superscalar
Branch prediction accuracy limits ILP. Roughly every fifth instruction is a branch. Executing an instruction across 5 unresolved branches yields a useful result only about 60% of the time (with 90% branch prediction accuracy). Some branches are difficult to predict, so increasing the window size does not always mean executing useful instructions.
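The 60% figure follows from compounding the prediction accuracy across the five branches. A quick check of the arithmetic (plain Python, numbers taken from the slide):

```python
# With 90% branch prediction accuracy, an instruction beyond 5 unresolved
# branches is on the correct path only if all 5 predictions were right.
accuracy = 0.90
branches = 5
p_useful = accuracy ** branches       # 0.9^5
print(f"useful-path probability: {p_useful:.2f}")  # ~0.59
```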
Limitations of superscalar (contd.)
Large window size: Issuing more instructions per cycle needs a large window of instructions. Each cycle, the whole window must be searched to find instructions to issue, which increases the pipeline length. Issue complexity: To issue an instruction, dependence checks have to be performed against the other issuing instructions, so issuing n instructions has complexity n².
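A sketch of why issue logic grows as n²: every issuing instruction must be compared against every earlier instruction in the same group. The instruction encoding here, a (dest, src1, src2) tuple, is an assumption purely for illustration:

```python
def raw_dependence(producer, consumer):
    # RAW hazard: the consumer reads a register the producer writes.
    dest = producer[0]
    return dest in consumer[1:]

def count_checks(insts):
    # Pairwise checks over an issue group: n*(n-1)/2, i.e. O(n^2).
    checks = 0
    for i, later in enumerate(insts):
        for earlier in insts[:i]:
            checks += 1
            raw_dependence(earlier, later)
    return checks

group = [("r1", "r2", "r3"), ("r4", "r1", "r5"),
         ("r6", "r4", "r1"), ("r7", "r6", "r2")]
print(count_checks(group))  # 6 checks for 4 instructions
```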
Limitations of superscalar (contd.)
Load and store queue limitations: Loads and stores cannot be reordered before their addresses are known. One load or store waiting for its address can block the entire processor.
Superscalar limitation example
Consider the following hypothetical loop:
Iter 1: inst 1, inst 2, ..., inst n
Iter 2: ...
If the window size is less than n, a superscalar processor considers only one iteration at a time.
Possible improvement: execute iteration 1 and iteration 2 side by side, each running inst 1 through inst n.
Multiscalar paradigm: Divide the program (CFG) into multiple tasks (not necessarily parallel). Execute the tasks on different processing elements residing on the same die, so communication cost is low. Sequential semantics is preserved by hardware and software mechanisms. Tasks are typically re-executed if there are any violations.
Crossing the limits of superscalar
Branch prediction: Each thread executes independently. Each thread is still limited by branch prediction, but the number of useful instructions available is much larger than in a superscalar. Window size: Each processing element has its own window. The total size of the windows on a die can be very large, while each individual window remains of moderate size.
Crossing the limits of superscalar (contd.)
Issue complexity: Each processing element issues only a few instructions, which simplifies the logic. Loads and stores: Loads and stores can be executed without waiting for the previous thread's loads or stores.
Multiscalar architecture
A possible microarchitecture
Multiscalar execution
The sequencer walks over the CFG. According to hints inserted in the code, it assigns tasks to PEs, and the PEs execute the tasks in parallel. Sequential semantics is maintained for register dependencies and memory dependencies. Tasks are assigned in ring order and are committed in ring order.
Register Dependencies
Register dependencies can be identified easily by the compiler, and they are always synchronized. Registers that a task may write are recorded in a create mask. Reservations are created in successor tasks using the accum mask. If a reservation exists (the value has not yet arrived), an instruction reading that register waits.
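The masks can be pictured as bitmaps over the register file. This toy sketch (register numbers and encoding assumed for illustration, not the actual Multiscalar format) shows a reservation being created and then released when the value is forwarded:

```python
def make_mask(regs):
    # Build a bitmap with one bit per architectural register.
    m = 0
    for r in regs:
        m |= 1 << r
    return m

create_mask = make_mask([3, 5])   # this task may write r3 and r5
accum_mask = create_mask          # successor's pending reservations

def must_wait(reg):
    # A read stalls while the register's reservation bit is still set.
    return bool(accum_mask & (1 << reg))

accum_mask &= ~(1 << 3)           # predecessor forwards r3: clear its bit
print(must_wait(3), must_wait(5)) # False True
```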
Memory dependencies: Cannot be found statically.
Multiscalar takes an aggressive approach: always speculate. Loads do not wait for stores in predecessor tasks. Hardware checks for violations, and a task is re-executed if it violates any memory dependency.
Task commit: Speculative tasks are not allowed to modify memory.
Store values are buffered in hardware. When a processing element becomes the head, it retires its values into memory. To maintain sequential semantics, tasks retire in order, hence the ring arrangement of processing elements.
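A minimal sketch of the ring commit discipline, assuming a simple dict-based store buffer per PE (the structure here is illustrative, not the paper's hardware):

```python
from collections import deque

class PE:
    def __init__(self, task_id):
        self.task_id = task_id
        self.store_buffer = {}   # speculative stores: addr -> value
        self.done = False

memory = {}
ring = deque(PE(t) for t in range(4))   # tasks assigned in ring order
ring[0].store_buffer[0x100] = 42
ring[1].store_buffer[0x104] = 7
for pe in ring:
    pe.done = True                      # assume all tasks have finished

while ring:
    head = ring[0]
    if not head.done:
        break                           # head still running; later PEs wait
    memory.update(head.store_buffer)    # only the head retires into memory
    ring.popleft()

print(memory)
```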
Compiler support: Structure of the CFG. The sequencer needs information about tasks.
The compiler or an assembly-code analyzer marks the structure of the CFG (task boundaries), and the sequencer walks through this information.
Compiler support (contd.)
Communication information: The compiler gives the create mask as part of the task header, and sets the forward and stop bits. A register value is forwarded if its forward bit is set; a task is done when it sees a stop bit. The compiler also needs to give release information.
Hardware support: Speculative values must be buffered.
Memory dependence violations must be detected. When a speculative thread loads a value, its address is recorded in the ARB (Address Resolution Buffer). When a thread stores to a location, the ARB is checked to see whether a later thread has already loaded from that location. The speculative values are also buffered.
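The ARB check can be sketched as follows (a toy model with assumed fields; the real ARB is an associative hardware structure):

```python
class ARB:
    """Toy Address Resolution Buffer, structure assumed for illustration."""
    def __init__(self):
        self.loads = {}   # addr -> set of task ids that speculatively loaded

    def record_load(self, task, addr):
        self.loads.setdefault(addr, set()).add(task)

    def check_store(self, task, addr):
        # A store must squash any LATER task that already loaded this addr,
        # because that load speculatively read a stale value.
        return {t for t in self.loads.get(addr, set()) if t > task}

arb = ARB()
arb.record_load(task=2, addr=0x40)         # task 2 speculatively loads
print(arb.check_store(task=1, addr=0x40))  # {2}: task 2 must re-execute
print(arb.check_store(task=3, addr=0x40))  # set(): no violation
```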
Cycle distribution: Best scenario – all processing elements always do useful work – never happens in practice.
Possible wastage:
Non-useful computation: the task is squashed later due to an incorrect value or incorrect prediction.
No computation: the PE waits for some dependency to be resolved, or waits to commit its results.
Remains idle: no task assigned.
Non-useful computation
Synchronization of memory values: Squashes usually occur on global or static data values, and these dependencies are easy to predict. Explicit synchronization can be inserted to eliminate squashes due to these dependencies. Early validation of prediction: for example, the loop-exit test can be done at the beginning of the iteration.
No computation:
Intra-task dependences: these can be eliminated through a variety of hardware and software techniques.
Inter-task dependences: there is scope for scheduling to reduce the wait time.
Load balancing: tasks retire in order, so some tasks finish fast and wait a long time to become the head task.
Differences with other paradigms
A major improvement over superscalar. VLIW is limited by what static optimization can achieve. Very similar to a multiprocessor, but the communication cost is much lower, which enables fine-grained thread parallelism.
Methodology: A simulator that uses MIPS code, with a 5-stage pipeline.
The sequencer has a 1024-entry direct-mapped cache of task descriptors.
Results
Compress: long critical path.
Eqntott and cmppt: have parallel loops with good coverage.
Espresso: one loop has a load-balancing issue.
Sc: also has load imbalance.
Tomcatv: good parallel loops.
Cmp and wc: intra-task dependences.
Conclusion: The multiscalar paradigm has very good potential.
It tackles the major limits of superscalar, and there is lots of scope for compiler and hardware optimizations. The paper gives a good introduction to the paradigm and also discusses the major optimization opportunities.
Discussion
BREAK!