1
Multiscalar Processors
Presented by Matthew Misler
Gurindar S. Sohi, Scott E. Breach, T. N. Vijaykumar
University of Wisconsin-Madison
ISCA '95
2
Scalar Processors
Instruction Queue, Execution Unit
  addu $20, $20, 16
  ld $23, SYMVAL-16($20)
  move $17, $21
  beq $17, $0, SKIPINNER
  ld $8, LELE($17)
3
Superscalar Processors
Instruction Queue, Execution Unit
  addu $20, $20, 16
  ld $23, SYMVAL-16($20)
  move $17, $21
  beq $17, $0, SKIPINNER
  ld $8, LELE($17)
4
The fetch-execute paradigm has been around for about 60 years
Superscalar processors execute instructions out of order
  Sometimes the re-ordering is done in hardware
  Sometimes in software
  Sometimes both
Partial ordering
5
Control Flow Graphs
Segments are split on control dependencies (conditional branches)
6
Sequential “Walk”
Walk through the CFG with enough parallelism
Use speculative execution and branch prediction to raise the level of parallelism
Sequential semantics must be preserved
  Can still execute out of order, but must commit in order
7
Multiscalars and Tasks
CFG is broken down into tasks
Multiscalars step through the CFG at the task level
  No inspection of instructions within a task
Each task is assigned to one 'processing unit'
Multiple tasks can execute in parallel
8
Multiscalar Microarchitecture
Sequencer
Queue of processing units
  Unidirectional ring
  Each has an instruction cache, processing element, register file
Interconnect
Data banks
  Each has an address resolution buffer and a data cache
9
Multiscalar Microarchitecture
10
Outline
  Multiscalar Microarchitecture
  Tasks
  Multiscalars in-depth
  Distribution of cycles
  Comparison to other paradigms
  Performance
  Conclusion
11
Tasks
Sequencer distributes a task to a processing unit
Unit fetches and executes the task until completion
Instructions in the window are bounded
  By the first instruction in the earliest executing task
  By the last instruction in the latest executing task
12
Tasks
Sequencer distributes a task to a processing unit
Unit fetches and executes the task until completion
The instruction window is bounded by
  The first instruction in the earliest executing task
  The last instruction in the latest executing task
So? Instruction windows can be huge
13
Tasks Example
[Figure: CFG with basic blocks A, B, C, D, E; task path A B C B B C D]
14
Tasks Example
[Figure: CFG with basic blocks A, B, C, D, E; task paths A B C B B C D, A B B C D, A B C B C D E]
15
Tasks
Hold true to sequential semantics inside each block
Enforce sequential order overall on tasks
  The circular queue takes care of this part
In the previous example:
  Head of the queue does ABCBBCD
  Middle unit does ABBCD
  Tail of the queue does ABCBCDE
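The head/tail discipline of the circular queue can be sketched with a toy simulation. This is only an illustration under stated assumptions, not the paper's hardware: `simulate_ring` is a hypothetical name, each unit is assumed to finish one basic block per cycle, and a task's length is simply the number of blocks in its path.

```python
from collections import deque

def simulate_ring(task_paths, num_units):
    """Toy model: tasks run in parallel, but retire only from the head.

    Returns a list of (task_id, finish_cycle, retire_cycle) tuples.
    Assumption: every unit completes one basic block per cycle.
    """
    pending = deque(enumerate(task_paths))  # tasks not yet assigned
    ring = deque()                          # active tasks, head on the left
    retired = []
    cycle = 0
    while pending or ring:
        # The sequencer assigns new tasks at the tail while units are free.
        while pending and len(ring) < num_units:
            task_id, path = pending.popleft()
            ring.append({"id": task_id, "left": len(path), "finish": None})
        cycle += 1
        # Every active unit executes one block of its task this cycle.
        for task in ring:
            if task["left"] > 0:
                task["left"] -= 1
                if task["left"] == 0:
                    task["finish"] = cycle
        # Only a completed head task may retire, preserving task order.
        while ring and ring[0]["left"] == 0:
            task = ring.popleft()
            retired.append((task["id"], task["finish"], cycle))
    return retired

# The three task paths from the example slide, on three units.
result = simulate_ring(["ABCBBCD", "ABBCD", "ABCBCDE"], num_units=3)
```

In this run the middle task finishes early but cannot retire until the head task is done, which is the sequential-order guarantee the ring provides.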
16
Tasks
Registers
  Create mask
    May produce values for a future task
    Forward values down the ring
  Accum mask
    Union of the create masks of active tasks
Memory
  If it's a known producer-consumer, then synchronize on loads and stores
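The create and accum masks can be illustrated as plain bit masks, one bit per register. The helper names below (`mask_of`, `accum_mask`, `must_wait`) are hypothetical, and the rule that a task waits for any register set in the accum mask of its predecessors is a simplifying assumption about how the masks are used.

```python
def mask_of(registers):
    """Build a bit mask with one bit set per register number."""
    mask = 0
    for reg in registers:
        mask |= 1 << reg
    return mask

def accum_mask(create_masks):
    """Union of the create masks of the active predecessor tasks."""
    acc = 0
    for mask in create_masks:
        acc |= mask
    return acc

def must_wait(reg, predecessor_create_masks):
    """Assumed rule: wait if an active predecessor may still produce reg."""
    return bool(accum_mask(predecessor_create_masks) & (1 << reg))

# Register numbers borrowed from the instruction example ($20, $23, $17, $8).
active = [mask_of([20, 23]), mask_of([17])]
waits_on_20 = must_wait(20, active)   # an earlier task may produce $20
waits_on_8 = must_wait(8, active)     # no active task creates $8
```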
17
Tasks: Memory (cont'd)
Unknown producer-consumer relationship
  Conservative approach: wait
  Aggressive approach: speculate
Conservative approach means sequential operation
Aggressive approach requires dynamic checking, squashing and recovery
18
Outline
  Multiscalar basics
  Tasks
  Multiscalars in-depth
  Distribution of cycles
  Comparison to other paradigms
  Performance
  Conclusion
19
Multiscalar Programs
Code for the tasks
  Small changes to the existing ISA
  Add specification of tasks
  No major overhaul
Structure of the CFG and tasks
Communications between tasks
20
Control Flow Graph Structure
Successors Task descriptor Producing and consuming values Forward register information on last update Compiler can mark instructions: operate and forward Stopping conditions Special condition, evaluate conditions, complete All of these can be viewed as tag bits
21
Multiscalar Hardware
Walks through the CFG
Assigns tasks to processing units
Executes tasks in a 'sequential' order
Sequencer fetches the task descriptors
  Using the address of the first instruction
  Specifying the create masks
  Constructing the accum mask
Using the task descriptor, predict the successor
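A task descriptor and the sequencer's task-level walk might look like the following sketch. The field names, the `walk_tasks` helper, and the trivial "always pick a fixed successor" predictors are all assumptions for illustration; the slides do not specify the descriptor layout or the prediction scheme.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class TaskDescriptor:
    start: int                   # address of the task's first instruction
    create_mask: int             # registers the task may produce
    successors: Tuple[int, ...]  # possible start addresses of the next task

def walk_tasks(table: Dict[int, TaskDescriptor],
               start: int,
               predict: Callable[[TaskDescriptor], int],
               max_tasks: int) -> List[int]:
    """Follow predicted successors, one whole task per step."""
    order = []
    addr = start
    for _ in range(max_tasks):
        desc = table[addr]       # sequencer fetches the descriptor
        order.append(addr)       # task would be assigned to a unit here
        if not desc.successors:  # no successor: end of the walk
            break
        addr = predict(desc)     # predict the next task from the descriptor
    return order

# Hypothetical two-task program: a loop body that either repeats or exits.
LOOP, EXIT = 0x100, 0x200
table = {
    LOOP: TaskDescriptor(LOOP, 0b1, (LOOP, EXIT)),
    EXIT: TaskDescriptor(EXIT, 0, ()),
}
predicted_loop = walk_tasks(table, LOOP, lambda d: d.successors[0], 3)
predicted_exit = walk_tasks(table, LOOP, lambda d: d.successors[-1], 3)
```

Note that the walk never looks at the instructions inside a task; only the descriptor is consulted.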
22
Multiscalar Hardware
Data banks
  Updates to the cache are not speculative
  Use of the Address Resolution Buffer (ARB)
    Detects violation of dependencies
    Initiates corrective actions
    If it runs out of space, squash tasks
      Not the head of the queue; it doesn't use the ARB
      Can stall rather than squash
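The dependence check can be sketched with a toy buffer that remembers which tasks have loaded each address. `ToyARB` is a hypothetical, heavily simplified stand-in: it assumes task ids increase in sequential order, and it ignores store-to-load forwarding, capacity limits and recovery.

```python
class ToyARB:
    """Toy address resolution buffer: tracks speculative loads per address."""

    def __init__(self):
        self.loads = {}  # address -> set of task ids that loaded it

    def load(self, task, addr):
        """Record that `task` speculatively loaded from `addr`."""
        self.loads.setdefault(addr, set()).add(task)

    def store(self, task, addr):
        """Return later tasks that already loaded `addr`.

        Those tasks read a stale value, so under the aggressive approach
        they (and their successors) would have to be squashed.
        """
        return sorted(t for t in self.loads.get(addr, ()) if t > task)

arb = ToyARB()
arb.load(task=2, addr=0x100)              # a later task loads early
violators = arb.store(task=1, addr=0x100) # an earlier task stores afterwards
no_violation = arb.store(task=3, addr=0x100)
```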
23
Multiscalar Hardware
Remember the earlier architectural picture?
24
Multiscalar Hardware
It's not the only possible architecture
  Possible design with shared functional units
  Possible design with ARB and data cache on the same side as the processing units
Scaling the interconnect is non-trivial
  Glossed over: page 5, last paragraph
25
Outline
  Multiscalar Basics
  Tasks
  Multiscalars In-Depth
  Distribution of Cycles
  Comparison to Other Paradigms
  Performance
  Conclusion
26
Distribution of Cycles
Wasted cycles:
  Non-useful computation: squashed
  No computation: waiting
  Remains idle: no assigned task
27
Distribution of Cycles
Non-useful computation cycles
  Determine useless computation early
  Validate prediction early
    Check if the next task is predicted correctly
    E.g. test for loop exit at the start of the loop
  Tasks violating sequentiality are squashed
    To avoid, try to synchronize memory communication with register communication
    Could delay the load for a number of cycles
    Can use signal-wait synchronization
28
Distribution of Cycles
No computation cycles (contrast with no assigned task)
  Dependencies within the same task
  Dependencies between tasks (earlier/later)
  Load balancing
29
Outline
  Multiscalar Basics
  Tasks
  Multiscalars In-Depth
  Distribution of Cycles
  Comparison to Other Paradigms
  Performance
  Conclusion
30
Comparison to Other Paradigms
Branch prediction
  Sequencer only needs to predict branches across tasks
Wide instruction window
  Must check which instructions are ready for issue; in multiscalar, relatively few need to be inspected
31
Comparison to Other Paradigms
Issue logic
  Superscalar processors have n^2 logic
  Multiscalar logic is distributed; each processing unit issues instructions independently
Loads and stores
  Normally sequence numbers are used for managing the buffers
  In multiscalar, the loads and stores are independent
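The quadratic issue-logic point can be made concrete with a little arithmetic. The cost model here is an assumption for illustration only: a centralized window of width n is charged n squared comparisons, and each of k independent units is charged the square of its own share n/k. The function names are hypothetical.

```python
def centralized_cost(width):
    """Assumed cost model: issue logic grows with the square of the width."""
    return width * width

def distributed_cost(width, units):
    """Same total width split evenly across independent processing units."""
    per_unit = width // units
    return units * per_unit * per_unit

total_width = 8
single_window = centralized_cost(total_width)        # one 8-wide window
four_units = distributed_cost(total_width, units=4)  # four 2-wide units
```

Under this model, splitting the same total width across k units cuts the cost by a factor of k, since k * (n/k)^2 = n^2 / k.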
32
Comparison to Other Paradigms
Superscalar processors need to discover the CFG as they decode branches
  Multiscalar only requires the compiler to split code into tasks
Multiprocessors require all dependences to be known or conservatively provided for
  If the compiler can prove code sections independent, they can be executed in parallel
33
Outline
  Multiscalar Basics
  Tasks
  Multiscalars In-Depth
  Distribution of Cycles
  Comparison to Other Paradigms
  Performance
  Conclusion
34
Performance
Simulated 5-stage pipeline
Functional unit latency
35
Performance
Memory
  Non-blocking loads and stores
  10 cycle latency for the first 4 words
  1 cycle for each additional 4 words
Instruction cache: 1 cycle for 4 words
  10+3 cycles for a miss
Data cache: 1 word per cycle
  Multiscalar: 10+3 cycles + bus contention for a miss
1024-entry cache of task descriptors
36
Performance
Instruction count: +12.2% on average
37
Performance – In-Order
38
Performance – Out-of-Order
39
Performance – Summary
Most of the benchmarks achieve speedup
  E.g. an average of [value missing] in the 1-way in-order 4-unit multiscalar
  Worst case 0.86 speedup (slowdown)
Many squashes from prediction and memory order in Gcc and Xlisp
  Leads to almost sequential execution
Keep in mind the 12.2% increase in IC
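The effect of the 12.2% instruction-count increase can be worked through with the usual execution-time relation (time = IC / IPC x cycle time). This sketch assumes equal cycle times for the scalar and multiscalar designs and defines speedup as the ratio of execution times; the slides do not state how the reported speedups were computed, and the function names are hypothetical.

```python
IC_OVERHEAD = 0.122  # 12.2% more instructions, from the summary slide

def speedup(ipc_ratio, ic_overhead=IC_OVERHEAD):
    """Execution-time speedup given the multiscalar/scalar IPC ratio.

    Assumes equal cycle time, so
    speedup = (IPC_multi / IPC_scalar) / (IC_multi / IC_scalar).
    """
    return ipc_ratio / (1.0 + ic_overhead)

def breakeven_ipc_ratio(ic_overhead=IC_OVERHEAD):
    """IPC ratio at which the extra instructions are exactly paid for."""
    return 1.0 + ic_overhead

same_ipc = speedup(1.0)                        # extra work, no extra throughput
at_breakeven = speedup(breakeven_ipc_ratio())  # exactly 1.0
```

Under these assumptions, a multiscalar configuration has to sustain more than 1.122 times the scalar IPC before any net speedup appears.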
40
Outline
  Multiscalar Basics
  Tasks
  Multiscalars In-Depth
  Distribution of Cycles
  Comparison to Other Paradigms
  Performance
  Conclusion
41
Conclusion
Divide the CFG into tasks
Assign tasks to processing units
Walk the CFG in task-size steps
Shows performance gains