3.13. Fallacies and Pitfalls
Fallacy: Processors with lower CPIs will always be faster
Fallacy: Processors with faster clock rates will always be faster
– Balance must be found. E.g. a sophisticated pipeline: CPI ↓, clock cycle ↑
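A worked example with hypothetical numbers (mine, not from the slides) makes both fallacies concrete. For a fixed instruction count, CPU time = instruction count × CPI × clock cycle time, so the product CPI × cycle time is what matters:

    Machine A: CPI 1.0 at 1 GHz → 1.00 ns per instruction
    Machine B: CPI 1.6 at 2 GHz → 0.80 ns per instruction
    Machine C: CPI 2.5 at 2 GHz → 1.25 ns per instruction

B beats A despite its higher CPI; C loses to A despite its faster clock.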
Fallacies and Pitfalls
Pitfall: Emphasizing an improvement in CPI by increasing the issue rate, while sacrificing clock rate, can decrease performance
– Again, a question of balance: SuperSPARC vs HP PA 7100
– Complex interactions between cycle time and organisation
Fallacies and Pitfalls
Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement
– Amdahl's Law!
– Boosting performance of one area may uncover problems in another
Fallacies and Pitfalls
Pitfall: Sometimes bigger and dumber is better!
– Alpha 21264: sophisticated multilevel tournament branch predictor
– Alpha 21164: simple two-bit predictor
– The 21164 performs better for a transaction-processing application: it can handle twice as many local branch predictions
Concluding Remarks
Lots of open questions!
– Clock speed vs CPI
– Power issues
– Exploiting parallelism: ILP vs explicit parallelism
Characteristics of Modern (2001) Processors
Figure 3.61:
– 3–4-way superscalar
– 4–22 stage pipelines
– Branch prediction
– Register renaming (except UltraSPARC)
– 400 MHz – 1.7 GHz
– 7–130 million transistors
Chapter 4: Exploiting ILP with Software
4.1. Compiler Techniques for Exposing ILP
Compilers can improve the performance of simple pipelines:
– Reduce data hazards
– Reduce control hazards
Loop Unrolling
Compiler technique to increase ILP:
– Duplicate the loop body
– Decrease the number of iterations
Example:
– Basic code: 10 cycles per iteration
– Scheduled: 6 cycles

for (int k = 0; k < 1000; k++) {
    x[k] = x[k] + s;
}
Loop Unrolling
Original loop:

for (int k = 0; k < 1000; k++) {
    x[k] = x[k] + s;
}

Unrolled four times:

for (int k = 0; k < 1000; k += 4) {
    x[k]   = x[k]   + s;
    x[k+1] = x[k+1] + s;
    x[k+2] = x[k+2] + s;
    x[k+3] = x[k+3] + s;
}

Basic code: 7 cycles per “iteration”
Scheduled: 3.5 cycles (no stalls!)
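Here the trip count (1000) is an exact multiple of the unroll factor. A minimal sketch of how a compiler unrolls a loop with an arbitrary trip count n (n, x and s assumed defined as above; not from the slides):

int k;
for (k = 0; k + 3 < n; k += 4) {   /* unrolled main body */
    x[k]   = x[k]   + s;
    x[k+1] = x[k+1] + s;
    x[k+2] = x[k+2] + s;
    x[k+3] = x[k+3] + s;
}
for (; k < n; k++)                 /* cleanup: at most 3 leftover iterations */
    x[k] = x[k] + s;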
Loop Unrolling
Requires clever compilers:
– Analysing data dependences, name dependences and control dependences
Limitations:
– Code size
– Decrease in amortisation of overheads
– “Register pressure”
– Compiler limitations
Useful for any architecture
Superscalar Performance
Two-issue MIPS (integer + FP)
– 2.4 cycles per “iteration”
– Unrolled five times
4.2. Static Branch Prediction
Useful:
– where behaviour can be predicted at compile time
– to assist dynamic prediction
Architectural support:
– Delayed branches
Static Branch Prediction
Simple: predict taken
– Average misprediction rate of 34% (SPEC)
– Range: 9% – 59%
Better: predict backward taken, forward not-taken
– Actually worse for SPEC!
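The backward-taken heuristic works because backward branches are overwhelmingly loop back-edges. A sketch of mine (x and s as in the earlier examples):

/* The loop-closing branch jumps backwards and is taken 999 times out
   of 1000, so "predict backward taken" is almost always right here.
   Forward branches (error checks, rare cases) are far less regular,
   which is how the combined heuristic can still lose on SPEC. */
int i = 0;
do {
    x[i] = x[i] + s;
    i++;
} while (i < 1000);   /* compiles to a backward conditional branch */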
Static Branch Prediction
Advanced compiler analysis can do better
Profiling is very useful; misprediction rates with profile-based prediction:
– FP: 9% ± 4%
– Int: 15% ± 5%
4.3. Static Multiple Issue: VLIW
Compiler groups instructions into “packets”, checking for dependences:
– Remove dependences
– Flag dependences
Simplifies hardware
VLIW
First machines used a wide instruction with multiple operations per instruction
– Hence Very Long Instruction Word (VLIW)
– 64–128 bits
Alternative: group several instructions into an issue packet
VLIW Architectures
Multiple functional units
Compiler selects instructions for each unit to create one long instruction / an issue packet
Example: five operations
– Integer/branch, 2 × FP, 2 × memory access
Need lots of parallelism
– Use loop unrolling or global scheduling
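One way to picture the five-operation packet is as a record with one field per functional unit. This is a hypothetical layout of mine, not an actual machine encoding:

#include <stdint.h>

/* One VLIW issue packet.  The compiler fills each slot with an
   independent operation for the corresponding functional unit;
   any slot it cannot fill is left as an explicit no-op. */
struct vliw_packet {
    uint32_t int_or_branch;   /* integer ALU / branch unit */
    uint32_t fp[2];           /* two floating-point units  */
    uint32_t mem[2];          /* two memory-access units   */
};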
Example
Loop unrolled seven times!
– 1.29 cycles per result
– 60% of the available instruction “slots” filled

for (int k = 0; k < 1000; k++) {
    x[k] = x[k] + s;
}
Summary of Improvements

Technique           Unscheduled   Scheduled
Basic code          10            6
Loop unrolled (4)   7             3.5
Superscalar (5)     –             2.4
VLIW (7)            –             1.29

(Cycles per original iteration; the number in parentheses is the unroll factor.)
Drawbacks of Original VLIWs
Large code size
– Need to use loop unrolling
– Wasted space for unused slots
– Mitigated by clever encoding techniques and compression
Lock-step execution
– Stalling one unit stalls them all
Binary code compatibility
– Variations in structure required recompilation
4.4. Compiler Support for Exploiting ILP
We will not cover this section in detail
Loop unrolling
– Loop-carried dependences
Software pipelining
– Interleave instructions from different iterations
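A minimal source-level sketch of software pipelining (my illustration, reusing the earlier loop and assuming x is an array of double): the store that finishes iteration k-1 is overlapped with the load and add that start iteration k:

/* Software-pipelined x[k] = x[k] + s */
double t = x[0] + s;             /* prologue: start iteration 0 */
for (int k = 1; k < 1000; k++) {
    x[k-1] = t;                  /* finish iteration k-1 */
    t = x[k] + s;                /* start iteration k    */
}
x[999] = t;                      /* epilogue: finish the last iteration */

No single iteration now needs its load, add and store to complete back-to-back, which is exactly the latency chain the hardware would otherwise stall on.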
4.5. Hardware Support for Extracting More Parallelism
Techniques like loop unrolling work well when branch behaviour can be predicted at compile time
If not, we need more advanced techniques:
– Conditional instructions
– Hardware support for compiler speculation
Conditional or Predicated Instructions
Instructions have an associated condition:
– If the condition is true, execution proceeds normally
– If not, the instruction becomes a no-op
Removes control hazards

Source:
    if (a == 0) b = c;

With a branch (a in %r8, b in %r1, c in %r2; destination register first):
    bnez %r8, L1
    nop
    mov  %r1, %r2
L1: ...

With a conditional move:
    cmovz %r1, %r2, %r8    ; b = c if a == 0
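At the source level this transformation is known as if-conversion. A small sketch of mine (not from the slides) of code a compiler would typically turn into a conditional move:

/* Typically compiled to a compare plus a conditional move rather than
   a branch, so there is nothing for the branch predictor to miss. */
int min(int a, int b) {
    return (a < b) ? a : b;
}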
Conditional Instructions
Control hazards are effectively replaced by data hazards
Can be used for speculation
– Compiler reorders instructions based on the likely outcome of branches
Limitations on Conditional Instructions
Annulled instructions still execute
– But may occupy otherwise stalled time
Most useful when conditions are evaluated early
Limited usefulness for complex conditions
May be slower than unconditional operations
Conditional Instructions in Practice

Machine              Conditional Instructions
MIPS, Alpha, SPARC   Conditional move
HP PA                Any register-register instruction can annul the following instruction
IA-64                Full predication