
CS 7810

Paper critiques and class participation: 25%
Final exam: 25%
Project: Simplescalar (?) modeling, simulation, and analysis: 50%
Read and think about the papers before class!

Superscalar Pipeline

[Block diagram: I-Cache, PC, BPred, BTB, IFQ, Rename Table, ROB (with checkpoints), Issue queue (op, in1, in2, out), FUs, Regfile, LSQ, D-Cache]

Rename

Before renaming:             After renaming:
A  lr1 ← lr2 + lr3           A  pr7  ← pr2 + pr3
B  lr2 ← lr4 + lr5           B  pr8  ← pr4 + pr5
C  lr6 ← lr1 + lr3           C  pr9  ← pr7 + pr3
D  lr6 ← lr1 + lr2           D  pr10 ← pr7 + pr8

RAR lr3   RAW lr1            RAR pr3   RAW pr7
WAR lr2   WAW lr6            WAR none  WAW none
Schedule: A ; B C ; D        Schedule: A B ; C D
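The mapping above can be reproduced with a small sketch (the free-list ordering and the initial lr-to-pr mapping are illustrative assumptions, not from the slide):

```python
# Sketch of register renaming: every destination gets a fresh physical
# register, which removes the WAR and WAW dependences seen before renaming.

free_list = [f"pr{i}" for i in range(7, 32)]               # free physical regs (size illustrative)
rename_table = {f"lr{i}": f"pr{i}" for i in range(1, 7)}   # assumed initial mapping lr_i -> pr_i

def rename(dest, src1, src2):
    # Sources read the current mapping; the destination gets a new register.
    s1, s2 = rename_table[src1], rename_table[src2]
    d = free_list.pop(0)
    rename_table[dest] = d
    return d, s1, s2

# The four instructions A-D from the slide, as (dest, src1, src2):
program = [("lr1", "lr2", "lr3"),
           ("lr2", "lr4", "lr5"),
           ("lr6", "lr1", "lr3"),
           ("lr6", "lr1", "lr2")]
renamed = [rename(*instr) for instr in program]
# A -> pr7 = pr2 + pr3, B -> pr8 = pr4 + pr5,
# C -> pr9 = pr7 + pr3, D -> pr10 = pr7 + pr8
```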

Resolving Branches

Before renaming:             After renaming:
A: lr1 ← lr2 + lr3           A: pr6  ← pr2 + pr3
B: lr2 ← lr1 + lr4           B: pr7  ← pr6 + pr4
C: lr1 ← lr4 + lr5           C: pr8  ← pr4 + pr5
E: lr1 ← lr2 + lr3           E: pr10 ← pr7 + pr3
D: lr2 ← lr1 + lr5           D: pr9  ← pr8 + pr5


Resolving Exceptions

A  lr1 ← lr2 + lr3           pr6 ← pr2 + pr3
B  lr2 ← lr1 + lr4           pr7 ← pr6 + pr4
   br                        br
C  lr1 ← lr2 + lr3           pr8 ← pr7 + pr3
D  lr2 ← lr1 + lr5           pr9 ← pr8 + pr5

ROB (new mapping, previous mapping for the same logical register):
A  pr6  pr1
B  pr7  pr2
br
C  pr8  pr6
D  pr9  pr7

LSQ

Entries in program order (the data values did not survive the transcript):

Ld/St   Address   Data   Completed
Store   Unknown
Load    x
Store   x
Load    x
Load    x

Build-up across the slides:
- Initially the first store's address is unknown, so nothing is completed.
- The second store's address (x) is known, so it completes.
- The load behind it (to address x) completes next, getting its data from that store.
- The first store's address resolves to x and it completes; the completed load can now commit.
- Finally, all entries are completed.
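One conservative version of the LSQ check can be sketched as follows (the entry layout and addresses are illustrative assumptions; the slides also show a load getting its data forwarded from a matching later store before all earlier addresses resolve):

```python
# Sketch of LSQ load disambiguation: under a conservative policy, a load
# may issue only once every earlier store address is known, so it cannot
# miss a true RAW conflict through memory.

def can_load_issue(lsq, load_index):
    """lsq: entries in program order; addr is None while still unknown."""
    for entry in lsq[:load_index]:
        if entry["op"] == "store" and entry["addr"] is None:
            return False   # an earlier store address is unknown: must wait
    return True

lsq = [{"op": "store", "addr": None},    # address still unknown
       {"op": "load",  "addr": 0x40},
       {"op": "store", "addr": 0x80},
       {"op": "load",  "addr": 0x80}]

blocked = can_load_issue(lsq, 1)      # False: must wait for the first store
lsq[0]["addr"] = 0x100                # the store's address resolves
unblocked = can_load_issue(lsq, 1)    # True: no conflict with 0x100
```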

Instruction Wake-Up

op    dest   src1   src2
add   p6     p1     p2
add   p7     p2     p6
sub   p8     p1     p2
mul   p9     p7     p8
add   p10    p1     p7

(The result tag p2 is broadcast to the issue queue, waking up the waiting source operands that match.)
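Wake-up can be sketched as a tag broadcast against the source fields above (the queue layout and the set of already-available tags are illustrative assumptions):

```python
# Sketch of issue-queue wake-up: a broadcast result tag marks matching
# source operands ready; an entry can issue once all its sources are ready.

queue = [("p6",  ["p1", "p2"]),
         ("p7",  ["p2", "p6"]),
         ("p8",  ["p1", "p2"]),
         ("p9",  ["p7", "p8"]),
         ("p10", ["p1", "p7"])]
ready_tags = {"p1"}          # assume p1 is already available

def broadcast(tag):
    ready_tags.add(tag)

def ready_to_issue():
    return [dest for dest, srcs in queue if all(s in ready_tags for s in srcs)]

broadcast("p2")              # the p2 result tag is broadcast
woken = ready_to_issue()     # ['p6', 'p8']: both of their sources are ready
```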

Paper I

Limits of Instruction-Level Parallelism
David W. Wall
WRL Research Report 93/6
Also appears in ASPLOS ’91

Goals of the Study

Under optimistic assumptions, you can find a very high degree of parallelism (1000!)
- What about parallelism under realistic assumptions?
- What are the bottlenecks? What contributes to parallelism?

Dependencies

- For registers and memory: true data dependence (RAW), anti dependence (WAR), output dependence (WAW)
- Control dependence
- Structural dependence

Perfect Scheduling

For a long loop:
- Read a[i] and b[i] from memory and store in registers
- Add the register values
- Store the result in memory at c[i]
The whole program should finish in 3 cycles!!
- Anti and output dependences: the assembly code keeps using lr1
- Control dependences: decision-making after each iteration
- Structural dependences: how many registers and cache ports do I have?

Impediments to Perfect Scheduling

- Register renaming
- Alias analysis
- Branch prediction
- Branch fanout
- Indirect jump prediction
- Window size and cycle width
- Latency

Register Renaming

lr1 ← …          pr22 ← …
…   ← lr1        …    ← pr22
lr1 ← …          pr24 ← …

If the compiler had infinite registers, you would not have WAR and WAW dependences.
The hardware can renumber every instruction and extract more parallelism.
Implemented models:
- None
- Finite registers
- Perfect (infinite registers – only RAW)

Alias Analysis

You have to respect RAW dependences for memory as well:
   store value to addrA
   load from addrA
The problem is that you do not know the address at compile time, or even during instruction dispatch.

Alias Analysis

Policies:
- Perfect: you magically know all addresses and only delay loads that conflict with earlier stores
- None: until a store address is known, you stall every subsequent load
- Analysis by compiler: (addr) does not conflict with (addr+4); global and stack data are allocated by the compiler, hence conflicts can be detected; accesses to the heap can conflict with each other
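The difference between the Perfect and None policies can be illustrated on a tiny trace (the addresses and the trace itself are made up for illustration):

```python
# Sketch: count stalled loads under two alias-analysis policies.
# "none": any earlier unresolved store blocks a load.
# "perfect": only an earlier store to the same address blocks it.

trace = [("store", 0x100), ("load", 0x200), ("load", 0x100)]

def stalled_loads(policy):
    pending_stores = []
    count = 0
    for op, addr in trace:
        if op == "store":
            pending_stores.append(addr)      # treat the address as unresolved
        elif policy == "none" and pending_stores:
            count += 1                        # blocked by any earlier store
        elif policy == "perfect" and addr in pending_stores:
            count += 1                        # blocked only by a true match
    return count

none_stalls = stalled_loads("none")         # 2: both loads wait
perfect_stalls = stalled_loads("perfect")   # 1: only the load from 0x100 waits
```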

Global, Stack, and Heap

main()
   int a, b;                ← global data
   call func();

func()
   int c, d;                ← stack data
   int *e;                  ← e, f are stack data
   int *f;
   e = (int *)malloc(8);    ← e, f point to heap data
   f = (int *)malloc(8);
   …
   *e = c;                  ← store c into the addr stored in e
   d = *f;                  ← read the value at the addr stored in f

This is a conflict if you had previously done e = e + 8

Branch Prediction

- If you go the wrong way, you are not extracting useful parallelism
- You can predict the branch direction statically or dynamically
- You can execute along both directions and throw away some of the work (needs more resources)

Dynamic Branch Prediction

- Tables of 2-bit counters that get biased towards taken or not-taken
- Can use history (for each branch, or global)
- Can have multiple predictors and dynamically pick the more promising one
- Much more in a few weeks…
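A minimal version of such a predictor, assuming a simple PC-indexed table (the table size and indexing scheme are illustrative):

```python
# Sketch of a table of 2-bit saturating counters: values 0-3, predict
# taken when the counter is 2 or 3; each outcome nudges the counter.

TABLE_SIZE = 1024
counters = [1] * TABLE_SIZE          # start weakly not-taken

def predict(pc):
    return counters[pc % TABLE_SIZE] >= 2

def update(pc, taken):
    i = pc % TABLE_SIZE
    counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)

# A mostly-taken branch biases its counter toward taken:
pc = 0x400
for outcome in [True, True, False, True, True]:
    update(pc, outcome)
taken_pred = predict(pc)             # True: the counter has saturated at 3
```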

Static Branch Prediction

- Profile the application and provide hints to the hardware
- Dynamic predictors are much better

Branch Fanout

- Execute both directions of the branch – an exponential growth in resource requirements
- Hence, do this until you encounter four branches, after which you employ dynamic branch prediction
- Better still, execute both directions only if the prediction confidence is low
- Not commonly used in today’s processors

Indirect Jumps

Indirect jumps do not encode the target in the instruction – the target has to be computed.
The address can be predicted by:
- using a table to store the last target
- using a stack to keep track of subroutine calls and returns (the most common indirect jump)
The combination achieves 95% prediction rates.
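The two mechanisms can be sketched together (the structure sizes and the PCs below are illustrative assumptions):

```python
# Sketch: a last-target table for general indirect jumps, and a return
# address stack (RAS) for predicting returns from subroutine calls.

last_target = {}     # indirect-jump PC -> last observed target
ras = []             # pushed on call, popped to predict the return

def predict_indirect(pc, actual_target):
    """Predict using the last observed target, then train; True on a hit."""
    hit = last_target.get(pc) == actual_target
    last_target[pc] = actual_target
    return hit

def on_call(return_pc):
    ras.append(return_pc)

def predict_return():
    return ras.pop() if ras else None

miss = predict_indirect(0x3000, 0x4000)   # False: no history yet
hit = predict_indirect(0x3000, 0x4000)    # True: same target as last time

on_call(0x1004)       # outer call pushes its return address
on_call(0x2008)       # nested call
r1 = predict_return() # 0x2008: the nested call returns first
r2 = predict_return() # 0x1004
```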

Latency

- In their study, every instruction has unit latency – a highly questionable assumption today!
- They also model other “realistic” latencies
- Parallelism is defined as (cycles for sequential execution) / (cycles for superscalar execution), not as instructions / cycles
- Hence, increasing instruction latency can increase parallelism – not true for IPC
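The distinction matters because the two metrics can move in opposite directions; a worked example with invented numbers:

```python
# Illustrative arithmetic (numbers are made up): Wall's "parallelism" is
# sequential cycles / superscalar cycles, which is not the same as IPC.

instructions = 1000
latency = 2                               # cycles per instruction
seq_cycles = instructions * latency       # 2000 cycles executed sequentially
par_cycles = 100                          # superscalar schedule length

parallelism = seq_cycles / par_cycles     # 20.0
ipc = instructions / par_cycles           # 10.0
# Doubling the latency doubles seq_cycles; if the schedule length grows
# more slowly, measured "parallelism" rises even though IPC does not.
```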

Window Size & Cycle Width

- 8 available slots in each cycle
- Window of 2048 instructions

Window Size & Cycle Width

- Discrete windows: grab 2048 instructions, schedule them, retire all cycles, grab the next window
- Continuous windows: grab 2048 instructions, schedule them, retire the oldest cycle, grab a few more instructions
- Window size and register renaming are not related

Simulated Models

- Seven models, varying control, register, and memory dependences
- Today’s processors: ?
- However, note the optimistic scheduling, 2048-instruction window, cycle width of 64, and 1-cycle latencies
- Benchmarks: SPEC’92, utility programs (grep, sed, yacc), CAD tools


Aggressive Models

- Parallelism steadily increases as we move to aggressive models (Fig. 12, pg. 16)
- Branch fanout does not buy much
- IPC of the Great model: 10; reality: 1.5
- Numeric programs can do much better


Cycle Width and Window Size

- Unlimited cycle width buys very little (much less than 10%) (Figure 15)
- Decreasing the window size seems to have little effect as well (you need only 256?! – are registers the bottleneck?) (Figure 16)
- Unlimited window size and cycle widths don’t help (Figure 18)
- Would these results hold true today?

Memory Latencies

- The ability to prefetch has a huge impact on IPC – to hide a 300-cycle latency, you have to spot the instruction very early
- Hence, registers and window size are extremely important today!

Branch Prediction

- Obviously, better prediction helps (Fig. 22)
- Fanout does not help much (Fig. 24-b) – not selecting the right branches?
- Luckily, small tables are good enough for good indirect jump prediction
- The mispredict penalty has a major impact on ILP (Fig. 30)

Alias Analysis

- Has a big impact on performance – compiler analysis results in a two-fold speed-up
- Later, we’ll read a paper that attempts this in hardware (Chrysos ’98)

Instruction Latency

- Parallelism is almost unaffected by increased latency (it increases marginally in some cases!)
- Note the “unconventional” definition of parallelism
- Today, latency strongly influences IPC

Conclusions of Wall’s Study

- Branch prediction, alias analysis, and the mispredict penalty are huge bottlenecks
- Instruction latency, registers, window size, and cycle width are not huge bottlenecks
- Today, they are all huge bottlenecks because they all influence effective memory latency… which is the biggest bottleneck

Questions

- Weaknesses: caches, register model, value prediction
- Will most of the available IPC (IPC = 10 for the superb model) go away with realistic latencies?
- What stops us from building the following: 400Kb bpred, cache hierarchies, 512-entry window/regfile, 16 ALUs, memory dependence predictor?
- The following may need a re-evaluation: effect of window size, branch fan-out

Next Week’s Paper

- “Complexity-Effective Superscalar Processors”, Palacharla, Jouppi, Smith, ISCA ’97
- The impact of increased issue width and window size on clock speed
