Published by Caroline Hartmann. Modified over 6 years ago.
1
Last lecture
  Some misc. stuff
  An older real processor
  Class review/overview
2
Misc.
  Status issues
    HW5
      Answers posted
      Returned on Wednesday (next week)
    Project presentation signup at
      Locations are all over (see link)
      Need to be there for the whole time slot.
  Exam review etc.
    Q&A session 1-2:45pm on Wednesday 2/23
    Office hours: see calendar.
3
Stuff still to do
  Oral report
    Don’t forget to be there for the whole hour
    PowerPoint or other slides
    Either bring portable or USB stick
  Written report
    Due 9pm Tuesday via .
  Exam 4/25
    1:30-3:30pm
    This room (1670 BBB)
4
AMD 64-bit core
  Most taken from http://www.chip-architect.com/
7
Bit-interleaved busses running “North-South”
10
Integer Decode/Dispatch
3 types of instructions
  Direct path
    RISC-like
  Vector path
    Broken into smaller instructions via microcode.
  Double
    128-bit instructions that can be broken into two independent 64-bit instructions (called Doubles).
    Others are done via microcode.
    Most 128-bit SSE and SSE2 instructions are made into doubles.
11
RS
  Each cycle an instruction is issued into one of 3 lanes.
  Each lane has
    8 RSs
    1 ALU
    1 AGU (Address Generation Unit)
  Each RS sees broadcasts from all ALUs, AGUs, L/S units etc.
12
Rename
  Break the physical register file into 2 parts (sort of like the P6 scheme with ARF/RoB)
    72 in-flight instructions are kept in the RoB
    The other structure is the IFFRF: Integer Future File and Register File
      16 registers of committed state
      16 “future registers”
      8 scratch-pad registers
13
Future file
  In the P6 scheme we had to look in 3 places for the data
    The PRF
    The RoB
    The CDB (later)
  Here we look in the FF, or in the CDB-like things later.
  The FF holds the speculative value if it is known.
    At execution complete, instructions check whether they were the last thing to dispatch that writes to a given architectural register (AR).
    This is done by tagging the FF with the RoB number.
    If they were the last to have that AR as a destination, they update the FF.
14
How do we use the FF?
  At issue we:
    Check the FF for source operands
    Reserve a spot in the RoB
    Place our tag (RoB number) in the FF
    Mark the FF entry as invalid
  At EX complete we:
    Send RoB number and data to the CDB
    Send data to the RoB
    Update FF if tag matches
  At retire
    Update ARF value (from RoB)
  At mispredict
    Copy ARF value into FF.
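The issue / EX complete / retire / mispredict steps above can be sketched as a small Python model. This is a minimal illustration and not the actual AMD implementation: the class name `FutureFileCore`, its methods, and the operand encoding (`("value", v)` or `("wait", tag)`) are all hypothetical, and the CDB broadcast is reduced to a direct method call.

```python
from collections import deque


class FutureFileCore:
    """Toy model of a future file (FF) + RoB + ARF. All names are hypothetical."""

    def __init__(self, num_regs=16):
        self.arf = [0] * num_regs            # committed architectural state
        self.ff_value = [0] * num_regs       # speculative value, if known
        self.ff_valid = [True] * num_regs    # False while the value is pending
        self.ff_tag = [None] * num_regs      # RoB number of the last writer
        self.rob = {}                        # tag -> {"dest": r, "value": v}
        self.order = deque()                 # tags in program order
        self.next_tag = 0

    def issue(self, dest, srcs):
        # 1) Check the FF for source operands.
        operands = []
        for s in srcs:
            if self.ff_valid[s]:
                operands.append(("value", self.ff_value[s]))
            else:
                operands.append(("wait", self.ff_tag[s]))
        # 2) Reserve a spot in the RoB.
        tag = self.next_tag
        self.next_tag += 1
        self.rob[tag] = {"dest": dest, "value": None}
        self.order.append(tag)
        # 3) Place our tag in the FF and 4) mark the entry invalid.
        self.ff_tag[dest] = tag
        self.ff_valid[dest] = False
        return tag, operands

    def complete(self, tag, value):
        # Send data to the RoB; update the FF only if we are the last writer.
        entry = self.rob[tag]
        entry["value"] = value
        dest = entry["dest"]
        if self.ff_tag[dest] == tag:
            self.ff_value[dest] = value
            self.ff_valid[dest] = True

    def retire(self):
        # Oldest instruction updates the ARF from its RoB entry.
        tag = self.order[0]
        if self.rob[tag]["value"] is None:
            raise RuntimeError("oldest instruction has not completed")
        self.order.popleft()
        entry = self.rob.pop(tag)
        self.arf[entry["dest"]] = entry["value"]

    def mispredict(self):
        # Squash everything in flight and copy the ARF into the FF.
        self.rob.clear()
        self.order.clear()
        self.ff_value = list(self.arf)
        self.ff_valid = [True] * len(self.arf)
        self.ff_tag = [None] * len(self.arf)
```

If two in-flight instructions write the same register, only the younger one's tag is in the FF, so the older one's result goes to the RoB (and later the ARF) without disturbing the speculative value.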
15
What did the FF buy us?
  P6-like advantages
    No free-list for PRF
    Can just clear the RAT on mis-predict.
  But no need to access the RoB looking for data
    RoB data only written once (EX complete) and only read once (Commit)
  Some pain
    Early branch resolution looks hard
16
ROB: An 8-bit descriptor for 72 entries
  Re-Order-Buffer tag definition (fields listed from bit 7 down to bit 0)
    wrap bit
    Instruction In Flight Number: re-order buffer index, then sub-index 0..2
  1) A sub-index 0, 1 or 2 which identifies from which of the three lanes the instruction was dispatched.
  2) A value that identifies the “cycle” in which the instruction was dispatched. The “cycle counter” wraps to 0 after reaching 23.
  3) A wrap bit. When two instructions have different wrap bits, the cycle counter has wrapped between the dispatches.
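A short Python sketch of how such an 8-bit tag could be packed, unpacked, and compared for age. The exact bit positions (wrap bit in bit 7, cycle counter in bits 6..2, sub-index in bits 1..0) are an assumption inferred from the field order on the slide, and treating lane 0 as the oldest within a cycle is also an assumption. The function names are hypothetical.

```python
def encode_tag(wrap, cycle, lane):
    """Pack (wrap bit, cycle counter 0..23, lane 0..2) into 8 bits (assumed layout)."""
    if wrap not in (0, 1) or not 0 <= cycle <= 23 or not 0 <= lane <= 2:
        raise ValueError("field out of range")
    return (wrap << 7) | (cycle << 2) | lane


def decode_tag(tag):
    """Unpack an 8-bit tag into (wrap, cycle, lane)."""
    return (tag >> 7) & 1, (tag >> 2) & 0x1F, tag & 0x3


def rob_slot(tag):
    """Map a tag to one of the 72 RoB entries (24 cycles x 3 lanes)."""
    _, cycle, lane = decode_tag(tag)
    return cycle * 3 + lane


def is_older(tag_a, tag_b):
    """True if tag_a was dispatched before tag_b.

    Same wrap bit: the smaller (cycle, lane) is older.
    Different wrap bits: the counter wrapped in between, so the
    larger cycle value belongs to the older instruction.
    """
    wrap_a, cycle_a, lane_a = decode_tag(tag_a)
    wrap_b, cycle_b, lane_b = decode_tag(tag_b)
    if wrap_a == wrap_b:
        return (cycle_a, lane_a) < (cycle_b, lane_b)
    return (cycle_a, lane_a) > (cycle_b, lane_b)
```

The age comparison relies on at most 24 dispatch cycles being in flight at once, which is what lets a single wrap bit disambiguate order.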
17
More on the RoB
  What is basically happening is that we have three RoBs
    Each one has size 24
    We cycle through each one so that none gets ahead of the others.
    Reduces read/write ports!
18
Mispredictions
  It looks like they wait until retirement to resolve all exceptions.
    Mispredictions are treated as exceptions!
    They just clear everything and have the retired registers overwrite the speculative ones in the IFFRF
19
More details.
  Each x86 instruction can launch both an ALU and an AGU operation
    Because x86 has lots of memory operations this makes sense.
  ALUs broadcast the result tag one cycle early
    So an RS can launch the dependent instruction to the ALU before the data arrives.
20
Lane 8
21
Class summary
  Major topics
    ILP in hardware (Out-of-order processors)
      How they work AND why we use them
    Caches and Virtual Memory
    Multi-processor
    ILP in software (Compiler, IA-64)
    Power
  Less major topics
    Memory disambiguation
    Branch prediction
      Direction and target
    Advanced OoO issues
      Superscalar, instruction scheduling, multi-threading, etc.
22
The big questions
  What is computer architecture?
  What are the metrics of performance?
  What are the techniques we use to maximize these metrics?
23
ILP in hardware (1/2)
  ILP definitions
    Hazards vs dependencies
    Data, Name and Control dependencies
    What ILP means and finding it.
  Dynamic Scheduling
    Tomasulo’s (three versions!)
    You can be promised a question on this!
  Branch Prediction
    Local, global, hybrid/correlating
    Tournament and gshare
    BTBs
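As a reminder of how gshare from the branch prediction list works, here is a minimal Python sketch: the global branch history is XORed with PC bits to index a table of 2-bit saturating counters. The table size, the `pc >> 2` alignment shift, the initial counter value, and the class name `Gshare` are all illustrative assumptions, not parameters from the lecture.

```python
class Gshare:
    """Toy gshare direction predictor (sizes and names are assumptions)."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                              # global taken/not-taken history
        self.counters = [1] * (1 << index_bits)       # 2-bit counters, weakly not-taken

    def _index(self, pc):
        # XOR the PC (dropping assumed 4-byte alignment bits) with global history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Train the counter that was used for the prediction, then shift history.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

A tournament predictor would sit on top of two such predictors (for example one local, one global) with a chooser table; that part is not sketched here.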
24
ILP in hardware (2/2)
  Multiple Issue
    Static
      Static Superscalar
      VLIW
    Dynamic superscalar
  Speculation
    Branch, data
  ILP limit studies
25
ILP in hardware: Questions
  True or False
    The original T-algorithm only allows reordering within basic blocks.
    In P6, if it weren’t for precise interrupts, it would be okay to retire instructions out-of-order as long as they had finished executing and a branch isn’t skipped over.
    ILP in hardware is limited in scope due to the “instruction window” which is basically the size of the RS.
26
Quick idea: SMT
  One processor, two threads.
27
Caching (1/2)
  There is a huge amount of stuff associated with caching.
  The important stuff
    Locality
      Temporal/Spatial
    3 C’s model
    Stack distance model
  Nuts-and-bolts
    Replacement policies (LRU, pseudo-LRU)
    Performance (hit rate, Thit, Tmiss, average access time)
    Write back/Write thru
    Block size
  Basic improvement
    Multi-level cache
    Critical word first
    Write buffers
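To make the stack distance model concrete, the sketch below computes stack distances for an address trace and, separately, simulates a fully-associative LRU cache so the two can be compared. The convention used here is an assumption (conventions differ by one between texts): the stack distance of an access is the number of distinct other lines touched since the previous access to the same line, and a first touch has no finite distance (`None`). Function names are hypothetical.

```python
def stack_distances(trace):
    """Stack distance per access (assumed convention: 0 = re-use with nothing in between)."""
    stack = []            # LRU stack, most recently used at the end
    distances = []
    for line in trace:
        if line in stack:
            pos = stack.index(line)
            distances.append(len(stack) - 1 - pos)   # distinct lines above it
            stack.pop(pos)
        else:
            distances.append(None)                    # first touch (compulsory)
        stack.append(line)
    return distances


def lru_hits(trace, n_lines):
    """Hit/miss per access for a fully-associative LRU cache with n_lines lines."""
    cache = []            # least recently used at the front
    hits = []
    for line in trace:
        hit = line in cache
        if hit:
            cache.remove(line)
        elif len(cache) == n_lines:
            cache.pop(0)                              # evict the LRU line
        cache.append(line)
        hits.append(hit)
    return hits
```

Running both on the same trace for different values of `n_lines` shows how a single pass over the stack distances predicts the hit/miss behaviour of every fully-associative LRU cache size at once.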
28
Caching (2/2)
  Non-standard caches
    Hash
    Victim
    Skew
  Misc.
    Virtual addresses and caching
    Impact of prefetching
    Latency hiding with OO execution
29
Cache: Questions (1/2)
  Changing __________ has an impact on compulsory misses.
  A victim cache is more likely to help with ________ than ________ though it can help both (3 C’s).
  At least _____ bits are required to keep exact track of LRU in a 5-way associative cache.
30
Cache: Questions (2/2)
  A ____________ cache has a number of sets equal to the number of lines in the cache.
  A fully-associative cache with N lines will miss an access that has a stack distance of ________ (state the largest range you can).
31
Multi-processor
  Amdahl’s law as it applies to MP.
  Bus-based multi-processor
    Snooping
      MESI
      Bus transaction types (BRL etc.)
  Distributed-shared
    Directory schemes
  Synchronization
    Critical sections
    Spin-locks
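For Amdahl's law as it applies to multi-processors, a minimal Python helper is shown below. It assumes the usual idealized form (a perfectly parallel fraction, no communication or synchronization overhead); the function and parameter names are illustrative.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Ideal speedup when parallel_fraction of the work scales across n_processors."""
    if not 0.0 <= parallel_fraction <= 1.0 or n_processors < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and n_processors >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)
```

For example, with 90% of the work parallel, 10 processors give a speedup of about 5.3, and no number of processors can exceed 10.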
32
Multi-processor: Question
Under the MESI protocol what is the advantage of having a distinct clean and dirty exclusive state?
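When thinking about this question it can help to have the state transitions in front of you. The sketch below is a simplified textbook-style MESI model for one cache line in one cache, not the exact protocol from lecture: the bus transaction names (`"read"`, `"read-for-ownership"`, `"invalidate"`) are generic placeholders rather than the BRL-style names, and the function names are hypothetical.

```python
M, E, S, I = "M", "E", "S", "I"


def local_access(state, is_write, others_have_copy=False):
    """Processor-side access. Returns (new_state, bus_transaction or None)."""
    if is_write:
        if state in (M, E):
            return M, None                      # already exclusive: no bus traffic
        if state == S:
            return M, "invalidate"              # must invalidate other sharers
        return M, "read-for-ownership"          # I: fetch the line and invalidate others
    if state == I:
        # Read miss: land in S if another cache holds the line, else E.
        return (S if others_have_copy else E), "read"
    return state, None                          # read hit in M, E or S


def snoop(state, remote_is_write):
    """Bus-side event from another cache. Returns (new_state, must_supply_dirty_data)."""
    if state == I:
        return I, False
    if remote_is_write:
        return I, state == M                    # lose the line; flush if dirty
    return S, state == M                        # remote read: downgrade; flush if dirty
```

Tracing a read miss followed by a write to a private line through `local_access`, once starting in E and once in S, shows where the bus transactions differ.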
33
Software techniques for ILP (1/2)
  Pipeline scheduling
    Reordering instructions in a basic block to remove pipe stalls
    Loop unrolling
  Static information passed to processor
    Static branch prediction
    Static dependence information
  Loop issues
    Detecting loop dependencies
    Software pipelining
34
Software techniques for ILP (2/2)
  Global code scheduling
  Predicated instructions and CMOV
  Memory reference speculation
  Issues with preserving exception behavior
  IA-64 as a case study of hardware support for software ILP techniques
    Speculative loads
    Advanced loads
    Software pipelining optimizations
35
Software techniques for ILP: Questions
  What is the most significant disadvantage of loop unrolling?
  Using CMOV, rewrite the following code snippet, removing the branch. Don’t change exception behavior, and assume DIV only causes an exception if R3=0.
          BNE R1 R2 skip
          R1=R2/R3
    skip: nop
36
Power
  Understand why it’s important
  Power vs. Energy
  How it’s related to the existence of multi-core
  Understand voltage scaling issues
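The power vs. energy distinction and the effect of voltage scaling can be illustrated with the standard first-order dynamic power model, P = a * C * V^2 * f. The sketch below is a back-of-the-envelope aid only: it ignores leakage, and the idea that frequency can be scaled down in proportion to voltage is a simplifying assumption. Function and parameter names are illustrative.

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """First-order dynamic power in watts: activity * C * V^2 * f (leakage ignored)."""
    return activity * c_eff * vdd ** 2 * freq


def energy_for_work(c_eff, vdd, freq, cycles, activity=1.0):
    """Energy in joules to run a fixed number of cycles at the given V and f."""
    seconds = cycles / freq
    return dynamic_power(c_eff, vdd, freq, activity) * seconds
```

Under these assumptions, halving both voltage and frequency cuts power to one eighth but energy for the same work only to one quarter, because the work takes twice as long. Two such half-speed cores would then draw roughly a quarter of the original power for the same ideal throughput, which is one way to see the link to multi-core.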