Korea University, G. Lee
CRE652 Processor Architecture
Course Objective: to gain
  (1) knowledge of the current issues in processor architectures, and
  (2) skills for performing architecture research
Reference texts:
- J. Smith and G. Sohi, "The Microarchitecture of Superscalar Processors", Proceedings of the IEEE
- Papers from ISCA, MICRO, and ICCD
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann
Superscalar Processor Model (block diagram)
- Blocks: I-Cache, BTB, I Buffer, Rename, Dispatch (scheduler), Instr. window, reservation station, Function units, Reg. File, D-Cache, ROB
- Pipeline stages: IF, DP, IS, WB, CT
- Related models: VLIW/EPIC, SMT
Memory Access Flow (diagram)
- Processor: virtual address from the program counter or a load/store instruction
- I-TLB, D-TLB
- Page table, page table pointer register
- Cache, memory
- Entry with Dirty = 1
Walls: limits in performance
- ILP wall
- Memory wall
- Power wall
ILP (Instruction Level Parallelism)
Fundamental limitation: data-flow dependency
Practical limiting factors:
- Instruction window size
- Branch prediction
- Data dependency
- Register renaming
- Memory-address alias
- Memory disambiguation
- (Resource conflicts)
- (Memory latency due to cache misses and lack of ports)
ILP (Instruction Level Parallelism)
With no limiting factors, i.e. an infinite window, infinite renaming registers, perfect branch prediction, and all memory addresses exactly known, the average ILP in programs is known to be quite high. With realistic limiting factors, however, IPC becomes fairly restricted.
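As a rough illustration of the "no limiting factors" case, the sketch below estimates the data-flow limit of a short instruction sequence: only true (RAW) dependencies delay an instruction, and ILP is the instruction count divided by the critical path length. The tuple encoding of instructions, the unit latency, and the function name `dataflow_ilp` are assumptions made for this example; they are not taken from the slides or from any of the cited studies.

```python
def dataflow_ilp(instrs):
    """Estimate ILP under ideal conditions (assumed model, unit latency).

    instrs: list of (dest, [srcs]) register names in program order.
    Only RAW dependencies are honored; WAR/WAW are ignored, which
    corresponds to having unlimited renaming registers.
    Returns (ilp, critical_path_length).
    """
    if not instrs:
        return 0.0, 0
    ready = {}   # register -> cycle at which its latest value is available
    depth = 0    # length of the longest dependency chain seen so far
    for dest, srcs in instrs:
        # An instruction can start once all of its source values are ready.
        start = max((ready.get(s, 0) for s in srcs), default=0)
        finish = start + 1
        ready[dest] = finish
        depth = max(depth, finish)
    return len(instrs) / depth, depth


program = [
    ("r1", []),            # independent
    ("r2", []),            # independent
    ("r3", ["r1", "r2"]),  # RAW on r1 and r2
    ("r4", ["r3"]),        # RAW on r3
    ("r5", []),            # independent
    ("r6", ["r5"]),        # RAW on r5
]
print(dataflow_ilp(program))  # (2.0, 3)
```

Six instructions with a three-instruction critical path give an ILP of 2 in this toy model. Real limit studies additionally model memory dependencies and control flow.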
ILP Limit
1. Foster and Riseman, "Percolation of code to enhance parallel dispatching and execution", IEEE Trans. Computers, Vol. C-21, Dec.
   Table: no. of branches bypassed vs. ILP, from 0 (basic block) to ∞; ILP = 51.2 at ∞
ILP Limit
2. SPEC92 (H&P text, Fig. 3.1, p. 157): ILP ranges from 17.9 for li up to the value for tomcatv
3. M. A. Postiff, "The Limits of ILP in SPEC95 Applications", INTERACT-3, ACM Computer Architecture News, Vol. 27, No. 1, Mar.
   - With no memory aliasing: ILP ranges from li to mgrid (61.47 for tomcatv)
   - With the stack dependency (for allocating activation records) removed: ILP ranges from li to mgrid
ILP due to practical limiting factors
Limiting factors (H&P text, pp. 152-170):
- Instruction window size: more instructions to consider, better ILP potential
- Branch prediction accuracy: fewer wasted cycles
- Renaming registers: more registers, better chance to remove WAR and WAW
- Memory aliasing: more accurate memory dependency
- Resources: matching function unit types available to ILP
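To make the renaming bullet concrete, here is a minimal sketch of how giving every destination a fresh physical register removes WAR and WAW hazards while preserving RAW dependencies. It assumes an unlimited supply of physical registers named `p0`, `p1`, and so on. A real processor has a finite pool, which is exactly why the number of renaming registers is a limiting factor. The function name `rename` and the instruction encoding are hypothetical.

```python
from itertools import count


def rename(instrs):
    """Rename destinations to fresh physical registers (assumed unlimited).

    instrs: list of (dest, [srcs]) architectural register names in program order.
    Returns the same sequence with physical names substituted.
    """
    fresh = count()   # generator of new physical register numbers
    mapping = {}      # architectural register -> current physical register
    renamed = []
    for dest, srcs in instrs:
        # Sources read the most recent mapping, so RAW dependencies are kept.
        new_srcs = [mapping.get(s, s) for s in srcs]
        # Each write gets a new physical register, so WAR/WAW disappear.
        phys = f"p{next(fresh)}"
        mapping[dest] = phys
        renamed.append((phys, new_srcs))
    return renamed


code = [
    ("r1", ["r2", "r3"]),  # i0
    ("r2", ["r4"]),        # i1: WAR with i0 on r2
    ("r1", ["r5"]),        # i2: WAW with i0 on r1
    ("r6", ["r1"]),        # i3: RAW on i2
]
for line in rename(code):
    print(line)
```

After renaming, i1 and i2 no longer conflict with i0, and only the true dependency from i2 to i3 (through `p2`) remains.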
ILP due to practical limiting factors
Limiting factor: instruction window size
- Instruction window: the set of instructions examined for simultaneous execution (reservation station + current fetch)
- Max. no. of comparisons = no. of completing instructions × no. of instructions waiting to be issued × 2 (assuming at most two source operands per instruction)
- With a typical window size of 64 to 128, this comparison is time-critical
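The comparison count above is a simple product, and a few numbers show how quickly it grows. The sketch below evaluates it for the two window sizes mentioned on the slide. The figure of 8 completing instructions per cycle is an assumption chosen only for illustration, and `wakeup_comparisons` is a hypothetical helper name.

```python
def wakeup_comparisons(completing, waiting, sources_per_instr=2):
    """Maximum tag comparisons per cycle in the instruction window.

    completing: instructions producing results this cycle (assumed value)
    waiting: instructions in the window waiting to be issued
    sources_per_instr: source operands per instruction (2, as on the slide)
    """
    return completing * waiting * sources_per_instr


for window in (64, 128):
    # 8 completing instructions per cycle is an illustrative assumption.
    print(window, wakeup_comparisons(8, window))
```

With these assumed inputs the count is 1024 comparisons for a 64-entry window and 2048 for a 128-entry window, all of which must finish within the issue stage.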
ILP due to practical limiting factors
Limiting factor: instruction window size
e.g. ILP vs. window size (from H&P text, Fig. 3.2, p. 159)
Note:
1. effects of window size
2. inefficiency of larger window
ILP due to practical limiting factors
Limiting factor: branch prediction
e.g. ILP vs. branch prediction (from H&P text, Fig. 3.3, p. 160)
Note (legend):
- perf: perfect branch prediction
- comb: tournament predictor
- bi: bimodal predictor (2-bit counter)
- stat: static prediction with profiling
- none: no prediction
Note (assumptions): instruction window size 2K; issue limit 64; jump prediction with a 2K-entry table
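The "bi" scheme in the legend is a table of 2-bit saturating counters. The following is a minimal sketch of such a predictor, not the exact configuration evaluated in the H&P figure: the table size, the initial counter value (weakly not-taken), and the indexing by `pc >> 2` (assuming 4-byte instructions) are all assumptions for illustration.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch address."""

    def __init__(self, entries=2048):
        self.entries = entries
        # Counter states: 0,1 predict not-taken; 2,3 predict taken.
        self.counters = [1] * entries  # start weakly not-taken (assumption)

    def _index(self, pc):
        # Drop the low 2 bits, assuming 4-byte aligned instructions.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_correct(predictor, pc, outcomes):
    """Predict, then train, on each outcome; return the number of hits."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct


# A loop branch taken 9 times and then not taken, executed 10 times over.
outcomes = ([True] * 9 + [False]) * 10
print(count_correct(BimodalPredictor(), 0x400, outcomes), "of", len(outcomes))
```

On this pattern the counter mispredicts the first taken branch and then only each loop exit, giving 89 correct predictions out of 100. The 2-bit hysteresis is what keeps a single loop exit from flipping the next prediction.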
ILP due to practical limiting factors
Limiting factor: renaming registers
e.g. ILP vs. additional rename registers (from H&P text, Fig. 3.5, p. 163)
Note: instruction window size 2K; issue limit 64; combining predictor with 8K entries in total; jump prediction with a 2K-entry table
ILP due to practical limiting factors
Limiting factor: memory aliasing
e.g.
  ld $3, #200($4)
  st $5, #200($6)
How can we be sure about a dependency between the two memory locations ($4)+200 and ($6)+200?
- Perfect: known only after executing the program
- Global and stack references:
  - global data region
  - stack access for local variables (activation records)
  - unknown, i.e. assume conflicts, for the heap region used by dynamic data structures
- Inspection: compile-time region analysis
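The question posed by the ld/st pair can be phrased as an address-overlap check that falls back to "assume conflict" whenever a base register value is not yet known. The sketch below is a simplified model under stated assumptions: 4-byte accesses, `None` standing for an unresolved base register, and the hypothetical helper name `may_alias`. It is not a description of any particular processor's disambiguation hardware.

```python
def may_alias(base_a, off_a, base_b, off_b, size=4):
    """Conservative check: can two memory accesses touch the same bytes?

    base_a, base_b: base register values, or None if not yet known.
    off_a, off_b: immediate offsets.
    size: access width in bytes (4 is an assumption).
    """
    if base_a is None or base_b is None:
        # Address unknown: assume a conflict, as for heap references.
        return True
    addr_a = base_a + off_a
    addr_b = base_b + off_b
    # The two byte ranges [addr, addr + size) overlap.
    return addr_a < addr_b + size and addr_b < addr_a + size


# ld $3, 200($4) versus st $5, 200($6), with example register values.
print(may_alias(1000, 200, 1000, 200))  # same address: dependent
print(may_alias(1000, 200, 2000, 200))  # different addresses: independent
print(may_alias(None, 200, 2000, 200))  # $4 unknown: must assume a conflict
```

Equal offsets alone say nothing about the dependency. The outcome depends on the base register values, which is why alias detection needs either run-time address comparison or compile-time region knowledge.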
ILP due to practical limiting factors
Limiting factor: memory aliasing
e.g. ILP vs. aliasing detection schemes (from H&P text, Fig. 3.6, p. 164)
Note (assumptions): instruction window size 2K; issue limit 64 with 256 registers; combining predictor with 8K entries in total; jump prediction with a 2K-entry table
Note (legend):
- P: perfect alias resolution
- G/S: global/stack
- Ins: inspection
ILP Limit
A realizable superscalar processor (H&P text, Sec. 3.3), with rather realistic assumptions:
- 64-issue with no issue restrictions
- Tournament predictor with 1K entries
- 16-entry jump return predictor
- 256-instruction window
- No alias within the window
- 64 additional renaming registers
Note: having no issue restrictions is virtually impossible even for a lower issue count, say 16.
ILP Limit – Realistic Processor (figure; annotation: around 25%)
ILP Limit – Realistic Processor
ILP potential in software
- ILP limited by resources: window size, function unit mismatch, registers
- ILP limited by dependency: branch prediction, false dependency, output dependency (WAW), data dependency (RAW)
Processor Architecture Comparison (H&P text, Sec. 3.6)
Columns: processor, microarchitecture, fetch/issue/execute, FU, clock rate (GHz), transistors, die size, power
- Intel Pentium 4 Extreme
  Microarchitecture: speculative, dynamically scheduled; deeply pipelined; SMT
  Fetch/issue/execute: 3/3/4
  FU: 7 int., 1 FP
  Die size: 122 mm²
- AMD Athlon 64 FX-57
  Microarchitecture: speculative, dynamically scheduled
  Fetch/issue/execute: 3/3/4
  FU: 6 int., 3 FP
  Die size: 115 mm²
- IBM Power5 (1 CPU only)
  Microarchitecture: speculative, dynamically scheduled; SMT; 2 CPU cores/chip
  Fetch/issue/execute: 8/4/8
  FU: 6 int., 2 FP
  Die size: 300 mm² (est.)
  Power: 80 W (est.)
- Intel Itanium 2
  Microarchitecture: statically scheduled, VLIW-style
  Fetch/issue/execute: 6/5/11
  FU: 9 int., 2 FP
  Die size: 423 mm²
Performance on SPECint2000 (figure)
Performance on SPECfp2000 (figure)
Normalized Performance: Efficiency
Rank        Itanium 2   Pentium 4   Athlon   Power5
Int/Trans   4           2           1        3
FP/Trans    4           2           1        3
Int/area    4           2           1        3
FP/area     4           2           1        3
Int/Watt    4           3           1        2
FP/Watt     2           4           3        1
Superscalar processor
N-way superscalar:
- Fetch and decode N instructions
- N "ready" instructions are "issued" to function units
- Stages: fetch, decode, renaming, dispatch, issue, execution, writeback/commit
- After issue, execution begins
- The maximum number of instructions a processor can issue simultaneously is the "issue width"
- The actual issue rate is much less
- Fetch = Decode > Issue = Execute > Commit
Note: Can we keep going down the superscalar path for better performance?
- Increasing the instruction window, issue width, and data path width
  → wire delay becomes a more important factor
  → a clustered organization may help: frequent intra-cluster operations, infrequent inter-cluster operations
- Simpler may be better? But it does not fully utilize the available on-chip resources
- Adapting a multiprocessor approach? How to control multiple processors for multiple instructions
Note: Removing the dependency limit
1. The current practice/convention of the programming model imposes unnecessary dependencies
   - WAR and WAW through memory: because of the way stack frames are allocated and deallocated, a procedure may reuse memory locations that a previous procedure used on the stack
   - Specific use of registers: loop counter, return address register, stack pointer
2. Going beyond the data-flow limit
   - Data value prediction with speculation
     - general value prediction: unlikely
     - address value prediction
     - constant/loop index value prediction
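Constant and loop-index values are predictable because they follow a fixed stride (zero for constants). One common way to capture this is a stride predictor, sketched below as an illustration of the idea rather than as the scheme the slide has in mind. The class name `StridePredictor`, the per-PC dictionary, and the absence of any confidence mechanism are simplifying assumptions.

```python
class StridePredictor:
    """Predict an instruction's next result as last value + last stride."""

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None:
            return None  # no history yet, so no prediction
        last, stride = entry
        return last + stride

    def update(self, pc, value):
        entry = self.table.get(pc)
        # The first observation has no previous value, so assume stride 0.
        stride = value - entry[0] if entry else 0
        self.table[pc] = (value, stride)


def count_hits(values, pc=0x100):
    """Run the predictor over a value stream and count correct predictions."""
    predictor = StridePredictor()
    hits = 0
    for v in values:
        if predictor.predict(pc) == v:
            hits += 1
        predictor.update(pc, v)
    return hits


print(count_hits([0, 4, 8, 12, 16]))  # loop index or address with stride 4
print(count_hits([7, 7, 7]))          # constant value
```

A correct prediction lets dependent instructions start speculatively before the producer finishes, which is how value prediction can exceed the data-flow limit. A misprediction requires recovery, so practical designs add confidence counters.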
Dealing with Other Walls
- Memory wall
  - Faster multilevel cache
  - Non-blocking pipelined cache
  - Cache in multicore processor
  - Transactional memory
- Power wall
  - Lower driving voltage
  - Allowing errors
Adding New Functionality
- Network and I/O related: bypassing OS intervention
- Multimedia: vector instructions
- Trusted computing: Trusted Platform Module