CRE652 Processor Architecture (Korea University, G. Lee, 2009)


1. CRE652 Processor Architecture
Course objective: to gain
  (1) knowledge of the current issues in processor architectures, and
  (2) skills for performing architecture research.
Reference texts:
- J. Smith and G. Sohi, "The Microarchitecture of Superscalar Processors", Proceedings of the IEEE, 1995.
- Papers from ISCA, MICRO, and ICCD.
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann.

2. Superscalar Processor Model
[Block diagram] Labels: Reg. File, rename, D-Cache, Function units, ROB, I-Cache, I Buffer, BTB, Instr. window, Dispatch (scheduler), Rename, reservation station
Pipeline stages: IF, DP, IS, WB, CT
Also labeled: VLIW – EPIC, SMT

3. Memory Access Flow
[Diagram] Labels: Processor; Virtual address (from program counter or Load/Store instruction); I-TLB; D-TLB; Page table; Page table pointer register; Entry with Dirty = 1; Cache; Memory

4. Walls: Limits on Performance
- ILP Wall
- Memory Wall
- Power Wall

5. ILP (Instruction-Level Parallelism)
Fundamental limitation: data-flow dependency
Practical limiting factors:
- Instruction window size
- Branch prediction
- Data dependency
- Register renaming
- Memory-address alias
- Memory disambiguation
- (Resource conflicts)
- (Memory latency due to cache misses and lack of ports)

6. ILP (Instruction-Level Parallelism)
With no limiting factors, i.e. an infinite window, infinite renaming registers, perfect branch prediction, and all memory addresses known exactly, the average ILP in programs is known to be quite high. With realistic limiting factors, however, IPC becomes fairly restricted.
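A minimal sketch of the idealized case described above, assuming unit latency for every instruction, unlimited function units, perfect branch prediction, and unlimited renaming, so that only true (RAW) dependences remain. The trace format and the name `ideal_ilp` are hypothetical, chosen only for illustration.

```python
def ideal_ilp(trace):
    """Return (number of instructions) / (critical-path length) for a trace.

    trace: list of (dest, sources) tuples in program order.
    Assumption: unit latency and no resource limits, so an instruction can
    execute one cycle after the last producer of its source operands.
    """
    ready_cycle = {}  # register -> cycle in which its latest value is produced
    depth = 0         # longest dependence chain seen so far
    for dest, sources in trace:
        cycle = 1 + max((ready_cycle.get(s, 0) for s in sources), default=0)
        ready_cycle[dest] = cycle
        depth = max(depth, cycle)
    return len(trace) / depth if trace else 0.0


# Hypothetical six-instruction trace: two independent loads feed short chains.
trace = [
    ("r1", []),
    ("r2", []),
    ("r3", ["r1", "r2"]),
    ("r4", ["r3"]),
    ("r5", ["r1"]),
    ("r6", ["r2"]),
]
ilp = ideal_ilp(trace)
```

The longest chain here (r1 → r3 → r4) sets the data-flow limit; every realistic constraint added on top can only lower the result.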

7. ILP Limit
1. Foster and Riseman, "Percolation of Code to Enhance Parallel Dispatching and Execution", IEEE Trans. Computers, Vol. C-21, Dec. 1972.

No. of branches bypassed    ILP
0 (basic block)             1.72
1                           2.72
2                           3.61
8                           7.21
36                          14.8
128                         24.4
∞                           51.2

8. ILP Limit
2. SPEC92 (H&P text, Fig. 3.1, p. 157): ILP = 17.9 for li to 150.1 for tomcatv.
3. M. A. Postiff, "The Limits of ILP in SPEC95 Applications", INTERACT-3, ACM Computer Architecture News, Vol. 27, No. 1, Mar. 1999.
   - With no memory aliasing: 19.62 for li to 3933.03 for mgrid (61.47 for tomcatv)
   - With stack dependency (for allocating activation records) removed: 81.45 for li to 4003.44 for mgrid

9. ILP due to practical limiting factors
Limiting factors (H&P text, pp. 152-170):
- Instruction window size: more instructions to consider, better ILP potential
- Branch prediction accuracy: fewer wasted cycles
- Renaming registers: more registers, better chance to remove WAR and WAW
- Memory aliasing: more accurate memory dependency
- Resources: matching function unit types available to ILP

10. ILP due to practical limiting factors
Limiting factor: instruction window size
Instruction window:
- Set of instructions examined for simultaneous execution (reservation station + current fetch)
- Max. no. of comparisons = no. of completing instructions × no. of instructions waiting to be issued × 2 (assuming at most two source operands per instruction)
- With a typical window size of 64 to 128, this is time-critical
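The comparison count above is simple arithmetic, sketched here for the two typical window sizes. The function name `wakeup_comparisons` and the choice of 4 completing instructions per cycle are assumptions for illustration only.

```python
def wakeup_comparisons(completing, waiting, sources_per_instr=2):
    """Worst-case tag comparisons per cycle: every completing result is
    compared against every source operand of every waiting instruction."""
    return completing * waiting * sources_per_instr


# Assumption: 4 instructions complete per cycle.
for window in (64, 128):
    print(window, wakeup_comparisons(completing=4, waiting=window))
```

Because all of these comparisons must finish within one cycle, the count grows with both the window size and the completion width, which is why the window is time-critical.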

11. ILP due to practical limiting factors
Limiting factor: instruction window size
e.g. ILP vs. window size (from H&P text, Fig. 3.2, p. 159)
Note:
1. Effects of window size
2. Inefficiency of larger windows

12. ILP due to practical limiting factors
Limiting factor: branch prediction
e.g. ILP vs. branch prediction (from H&P text, Fig. 3.3, p. 160)
Note (legend):
- perf: perfect branch prediction
- comb: tournament predictor
- bi: bimodal predictor (2-bit counter)
- stat: static prediction with profiling
- none: no prediction
Note (assumptions): instruction window size 2K; issue limit 64; jump prediction with a 2K-entry table
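A minimal sketch of a bimodal predictor built from 2-bit saturating counters, the "bi" scheme in the legend. The table size of 2048 entries, the indexing by `pc >> 2`, and the initial counter state are assumptions for illustration and are not taken from the H&P experiment.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch address.

    Counter values 0-1 predict not taken, 2-3 predict taken.
    """

    def __init__(self, entries=2048):
        self.entries = entries
        self.counters = [1] * entries  # assumption: start weakly not taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumption: 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly, training as we go."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)


# Hypothetical loop branch: taken 9 times, then not taken, repeated 10 times.
outcomes = ([True] * 9 + [False]) * 10
acc = accuracy(BimodalPredictor(), 0x400, outcomes)
```

The 2-bit hysteresis means a single loop exit does not flip the prediction, so a loop branch like this one is mispredicted only about once per loop execution after warm-up.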

13. ILP due to practical limiting factors
Limiting factor: renaming registers
e.g. ILP vs. additional rename registers (from H&P text, Fig. 3.5, p. 163)
Note (assumptions): instruction window size 2K; issue limit 64; combining predictor with 8K entries in total; jump prediction with a 2K-entry table
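A minimal sketch of how renaming removes WAR and WAW dependences, assuming an unlimited pool of physical registers (the figure studies what happens when that pool is limited). The trace format and the name `rename` are hypothetical.

```python
from itertools import count


def rename(trace):
    """Map architectural registers to fresh physical registers.

    trace: list of (dest, sources) tuples in program order.
    Every write gets a new physical register, so only RAW dependences
    survive in the renamed trace.
    """
    fresh = count()  # assumption: unlimited physical registers
    mapping = {}     # architectural register -> current physical register
    renamed = []
    for dest, sources in trace:
        phys_sources = []
        for s in sources:
            if s not in mapping:  # live-in value: give it a register once
                mapping[s] = f"p{next(fresh)}"
            phys_sources.append(mapping[s])
        mapping[dest] = f"p{next(fresh)}"  # new name for every write
        renamed.append((mapping[dest], phys_sources))
    return renamed


trace = [
    ("r1", ["r2", "r3"]),
    ("r2", ["r4"]),   # WAR on r2 with the first instruction
    ("r1", ["r5"]),   # WAW on r1 with the first instruction
    ("r6", ["r1"]),   # RAW on r1 with the third instruction
]
renamed = rename(trace)
```

After renaming, the second and third instructions write registers that nothing earlier reads or writes, so they no longer have to wait for the first instruction; the RAW link from the third to the fourth instruction is preserved.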

14. ILP due to practical limiting factors
Limiting factor: memory aliasing
e.g.
    ld $3, #200($4)
    st $5, #200($6)
How can we be sure about the dependency between the two memory locations ($4)+200 and ($6)+200?
- Perfect: known after executing the program
- Global reference and stack references:
    - Global data region
    - Stack access for local variables (activation records)
    - Unknown, i.e. assume conflicts, for the heap region used by dynamic data structures
- Inspection: compile-time region analysis
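A minimal sketch of a conservative alias check between a load and a store, assuming 4-byte accesses and treating an unknown base register as a possible conflict. The function name `may_alias` and the register values used are hypothetical.

```python
def may_alias(base_a, off_a, base_b, off_b, size=4):
    """Return True if two memory accesses might overlap.

    base_a, base_b: base register values, or None when not yet known.
    Assumption: both accesses touch `size` bytes.
    """
    if base_a is None or base_b is None:
        return True  # unknown address: conservatively assume a conflict
    addr_a = base_a + off_a
    addr_b = base_b + off_b
    return addr_a < addr_b + size and addr_b < addr_a + size


# Hypothetical register contents for a load via $4 and a store via $6.
unknown = may_alias(None, 200, 0x2000, 200)        # $4 not yet computed
same = may_alias(0x1000, 200, 0x1000, 200)         # $4 == $6
different = may_alias(0x1000, 200, 0x2000, 200)    # disjoint regions
```

Only the third case lets the two accesses be reordered safely; the first case is the one that perfect alias resolution would settle and that a realistic scheme has to treat as a dependence.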

15. ILP due to practical limiting factors
Limiting factor: memory aliasing
e.g. ILP vs. aliasing detection schemes (from H&P text, Fig. 3.6, p. 164)
Note (assumptions): instruction window size 2K; issue limit 64, with 256 registers; combining predictor with 8K entries in total; jump prediction with a 2K-entry table
Note (legend):
- P: perfect alias resolution
- G/S: global/stack
- Ins: inspection

16. ILP Limit
A realizable superscalar processor (H&P text, Sec. 3.3), with rather realistic assumptions:
- 64-issue with no issue restrictions
- Tournament predictor with 1K entries
- 16-entry jump return predictor
- Instruction window of 256
- No alias within window
- 64 additional renaming registers
Note: having no issue restrictions is virtually impossible even for a lower issue count, say 16.

17. ILP Limit: Realistic Processor
[Figure] Annotation: around 25%

18. ILP Limit: Realistic Processor
ILP potential in software
- ILP limited by resources
    - Window size
    - Function unit mismatch
    - Registers
- ILP limited by dependency
    - Branch prediction
    - False dependency
    - Output dependency (WAW)
    - Data dependency (RAW)

19. Processor Architecture Comparison (H&P text, Sec. 3.6)

Processor                 Fetch/Issue/Execute   FUs             Clock rate (GHz)   Transistors   Die size          Power
Intel Pentium 4 Extreme   3/3/4                 7 int., 1 FP    3.8                125 M         122 mm^2          115 W
AMD Athlon 64 FX-57       3/3/4                 6 int., 3 FP    2.8                114 M         115 mm^2          104 W
IBM Power5 (1 CPU only)   8/4/8                 6 int., 2 FP    1.9                200 M         300 mm^2 (est.)   80 W (est.)
Intel Itanium 2           6/5/11                9 int., 2 FP    1.6                592 M         423 mm^2          130 W

Microarchitecture:
- Intel Pentium 4 Extreme: speculative, dynamically scheduled; deeply pipelined; SMT
- AMD Athlon 64 FX-57: speculative, dynamically scheduled
- IBM Power5 (1 CPU only): speculative, dynamically scheduled; SMT; 2 CPU cores/chip
- Intel Itanium 2: statically scheduled, VLIW-style

20. Performance on SPECint2000
[Figure]

21. Performance on SPECfp2000
[Figure]

22. Normalized Performance: Efficiency

Rank        Itanium 2   Pentium 4   Athlon   Power5
Int/Trans   4           2           1        3
FP/Trans    4           2           1        3
Int/area    4           2           1        3
FP/area     4           2           1        3
Int/Watt    4           3           1        2
FP/Watt     2           4           3        1

23. Superscalar processor
N-way superscalar:
- Fetch and decode N instructions
- N "ready" instructions are "issued" to function units
Stages: fetch, decode, renaming, dispatch, issue, execution, writeback/commit
After issue, execution begins.
The maximum number of instructions a processor can issue simultaneously is the "issue width". The actual issue rate is much lower.
Fetch = Decode > Issue = Execute > Commit
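A minimal sketch of why the actual issue rate falls below the issue width: each cycle, at most `width` ready instructions are issued, oldest first. It assumes unit latency, perfect renaming (only RAW dependences), and no fetch or function-unit limits; the trace format and the name `issue_cycles` are hypothetical.

```python
def issue_cycles(trace, width):
    """Return the number of cycles needed to issue every instruction.

    trace: list of (dest, sources) tuples in program order.
    An instruction is ready once all its producers issued in an earlier cycle.
    """
    last_writer = {}  # register -> index of its most recent producer
    producers = []    # producers[i] = indices instruction i depends on
    for i, (dest, sources) in enumerate(trace):
        producers.append([last_writer[s] for s in sources if s in last_writer])
        last_writer[dest] = i

    issued_at = {}    # instruction index -> cycle in which it issued
    pending = list(range(len(trace)))
    cycle = 0
    while pending:
        cycle += 1
        issued = []
        for i in pending:  # oldest first
            if len(issued) == width:
                break
            if all(p in issued_at for p in producers[i]):
                issued.append(i)
        for i in issued:   # record after the scan so same-cycle results are not used
            issued_at[i] = cycle
        pending = [i for i in pending if i not in issued_at]
    return cycle


trace = [
    ("r1", []),
    ("r2", []),
    ("r3", ["r1", "r2"]),
    ("r4", ["r3"]),
    ("r5", ["r1"]),
    ("r6", ["r2"]),
]
cycles_wide = issue_cycles(trace, width=4)
rate_wide = len(trace) / cycles_wide
```

With this trace, a 4-wide machine is held back by the dependence chain rather than by its width, so the achieved issue rate stays well under 4.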

24. Note: Can we keep following the superscalar path for better performance?
Increasing the instruction window, issue width, and data path width
→ wire delay becomes a more important factor
→ a clustered organization may help: frequent intra-cluster operations, infrequent inter-cluster operations
Simpler may be better? But it does not fully utilize the available on-chip resources.
Adopting a multiprocessor approach? How to control multiple processors for multiple instructions?

25. Note: Removing the dependency limit
1. The current practice/convention of the programming model imposes unnecessary dependencies
   - WAR and WAW through memory: because of the way stack frames are allocated and deallocated, a procedure may reuse memory locations that a previous procedure used on the stack
   - Specific uses of registers: loop counter, return address register, stack pointer
2. Going beyond the data-flow limit: data value prediction with speculation
   - General value prediction: unlikely
   - Address value prediction
   - Constant/loop index value prediction
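A minimal sketch of constant and loop-index value prediction using a last-value-plus-stride table, one common way such values can be predicted. The class name `StridePredictor`, the per-instruction table keyed by address, and the example value streams are assumptions for illustration.

```python
class StridePredictor:
    """Predict the next result of an instruction as last value + last stride."""

    def __init__(self):
        self.table = {}  # instruction address -> (last_value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None  # no history yet: make no prediction
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (value, value - last)
        else:
            self.table[pc] = (value, 0)


def count_hits(values, pc=0x400):
    """Number of values predicted correctly, training after each one."""
    predictor = StridePredictor()
    hits = 0
    for v in values:
        if predictor.predict(pc) == v:
            hits += 1
        predictor.update(pc, v)
    return hits


loop_index_hits = count_hits(list(range(0, 40, 4)))  # 0, 4, 8, ..., 36
constant_hits = count_hits([7] * 10)
```

A correct prediction lets dependent instructions start before the producer finishes, which is how speculation can go beyond the data-flow limit; a wrong prediction has to be detected and the dependent work redone.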

26. Dealing with Other Walls
Memory Wall
- Faster multilevel cache
- Non-blocking pipelined cache
- Cache in multicore processor
    Transactional memory
Power Wall
- Lower driving voltage
- Allowing errors

27. Adding New Functionality
Network and I/O related
- Bypassing OS intervention
Multimedia
- Vector instructions
Trusted Computing
- Trusted Platform Module

