By Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang. Presented by Dheeraj Kumar Kaveti.


• Trend: wide-issue superscalar processors.
• Limitation: wider issue requires much more logic circuitry.
• Performance comparison: a six-issue dynamically scheduled superscalar processor versus a multiprocessor built from four two-issue processors (4 × 2-issue).

• The Limits of the Superscalar Approach
• The Case for a Single-Chip Multiprocessor
• Floor plans for a six-issue superscalar microarchitecture and a 4 × two-issue superscalar multiprocessor
• Comparison of results for the two processors

• Out-of-program-order execution relies on dynamic scheduling.
• Hardware tracks register dependencies between instructions.
• The three phases of a superscalar processor are fetch, issue, and execute.
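To make dynamic scheduling concrete, here is a minimal Python sketch of out-of-order issue. It is a toy model under loudly stated assumptions: registers are already renamed (so only read-after-write dependencies matter), functional units are unlimited, and fetch is perfect. The names `Instr`, `schedule`, and the register names are hypothetical, not taken from the paper. Each cycle it issues up to `issue_width` of the oldest instructions whose operands are ready, so a younger independent instruction can issue ahead of an older stalled one.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str          # destination register (assumed already renamed)
    srcs: tuple        # source registers
    latency: int = 1   # cycles until the result can be consumed


def schedule(program, issue_width):
    """Return {instruction name: issue cycle} for a toy dynamic scheduler."""
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    # Producers of each instruction: the most recent earlier writer of each
    # source register (true read-after-write dependencies).
    last_writer = {}
    producers = []
    for i, ins in enumerate(program):
        producers.append([last_writer[r] for r in ins.srcs if r in last_writer])
        last_writer[ins.dest] = i

    issue_cycle = [None] * len(program)
    cycle = 0
    while None in issue_cycle:
        issued = 0
        # Oldest-first selection: scan in program order, skipping
        # instructions whose operands are not ready yet.
        for i in range(len(program)):
            if issued == issue_width:
                break
            if issue_cycle[i] is not None:
                continue
            if all(issue_cycle[p] is not None
                   and issue_cycle[p] + program[p].latency <= cycle
                   for p in producers[i]):
                issue_cycle[i] = cycle
                issued += 1
        cycle += 1
    return {ins.name: c for ins, c in zip(program, issue_cycle)}


program = [
    Instr("load", "r1", ("r2",), latency=3),
    Instr("add", "r3", ("r1", "r4")),   # waits for the load
    Instr("sub", "r5", ("r6", "r7")),   # independent, issues early
    Instr("mul", "r8", ("r5", "r5")),   # depends only on sub
]
print(schedule(program, issue_width=2))
# {'load': 0, 'add': 3, 'sub': 0, 'mul': 1}
```

Here `sub` and `mul` issue before `add`, which is stalled on the three-cycle load: that is the out-of-program-order behavior the slide describes, and the dependency bookkeeping in `producers` is a software stand-in for the hardware that tracks register dependencies.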

• Three factors constrain instruction fetch: mispredicted branches, instruction misalignment, and cache misses.
• Even with good branch prediction and alignment, a significant instruction cache miss rate will limit performance.
• Fortunately, it is possible to hide some of the instruction cache miss latency.
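The two fetch limits on this slide can be illustrated with a rough Python model; it is a hypothetical sketch, not the paper's simulator. `fetched_per_cycle` assumes PCs count instructions, the fetch unit reads one aligned block of `fetch_width` instructions per cycle, and it cannot cross into the next block or past a taken branch. `effective_fetch_rate` assumes every I-cache miss stalls fetch for the full penalty with no overlap, so it deliberately ignores the latency hiding the slide mentions and gives a pessimistic estimate.

```python
def fetched_per_cycle(start_pc, fetch_width, taken_branch_pc=None):
    """Instructions delivered by one fetch from an aligned fetch block."""
    if fetch_width < 1 or start_pc < 0:
        raise ValueError("need fetch_width >= 1 and start_pc >= 0")
    block_end = (start_pc // fetch_width + 1) * fetch_width  # first PC of next block
    end = block_end
    if taken_branch_pc is not None and start_pc <= taken_branch_pc < block_end:
        end = taken_branch_pc + 1  # the branch is delivered, what follows it is not
    return end - start_pc


def effective_fetch_rate(avg_delivered, miss_rate, miss_penalty):
    """Average instructions per cycle once unhidden I-cache stalls are included."""
    return avg_delivered / (1 + miss_rate * miss_penalty)


print(fetched_per_cycle(8, 8))                      # aligned start: 8
print(fetched_per_cycle(13, 8))                     # misaligned branch target: 3
print(fetched_per_cycle(8, 8, taken_branch_pc=10))  # taken branch cuts the block: 3
print(effective_fetch_rate(6.0, 0.02, 50))          # 3.0
```

With these made-up numbers, a 2% miss rate and a 50-cycle penalty halve the fetch rate, even when prediction and alignment are perfect.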

• There are two ways to implement register renaming:
  1. an explicit table that maps architectural registers to physical registers;
  2. a combined reorder buffer/instruction queue.
• The advantage of the mapping table is that no comparisons are required to rename registers.
• The disadvantage of the mapping table is the number of access ports it requires.
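A minimal Python sketch of the first option, the explicit mapping table, follows. `RenameTable` and `map_table_ports` are hypothetical names, freeing old physical registers at retirement is not modeled, and the reorder-buffer alternative is not shown. It illustrates both points on the slide: renaming a source is a plain indexed lookup (no comparisons), but every instruction renamed per cycle needs its own table ports. The port count assumes each instruction reads two source mappings plus the old destination mapping and writes one new mapping.

```python
class RenameTable:
    """Map-table renamer: architectural -> physical register numbers."""

    def __init__(self, num_arch, num_phys):
        if num_phys < num_arch:
            raise ValueError("need at least one physical register per architectural one")
        self.map = list(range(num_arch))           # arch reg i starts in phys reg i
        self.free = list(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (new_dest, old_dest, renamed_srcs)."""
        renamed_srcs = [self.map[s] for s in srcs]  # indexed reads, no comparators
        if not self.free:
            raise RuntimeError("no free physical register: rename would stall")
        new_dest = self.free.pop(0)
        old_dest = self.map[dest]  # would be freed when this instruction retires
        self.map[dest] = new_dest
        return new_dest, old_dest, renamed_srcs


def map_table_ports(issue_width, srcs_per_instr=2):
    """(read ports, write ports) to rename issue_width instructions per cycle."""
    reads = issue_width * (srcs_per_instr + 1)  # sources plus old destination mapping
    writes = issue_width                        # one new destination mapping each
    return reads, writes


rt = RenameTable(num_arch=4, num_phys=8)
print(rt.rename(1, [2, 3]))   # r1 = r2 op r3 -> (4, 1, [2, 3])
print(rt.rename(1, [1, 1]))   # r1 = r1 op r1 reads phys 4 -> (5, 4, [4, 4])
print(map_table_ports(6))     # (18, 6)
```

The second `rename` call shows why renaming removes false dependencies: the new write to `r1` gets a fresh physical register while its sources still see the earlier value.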

• For example, a machine with 8-wide issue, three-operand instructions, a 64-entry instruction queue, and 6-bit comparisons requires 9,216 one-bit comparators.
• This logic takes a large area to implement, and that area accounts for long delays.
• As a result, the instruction queue will limit performance.
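The 9,216 figure can be reproduced with simple arithmetic. The sketch below assumes the comparator count is the product of the four quantities on the slide (8 × 3 × 64 × 6 = 9,216); the function name is hypothetical. The second example is likewise hypothetical: it halves both the issue width and the queue size to show that, when the queue grows along with the issue width, the comparator count grows roughly with the square of the width.

```python
def wakeup_comparators(issue_width, operands_per_instr, queue_entries, tag_bits):
    """One-bit comparators in the issue-queue logic (product of the four factors)."""
    return issue_width * operands_per_instr * queue_entries * tag_bits


eight_wide = wakeup_comparators(8, 3, 64, 6)
four_wide = wakeup_comparators(4, 3, 32, 6)   # hypothetical half-size machine
print(eight_wide)               # 9216
print(four_wide)                # 2304
print(eight_wide // four_wide)  # 4: doubling width and queue quadruples the count
```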

• Wider issue requires more register renaming.
• The number of ports required to sustain the full instruction issue bandwidth also grows with issue width.
• A better way to add ports to the data cache is to build a banked cache.
• However, the added banking logic increases the cache access time.
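A banked cache gives several accesses per cycle by sending each address to a different single-ported bank. The Python sketch below is a hypothetical model, not the paper's design: it assumes line-interleaved banks (bank = line address modulo the number of banks) and counts how many cycles one cycle's worth of loads and stores needs when accesses to the same bank serialize. It does not model the extra selection and crossbar logic, which is what lengthens the access time mentioned on the slide.

```python
from collections import Counter


def bank_of(addr, num_banks, line_bytes):
    """Bank chosen by the low-order bits of the cache-line address."""
    return (addr // line_bytes) % num_banks


def cycles_for_accesses(addrs, num_banks, line_bytes):
    """Cycles to serve a group of accesses; the busiest single-ported bank decides."""
    counts = Counter(bank_of(a, num_banks, line_bytes) for a in addrs)
    return max(counts.values(), default=0)


# 8 banks, 32-byte lines (illustrative parameters only)
print(cycles_for_accesses([0, 32, 64, 96], 8, 32))  # 1: four different banks
print(cycles_for_accesses([0, 256, 512], 8, 32))    # 3: all map to bank 0
```

The design trade-off is visible even in this toy: banking multiplies bandwidth cheaply when addresses spread out, but conflicting addresses still serialize, unlike a true multi-ported cache.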

• To increase throughput.
• The growing, widespread use of multimedia and visualization.
• To execute in parallel multiple threads that come from a single application.
• To accelerate execution of sequential applications without manual intervention.

• The number of ports in each instruction buffer has increased by 50%, so the area of each buffer has grown by 30-40%.
• To support out-of-order execution, the instruction issue logic would need to occupy 30% of the die, but it has only 18%.
• The branch target buffer and call-return stack are enlarged to 2048 and 32 entries respectively, which increases branch prediction accuracy.

• The multiprocessor has four processors arranged in a grid.
• Each processor is less than one quarter the size of the six-issue superscalar processor.
• Each processor has its own instruction and data caches; the L2 cache is shared by all four processors.
• The L2 cache hit time is 5 cycles, versus 4 cycles for the six-issue superscalar processor.

• The superscalar architecture incurs high delays.
• For applications with fine-grained thread-level parallelism, the multiprocessor can exploit this parallelism so that the superscalar microarchitecture is at most 10% better, even at the same clock rate.
• For applications with large-grained thread-level parallelism and for multiprogramming workloads, the multiprocessor performs 50-100% better than the wide superscalar microarchitecture.

