Single-Chip Multiprocessor Nirmal Andrews

Case for single-chip multiprocessors
Advances in integrated circuit technology drive the case:
- Gate density (more transistors per chip)
- Cost of wires (wire delay increasingly dominates gate delay)
Studies at Stanford University during the late 90's argued that a CMP (single-chip multiprocessor) makes better use of these transistors than the competing technologies.

Parallelism
Parallelism has become a necessity for improving performance. It is exposed using dynamic scheduling, multiple instruction issue, speculative execution, non-blocking caches, etc. (state of the art in the late 90's). Parallelism classifications:
- Instruction level
- Loop level
- Thread level (future trend; see the sketch below)
- Process level (future trend)
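
To make thread-level parallelism concrete, the sketch below splits an array sum across two POSIX threads. This is an illustrative example, not from the original paper; the array size and thread count are arbitrary assumptions.

    /* Illustrative thread-level parallelism sketch (not from the paper).
     * Each thread sums its own half of the array into a private slot.
     * Compile with: cc sum.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define NTHREADS 2

    static double a[N];
    static double partial[NTHREADS];

    static void *sum_range(void *arg) {
        long t = (long)arg;                 /* thread index */
        long lo = t * (N / NTHREADS);
        long hi = (t + 1) * (N / NTHREADS);
        double s = 0.0;
        for (long i = lo; i < hi; i++)
            s += a[i];
        partial[t] = s;                     /* no sharing: one slot per thread */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < N; i++) a[i] = 1.0;
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, sum_range, (void *)t);
        double total = 0.0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += partial[t];
        }
        printf("sum = %f\n", total);
        return 0;
    }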

Loop-level parallelism
To increase the amount of available parallelism, exploit parallelism among the iterations of a loop. LLP is the ILP that results from data-independent loop iterations; it requires that there be no circular (loop-carried) dependence chain. Loop unrolling can also be used to expose this parallelism to the hardware (a minimal sketch follows; a full treatment is beyond the scope of this lecture).
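
A minimal loop-unrolling sketch, assuming the iterations are independent and n is a multiple of 4; unrolling replicates the body so a wide-issue machine sees four independent adds per iteration:

    /* Loop-unrolling sketch (assumes independent iterations and n % 4 == 0).
     * The four adds have no dependences among them, so the hardware can
     * schedule them in parallel. */
    void add_arrays(double *a, const double *b, int n) {
        for (int i = 0; i < n; i += 4) {
            a[i]     = a[i]     + b[i];
            a[i + 1] = a[i + 1] + b[i + 1];
            a[i + 2] = a[i + 2] + b[i + 2];
            a[i + 3] = a[i + 3] + b[i + 3];
        }
    }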

LLP (Loop-Level Parallelism)
Only about 15% ILP can be extracted from a single basic block of an integer program (a typical basic block is only about 7 instructions), so parallelism must be found across loop iterations:

    for (i = 1; i <= 100; i = i + 1) {
        a[i] = a[i] + b[i];        /* s1 */
        b[i+1] = c[i] + d[i];      /* s2 */
    }

s1 in iteration i+1 depends on s2 in iteration i (a loop-carried dependence through b[i+1]). To extract LLP, rearrange:

    a[1] = a[1] + b[1];
    for (i = 1; i <= 99; i = i + 1) {
        b[i+1] = c[i] + d[i];
        a[i+1] = a[i+1] + b[i+1];
    }
    b[101] = c[100] + d[100];

Now there is no loop-carried dependence; the remaining dependence lies within a single iteration.

Competing technology: superscalar
- Executes multiple instructions in the same clock cycle
- Dynamic scheduling
- Single processor with replicated (redundant) functional units
- Conceptually a mixture of a scalar and a vector processor

Competing technology: superscalar (figure)

Wide-issue superscalar (figure)

Fetch phase
The pipeline has 3 phases: fetch, issue, and execute. The bottlenecks are the issue and execute phases, but the fetch phase must still provide a large and accurate window of decoded instructions. Three problems limit it: instruction misalignment, cache misses, and mispredicted branches.
- Misprediction is reduced to under 5% using the combining branch predictor designed by McFarling* (a toy predictor sketch follows).
- Instruction misalignment is reduced to under 3% by dividing the cache into banks (Conte).
- Rosenblum et al. show that about 60% of the latency caused by cache misses can be hidden.**

* S. McFarling, "Combining branch predictors," WRL Technical Note TN-36, Digital Equipment Corporation, 1993.
** M. Rosenblum, E. Bugnion, S. Herrod, E. Witchel, and A. Gupta, "The impact of architectural trends on operating system performance," Proceedings of the 15th ACM Symposium on Operating Systems Principles, Colorado, December 1995.
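
To make the branch-prediction idea concrete, here is a toy two-bit saturating-counter predictor. It is a simplified stand-in for illustration only, not McFarling's combining predictor, which pairs two component predictors with a chooser table that votes between them; the table size is an arbitrary assumption.

    /* Toy two-bit saturating-counter branch predictor (illustrative only). */
    #include <stdint.h>
    #include <stdbool.h>

    #define TABLE_BITS 12
    #define TABLE_SIZE (1 << TABLE_BITS)

    static uint8_t counters[TABLE_SIZE];   /* 0..3: strong not-taken .. strong taken */

    static bool predict(uint32_t pc) {
        return counters[(pc >> 2) & (TABLE_SIZE - 1)] >= 2;   /* taken if 2 or 3 */
    }

    static void update(uint32_t pc, bool taken) {
        uint8_t *c = &counters[(pc >> 2) & (TABLE_SIZE - 1)];
        if (taken  && *c < 3) (*c)++;      /* saturate at strongly taken */
        if (!taken && *c > 0) (*c)--;      /* saturate at strongly not-taken */
    }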

Issue phase
The issue phase performs register renaming. Two techniques:
- Use a table to map architectural registers to physical registers. Ports required: (operands per instruction) x (instruction window size). (A minimal renaming sketch follows.)
- Use a reorder buffer. Comparators are required to determine which physical register should forward data to which instruction in the issue packet, and a large number of them is needed; in the HP PA-8000, for example, these comparators occupy a significant fraction of the die.
The instruction-queue logic grows quadratically as issue width increases, and the queue registers use broadcast to connect to the register file, increasing the wiring used and therefore delay and cost.
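
A minimal sketch of the map-table approach, with illustrative sizes and names (not from the paper): every instruction that writes a register grabs a fresh physical register from a free list, and later readers see the new mapping.

    /* Map-table register-renaming sketch. Sizes are illustrative assumptions.
     * Real hardware renames a whole packet of instructions per cycle, which
     * multiplies the number of map-table ports required. */
    #include <stdint.h>

    #define NUM_ARCH 32
    #define NUM_PHYS 64

    static int map[NUM_ARCH];              /* arch reg -> current phys reg */
    static int free_list[NUM_PHYS];
    static int free_top;

    static void rename_init(void) {
        for (int r = 0; r < NUM_ARCH; r++) map[r] = r;   /* identity map */
        free_top = 0;
        for (int p = NUM_ARCH; p < NUM_PHYS; p++)
            free_list[free_top++] = p;                   /* rest are free */
    }

    /* Rename one instruction "dst = src1 op src2". */
    static void rename(int dst, int src1, int src2,
                       int *pdst, int *psrc1, int *psrc2) {
        *psrc1 = map[src1];                /* read current mappings */
        *psrc2 = map[src2];
        *pdst  = free_list[--free_top];    /* allocate a fresh physical reg */
        map[dst] = *pdst;                  /* later readers see the new name */
    }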

Execute phase
The execute phase has similar issues. Increasing the issue width increases the number of renamed registers, leading to a quadratic increase in register-file complexity. Adding execution units causes a quadratic increase in the complexity of the bypass logic: with N units, each result must be forwardable to every unit's operand inputs, so the number of bypass paths grows as N squared. The bottleneck is the interconnect delay between execution units.

Superscalar architecture (figure)

Competing technology: Simultaneous Multithreading (SMT)
The SMT architecture is similar to the superscalar. SMT processors add hardware to a wide superscalar core so that it can execute instructions from multiple threads concurrently, which provides latency tolerance. When only one thread is available, an SMT reduces to a conventional wide-issue superscalar. (A toy slot-filling illustration follows.)
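
The toy simulation below illustrates the basic SMT effect: each cycle, issue slots are filled from whichever threads have ready instructions, so one thread's stalls do not leave the whole machine idle. Purely illustrative; the issue width and per-thread readiness model are made-up assumptions.

    /* Toy SMT slot-filling illustration (made-up readiness model). */
    #include <stdio.h>
    #include <stdlib.h>

    #define ISSUE_WIDTH 4
    #define CYCLES 100000

    static int run(int nthreads) {
        int issued = 0;
        for (int c = 0; c < CYCLES; c++) {
            int slots = ISSUE_WIDTH;
            for (int t = 0; t < nthreads && slots > 0; t++) {
                int ready = rand() % 3;            /* 0-2 ready instructions */
                int take = ready < slots ? ready : slots;
                issued += take;                    /* fill slots from this thread */
                slots -= take;
            }
        }
        return issued;
    }

    int main(void) {
        printf("1 thread : %.2f IPC\n", (double)run(1) / CYCLES);
        printf("4 threads: %.2f IPC\n", (double)run(4) / CYCLES);
        return 0;
    }

With one thread the machine averages about 1 instruction per cycle in this model; with four threads it fills most of the 4 issue slots, which is the latency-tolerance argument in miniature.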

Simultaneous Multithreading (SMT) (figure)

Centralized architecture
Disadvantages of centralized architectures such as SMT and superscalar:
- Area increases quadratically with the core's complexity.
- Cycle time increases: interconnect delays dominate the critical path of the CPU. It is possible to split the core into simpler clusters, but that deepens the pipeline and increases the branch-misprediction penalty.
- Design-verification cost is high, due to the complexity of a single large processor.
- Heavy demand on the memory system.

Single-chip multiprocessor
The disadvantages of the competing technologies motivate a decentralized architecture:
- Simple individual processors with a high clock rate
- Low interconnect latency
- Exploits thread-level and process-level parallelism

Single-chip multiprocessor architecture (figure)

Performance comparison
Example: the 8-core Cell processor in the PS3 versus the 3-core Xenon processor in the Xbox 360. (Performance chart: results for different benchmark programs.)

Summary (CMP)
A CMP (chip-level multiprocessor) provides superior performance with simpler hardware:
- No parallelism: superscalar performance is 30% better than CMP.
- Fine-grained thread-level parallelism: superscalar is 10% better in performance.
- Coarse-grained thread-level parallelism: CMP is 50-100% better than superscalar (per Olukotun et al.).
Disadvantages: slow when there is no multithreading, and it requires matching development of multithreaded software.

References
K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K.-Y. Chang, "The case for a single-chip multiprocessor," Proceedings of ASPLOS-VII (ACM SIGPLAN Notices), October 1996.
L. Hammond, B. A. Nayfeh, and K. Olukotun, "A Single-Chip Multiprocessor," IEEE Computer, September 1997.
Wikipedia, "Superscalar."
Wikipedia, "Multi-core."