.1 Multiprocessor on a Chip & Simultaneous Multi-threads [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

Slides:



Advertisements
Similar presentations
CS136, Advanced Architecture Limits to ILP Simultaneous Multithreading.
Advertisements

Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Lecture 6: Multicore Systems
Multithreading processors Adapted from Bhuyan, Patterson, Eggers, probably others.
Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.
Lecture 29 Computer Arch. From: Supercomputing in Plain English Part VII: Multicore Madness Henry Neeman, Director OU Supercomputing Center for Education.
/ Computer Architecture and Design Instructor: Dr. Michael Geiger Summer 2014 Lecture 6: Speculation.
Single-Chip Multiprocessor Nirmal Andrews. Case for single chip multiprocessors Advances in the field of integrated chip processing. - Gate density (More.
1 Burroughs B5500 multiprocessor. These machines were designed to support HLLs, such as Algol. They used a stack architecture, but part of the stack was.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Dec 5, 2005 Topic: Intro to Multiprocessors and Thread-Level Parallelism.
1 Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1)
1 Lecture 11: ILP Innovations and SMT Today: out-of-order example, ILP innovations, SMT (Sections 3.5 and supplementary notes)
SYNAR Systems Networking and Architecture Group CMPT 886: Architecture of Niagara I Processor Dr. Alexandra Fedorova School of Computing Science SFU.
Review: Multiprocessor Basics
CS 162 Computer Architecture Lecture 10: Multithreading Instructor: L.N. Bhuyan Adopted from Internet.
Multithreading and Dataflow Architectures CPSC 321 Andreas Klappenecker.
Chapter 17 Parallel Processing.
1 Lecture 12: ILP Innovations and SMT Today: ILP innovations, SMT, cache basics (Sections 3.5 and supplementary notes)
1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)
1 Lecture 26: Case Studies Topics: processor case studies, Flash memory Final exam stats:  Highest 83, median 67  70+: 16 students, 60-69: 20 students.
CS 7810 Lecture 24 The Cell Processor H. Peter Hofstee Proceedings of HPCA-11 February 2005.
Multi-core Processing The Past and The Future Amir Moghimi, ASIC Course, UT ECE.
Hyper-Threading, Chip multiprocessors and both Zoran Jovanovic.
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar.
Lecture 30Fall 2006 Computer Architecture Fall 2006 Lecture 30. CMPs & SMTs Adapted from Mary Jane Irwin ( ) [Adapted.
Multi-core architectures. Single-core computer Single-core CPU chip.
Lecture 27 Computer Architecture Adapted From: Supercomputing in Plain English Part VII: Multicore Madness Henry Neeman, Director OU Supercomputing Center.
1 Multi-core processors 12/1/09. 2 Multiprocessors inside a single chip It is now possible to implement multiple processors (cores) inside a single chip.
Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? Multithreaded and multicore processors Marco D. Santambrogio:
Hardware Multithreading. Increasing CPU Performance By increasing clock frequency By increasing Instructions per Clock Minimizing memory access impact.
CASH: REVISITING HARDWARE SHARING IN SINGLE-CHIP PARALLEL PROCESSOR
Lecture 28 Computer Arch. From: Supercomputing in Plain English Part VII: Multicore Madness Henry Neeman, Director OU Supercomputing Center for Education.
Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading? –a thread is defined as a separate process with its own instructions.
CSC 7080 Graduate Computer Architecture Lec 8 – Multiprocessors & Thread- Level Parallelism (3) – Sun T1 Dr. Khalaf Notes adapted from: David Patterson.
.1 Multiprocessor on a Chip & Simultaneous Multi-threads [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
1 Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)
Advanced Computer Architecture pg 1 Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8) Henk Corporaal
Computer Structure 2015 – Intel ® Core TM μArch 1 Computer Structure Multi-Threading Lihu Rappoport and Adi Yoaz.
On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252.
Introduction Goal: connecting multiple computers to get higher performance – Multiprocessors – Scalability, availability, power efficiency Job-level (process-level)
Carnegie Mellon /18-243: Introduction to Computer Systems Instructors: Anthony Rowe and Gregory Kesden 27 th (and last) Lecture, 28 April 2011 Multi-Core.
Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.
CSE431 L28 CMP&SMT.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 28. CMPs & SMTs Mary Jane Irwin ( )
COMP 740: Computer Architecture and Implementation
Electrical and Computer Engineering
/ Computer Architecture and Design
Simultaneous Multithreading
Lynn Choi School of Electrical Engineering
Simultaneous Multithreading
Multi-core processors
Computer Structure Multi-Threading
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
/ Computer Architecture and Design
Advanced Computer Architecture 5MD00 / 5Z033 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
Electrical and Computer Engineering
Advanced Computer Architecture 5MD00 / 5Z033 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
Lecture: SMT, Cache Hierarchies
Levels of Parallelism within a Single Processor
Limits to ILP Conflicting studies of amount
Hardware Multithreading
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor
How to improve (decrease) CPI
/ Computer Architecture and Design
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
The Vector-Thread Architecture
CSC3050 – Computer Architecture
Computer Systems Fall 2006 Lecture 28. CMPs & SMTs
Advanced Computer Architecture 5MD00 / 5Z032 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
Presentation transcript:

.1 Multiprocessor on a Chip & Simultaneous Multi-threads [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

.2 Review: Multiprocessor Basics # of Proc Communication model Message passing8 to 2048 Shared address NUMA8 to 256 UMA2 to 64 Physical connection Network8 to 256 Bus2 to 36 Q1 – How do they share data? Q2 – How do they coordinate? Q3 – How scalable is the architecture? How many processors?

.3 Multithreading: Interleave instructions from separate threads on the same hardware. Seen by OS as several CPUs. Multi-core: Integrating several processors that (partially) share a memory system on the same chip

.4 Recall: Bypass network prevents stalls Instead of bypass: Interleave threads on the pipeline to prevent stalls...

.5 CMP: Multiprocessors On One Chip By placing multiple processors, their memories and the Interface all on one chip, the latencies of chip-to-chip communication are drastically reduced –ARM multi-chip core Snoop Control Unit CPU L1$s CPU L1$s CPU L1$s CPU L1$s Interrupt Distributor CPU Interface CPU Interface CPU Interface CPU Interface Per-CPU aliased peripherals Configurable between 1 & 4 symmetric CPUs Private peripheral bus Configurable # of hardware intr Primary AXI R/W 64-b bus Optional AXI R/W 64-b bus Private IRQ

.6 Multithreading on A Chip Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions Multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor –Processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread –The caches, TLBs, BHT, BTB can be shared (although the miss rates may increase if they are not sized accordingly) –The memory can be shared through virtual memory mechanisms –Hardware must support efficient thread context switching

.7 Types of Multithreading Fine-grain – switch threads on every instruction issue –Round-robin thread interleaving (skipping stalled threads) –Processor must be able to switch threads on every clock cycle –Advantage – can hide throughput losses that come from both short and long stalls –Disadvantage – slows down the execution of an individual thread since a thread that is ready to execute without stalls is delayed by instructions from other threads Coarse-grain – switches threads only on costly stalls (e.g., L2 cache misses) –Advantages – thread switching doesn’t have to be essentially free and much less likely to slow down the execution of an individual thread –Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss Pipeline must be flushed and refilled on thread switches

.8 Multithreaded Example: Sun’s Niagara (UltraSparc T1) Eight fine grain multithreaded single-issue, in-order cores (no speculation, no dynamic branch prediction) Ultra IIINiagara Data width64-b Clock rate1.2 GHz1.0 GHz Cache (I/D/L2) 32K/64K/ (8M external) 16K/8K/3M Issue rate4 issue1 issue Pipe stages14 stages6 stages BHT entries16K x 2-bNone TLB entries128I/512D64I/64D Memory BW2.4 GB/s~20GB/s Transistors29 million200 million Power (max)53 W<60 W 4-way MT SPARC pipe Crossbar 4-way banked L2$ Memory controllers I/O shared funct’s

.9 Niagara Integer Pipeline Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient FetchThrd SelDecodeExecuteMemoryWB I$ ITLB Inst bufx4 PC logicx4 Decode RegFile x4 Thread Select Logic ALU Mul Shft Div D$ DTLB Stbufx4 Thrd Sel Mux Crossbar Interface Instr type Cache misses Traps & interrupts Resource conflicts From MPR, Vol. 18, #9, Sept. 2004

.10 Simultaneous Multithreading (SMT) A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (Super Scalar) to exploit both program ILP and thread- level parallelism (TLP) –Most Super Scalar processors have more machine level parallelism than most programs can effectively use (i.e., than have ILP) –With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them Need separate rename tables (ROBs) for each thread Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle Intel’s Pentium 4 SMT called hyperthreading –Supports just two threads (doubles the architecture state)

.11 Threading on a 4-way SS Processor Example Thread AThread B Thread CThread D Time → Issue slots → SMTFine MTCoarse MT

.12 Multicore Xbox360 – “Xenon” processor To provide game developers with a balanced and powerful platform –Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache –165M transistors total –3.2 Ghz Near-POWER ISA –2-issue, 21 stage pipeline, with bit registers –Weak branch prediction – supported by software hinting –In order instructions –Narrow cores – 2 INT units, bit VMX (colloquial term for SIMD) units, 1 of anything else A 500MZ GPU w/ 512MB of DDR3DRAM –337M transistors, 10MB framebuffer –48 pixel shader cores, each with 4 ALUs

.13 Xenon Diagram Core 0 L1D L1I Core 1 L1D L1I Core 2 L1D L1I 1MB UL2 512MB DRAM GPU BIU/IO Intf 3D Core 10MB EDRAM Video Out MC0 MC1 Analog Chip XMA Dec SMC DVD HDD Port Front USBs (2) Wireless MU ports (2 USBs) Rear USB (1) Ethernet IR Audio Out Flash Systems Control Video Out