CSE431 Chapter 7A.1Irwin, PSU, 2008 CSE 431 Computer Architecture Fall 2008 Chapter 7A: Intro to Multiprocessor Systems Mary Jane Irwin ( ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, © 2008, MK]

CSE431 Chapter 7A.2Irwin, PSU, 2008 The Big Picture: Where are We Now?  Multiprocessor – a computer system with at least two processors l Can deliver high throughput for independent jobs via job-level parallelism or process-level parallelism l And improve the run time of a single program that has been specially crafted to run on a multiprocessor - a parallel processing program Processor Cache Interconnection Network Memory I/O

CSE431 Chapter 7A.3Irwin, PSU, 2008 Multicores Now Common  The power challenge has forced a change in the design of microprocessors l Since 2002 the rate of improvement in the response time of programs has slowed from a factor of 1.5 per year to less than a factor of 1.2 per year  Today’s microprocessors typically contain more than one core – Chip Multicore microProcessors (CMPs) – in a single IC l The number of cores is expected to double every two years

Product          AMD Barcelona   Intel Nehalem   IBM Power 6   Sun Niagara 2
Cores per chip   4               4               2             8
Clock rate       2.5 GHz         ~2.5 GHz?       4.7 GHz       1.4 GHz
Power            120 W           ~100 W?         ?             94 W

CSE431 Chapter 7A.4Irwin, PSU, 2008 Other Multiprocessor Basics  Some of the problems that need higher performance can be handled simply by using a cluster – a set of independent servers (or PCs) connected over a local area network (LAN) functioning as a single large multiprocessor l Search engines, Web servers, email servers, databases, …  A key challenge is to craft parallel (concurrent) programs that have high performance on multiprocessors as the number of processors increases – i.e., that scale l Scheduling, load balancing, time for synchronization, overhead for communication

CSE431 Chapter 7A.5Irwin, PSU, 2008 Encountering Amdahl’s Law  Speedup due to enhancement E is Speedup w/ E = Exec time w/o E / Exec time w/ E  Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected ExTime w/ E = ExTime w/o E × ?  Speedup w/ E = ?

CSE431 Chapter 7A.6Irwin, PSU, 2008 Encountering Amdahl’s Law  Speedup due to enhancement E is Speedup w/ E = Exec time w/o E / Exec time w/ E  Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected ExTime w/ E = ExTime w/o E × ((1-F) + F/S) Speedup w/ E = 1 / ((1-F) + F/S)
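A minimal C sketch of this relationship (the function name speedup_with_E and the sample numbers are illustrative, not from the slides):

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction F of the execution time
 * is accelerated by a factor S and the remaining (1-F) is unaffected.   */
double speedup_with_E(double F, double S) {
    return 1.0 / ((1.0 - F) + F / S);
}

int main(void) {
    /* e.g., 25% of the time accelerated 20x (Example 1 on the next slide) */
    printf("%.2f\n", speedup_with_E(0.25, 20.0));   /* prints 1.31 */
    return 0;
}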

CSE431 Chapter 7A.7Irwin, PSU, 2008 Example 1: Amdahl’s Law  Consider an enhancement which runs 20 times faster but which is only usable 25% of the time. Speedup w/ E = ?  What if it’s usable only 15% of the time? Speedup w/ E = ?  Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!  To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less Speedup w/ E = ?

CSE431 Chapter 7A.8Irwin, PSU, 2008 Example 1: Amdahl’s Law  Consider an enhancement which runs 20 times faster but which is only usable 25% of the time. Speedup w/ E = 1/(0.75 + 0.25/20) = 1.31  What if it’s usable only 15% of the time? Speedup w/ E = 1/(0.85 + 0.15/20) = 1.17  Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!  To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less Speedup w/ E = 1/(0.001 + 0.999/100) = 90.99 Speedup w/ E = 1 / ((1-F) + F/S)

CSE431 Chapter 7A.9Irwin, PSU, 2008 Example 2: Amdahl’s Law  Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors Speedup w/ E = ?  What if there are 100 processors? Speedup w/ E = ?  What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors? Speedup w/ E = ?  What if there are 100 processors? Speedup w/ E = ? Speedup w/ E = 1 / ((1-F) + F/S)

CSE431 Chapter 7A.10Irwin, PSU, 2008 Example 2: Amdahl’s Law  Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors Speedup w/ E = 1/(0.091 + 0.909/10) = 1/0.1819 = 5.5  What if there are 100 processors? Speedup w/ E = 1/(0.091 + 0.909/100) = 1/0.1001 = 10.0  What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors? Speedup w/ E = 1/(0.001 + 0.999/10) = 1/0.1009 = 9.9  What if there are 100 processors? Speedup w/ E = 1/(0.001 + 0.999/100) = 1/0.011 = 91 Speedup w/ E = 1 / ((1-F) + F/S)
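As a quick check, the speedup_with_E() helper from the sketch after slide 7A.6 reproduces these numbers when F is taken as the parallelizable fraction of the additions (the fractions below are worked out from the add counts in the slide; this is a usage fragment meant to sit inside that sketch's main()):

/* 10 scalar adds stay serial; only the matrix adds are spread over the processors */
printf("%.1f\n", speedup_with_E(100.0/110.0, 10.0));      /* 10x10 matrices, 10 procs   -> 5.5  */
printf("%.1f\n", speedup_with_E(100.0/110.0, 100.0));     /* 10x10 matrices, 100 procs  -> 10.0 */
printf("%.1f\n", speedup_with_E(10000.0/10010.0, 10.0));  /* 100x100 matrices, 10 procs -> 9.9  */
printf("%.1f\n", speedup_with_E(10000.0/10010.0, 100.0)); /* 100x100 matrices, 100 procs -> 91.0 */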

CSE431 Chapter 7A.11Irwin, PSU, 2008 Scaling  Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem. l Strong scaling – when speedup can be achieved on a multiprocessor without increasing the size of the problem l Weak scaling – when speedup is achieved on a multiprocessor by increasing the size of the problem proportionally to the increase in the number of processors  Load balancing is another important factor. Just a single processor with twice the load of the others cuts the speedup almost in half

CSE431 Chapter 7A.12Irwin, PSU, 2008 Multiprocessor/Clusters Key Questions  Q1 – How do they share data?  Q2 – How do they coordinate?  Q3 – How scalable is the architecture? How many processors can be supported?

CSE431 Chapter 7A.13Irwin, PSU, 2008 Shared Memory Multiprocessor (SMP)  Q1 – Single address space shared by all processors  Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores) l Use of shared data must be coordinated via synchronization primitives (locks) that allow access to data to only one processor at a time  They come in two styles l Uniform memory access (UMA) multiprocessors l Nonuniform memory access (NUMA) multiprocessors  Programming NUMAs is harder  But NUMAs can scale to larger sizes and have lower latency to local memory

CSE431 Chapter 7A.14Irwin, PSU, 2008 Summing 100,000 Numbers on 100 Proc. SMP

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
  sum[Pn] = sum[Pn] + A[i];

 Processors start by running a loop that sums their subset of vector A numbers (vectors A and sum are shared variables, Pn is the processor’s number, i is a private variable)  The processors then coordinate in adding together the partial sums (half is a private variable initialized to 100 (the number of processors)) – reduction

repeat
  synch();                      /* synchronize first */
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
  half = half/2;
  if (Pn < half)
    sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);              /* final sum in sum[0] */
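A runnable pthreads rendering of this reduction is sketched below. The array and variable names follow the slide; the pthreads scaffolding (pthread_barrier_wait standing in for synch(), the worker/main structure, the test data) is an assumption, not part of the slide. Compile with cc -pthread.

#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdio.h>

#define NPROC 100
#define N     100000

double A[N], sum[NPROC];                 /* shared, as in the slide */
pthread_barrier_t barrier;

void *worker(void *arg) {
    int Pn = (int)(long)arg;             /* this "processor's" number */
    sum[Pn] = 0;
    for (int i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];        /* sum my subset of vector A */

    int half = NPROC;                    /* private, initialized to the number of processors */
    do {
        pthread_barrier_wait(&barrier);              /* synch() */
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];           /* fold the odd element into sum[0] */
        half = half / 2;
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    } while (half > 1);                              /* until (half == 1) */
    return NULL;
}

int main(void) {
    pthread_t t[NPROC];
    for (int i = 0; i < N; i++) A[i] = 1.0;          /* test data: the total should be 100000 */
    pthread_barrier_init(&barrier, NULL, NPROC);
    for (long p = 0; p < NPROC; p++) pthread_create(&t[p], NULL, worker, (void *)p);
    for (int p = 0; p < NPROC; p++) pthread_join(t[p], NULL);
    printf("total = %g\n", sum[0]);                  /* final sum in sum[0] */
    return 0;
}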

CSE431 Chapter 7A.15Irwin, PSU, 2008 An Example with 10 Processors [Figure: ten processors P0–P9, each holding its partial sum sum[P0]–sum[P9]; half = 10]

CSE431 Chapter 7A.16Irwin, PSU, 2008 An Example with 10 Processors [Figure: the reduction tree – with half = 10, P0–P4 add in the partial sums from P5–P9; with half = 5, P0 folds in sum[P4] and P0–P1 add in the sums from P2–P3; with half = 2, P0 adds in sum[P1]; half = 1 leaves the final sum in sum[P0]]

CSE431 Chapter 7A.17Irwin, PSU, 2008 Process Synchronization  Need to be able to coordinate processes working on a common task  Lock variables (semaphores) are used to coordinate or synchronize processes  Need an architecture-supported arbitration mechanism to decide which processor gets access to the lock variable l Single bus provides arbitration mechanism, since the bus is the only path to memory – the processor that gets the bus wins  Need an architecture-supported operation that locks the variable l Locking can be done via an atomic swap operation (on the MIPS we have ll and sc), one example of where a processor can both read a location and set it to the locked state – test-and-set – in the same bus operation

CSE431 Chapter 7A.18Irwin, PSU, 2008 Spin Lock Synchronization [Flowchart:] Read the lock variable using ll; if it is not unlocked (not 0), spin and read it again. If it is unlocked (= 0), try to lock the variable using sc: set it to the locked value of 1 (the ll/sc pair makes this an atomic operation). If the sc succeeds, begin the update of the shared data; when the update of the shared data is finished, unlock the variable: set the lock variable back to 0. If the sc fails (return code = 0), go back to spinning on the lock read. The single winning processor will succeed in writing a 1 to the lock variable – all other processors will get a return code of 0
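This protocol can be sketched portably in C, with C11's atomic_exchange standing in for the MIPS ll/sc atomic swap (the acquire/release names and the 0 = unlocked, 1 = locked convention are illustrative assumptions; real MIPS code would use ll/sc directly):

#include <stdatomic.h>

void acquire(atomic_int *lock) {           /* spin until we own the lock */
    for (;;) {
        while (atomic_load(lock) != 0)     /* read the lock variable ...        */
            ;                              /* ... and spin while it is locked   */
        if (atomic_exchange(lock, 1) == 0) /* try to lock it: atomic swap of 1  */
            return;                        /* old value was 0: we won the race  */
        /* old value was 1: another processor locked it first, spin again */
    }
}

void release(atomic_int *lock) {
    atomic_store(lock, 0);                 /* unlock: set lock variable to 0 */
}

A critical section then looks like acquire(&lock); /* update shared data */ release(&lock); with atomic_int lock = 0; shared by all threads.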

CSE431 Chapter 7A.19Irwin, PSU, 2008 Review: Summing Numbers on a SMP

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
  sum[Pn] = sum[Pn] + A[i];     /* each processor sums its subset of vector A */

 Pn is the processor’s number, vectors A and sum are shared variables, i is a private variable, half is a private variable initialized to the number of processors

repeat                          /* adding together the partial sums */
  synch();                      /* synchronize first */
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
  half = half/2;
  if (Pn < half)
    sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);              /* final sum in sum[0] */

CSE431 Chapter 7A.20Irwin, PSU, 2008 An Example with 10 Processors [Figure: the 10-processor reduction tree again, marking where the processors must synchronize]  synch() : Processors must synchronize before the “consumer” processor tries to read the results from the memory location written by the “producer” processor l Barrier synchronization – a synchronization scheme where processors wait at the barrier, not proceeding until every processor has reached it

CSE431 Chapter 7A.21Irwin, PSU, 2008 Barrier Implemented with Spin-Locks

procedure synch()
  lock(arrive);
  count := count + 1;           /* count the processors as they arrive at the barrier */
  if count < n
    then unlock(arrive)
    else unlock(depart);
  lock(depart);
  count := count - 1;           /* count the processors as they leave the barrier */
  if count > 0
    then unlock(depart)
    else unlock(arrive);

 n is a shared variable initialized to the number of processors, count is a shared variable initialized to 0, arrive and depart are shared spin-lock variables where arrive is initially unlocked and depart is initially locked
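A C sketch of this two-phase barrier, reusing the acquire()/release() spin-lock helpers from the sketch above as lock()/unlock() (NPROC and the static variable declarations are assumptions; the arrive/depart logic itself follows the slide):

#include <stdatomic.h>

#define NPROC 10
static atomic_int arrive = 0;    /* spin-lock variable, initially unlocked */
static atomic_int depart = 1;    /* spin-lock variable, initially locked   */
static int count = 0;            /* shared; only touched while holding arrive or depart */

void synch(void) {
    acquire(&arrive);                        /* arrival phase: count processors in  */
    count = count + 1;
    if (count < NPROC) release(&arrive);     /* let the next processor arrive       */
    else               release(&depart);     /* last arriver opens the exit         */

    acquire(&depart);                        /* departure phase: count processors out */
    count = count - 1;
    if (count > 0) release(&depart);         /* let the next processor leave        */
    else           release(&arrive);         /* last one out re-arms the barrier    */
}

Because arrive stays locked while any processor is still departing, the arrival and departure phases never overlap and the barrier can be reused immediately.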

CSE431 Chapter 7A.22Irwin, PSU, 2008 Spin-Locks on Bus Connected ccUMAs  With a bus based cache coherency protocol (write invalidate), spin-locks allow processors to wait on a local copy of the lock in their caches l Reduces bus traffic – once the processor with the lock releases the lock (writes a 0) all other caches see that write and invalidate their old copy of the lock variable. Unlocking restarts the race to get the lock. The winner gets the bus and writes the lock back to 1. The other caches then invalidate their copy of the lock and on the next lock read fetch the new lock value (1) from memory.  This scheme has problems scaling up to many processors because of the communication traffic when the lock is released and contested

CSE431 Chapter 7A.23Irwin, PSU, 2008 Aside: Cache Coherence Bus Traffic

Step  Proc P0             Proc P1                  Proc P2                  Bus activity                  Memory
1     Has lock            Spins                    Spins                    None
2     Releases lock (0)   Spins                    Spins                    Bus services P0’s invalidate
3                         Cache miss               Cache miss               Bus services P2’s cache miss
4                         Waits                    Reads lock (0)           Response to P2’s cache miss   Update lock in memory from P0
5                         Reads lock (0)           Swaps lock (ll,sc of 1)  Bus services P1’s cache miss
6                         Swaps lock (ll,sc of 1)  Swap succeeds            Response to P1’s cache miss   Sends lock variable to P1
7                         Swap fails               Has lock                 Bus services P2’s invalidate
8                         Spins                    Has lock                 Bus services P1’s cache miss

CSE431 Chapter 7A.24Irwin, PSU, 2008 Message Passing Multiprocessors (MPP)  Each processor has its own private address space  Q1 – Processors share data by explicitly sending and receiving information (message passing)  Q2 – Coordination is built into message passing primitives (message send and message receive) Processor Cache Interconnection Network Memory

CSE431 Chapter 7A.25Irwin, PSU, 2008 Summing 100,000 Numbers on 100 Proc. MPP

sum = 0;
for (i = 0; i < 1000; i = i + 1)
  sum = sum + Al[i];            /* sum local array subset */

 Start by distributing 1000 elements of vector A to each of the local memories and summing each subset in parallel  The processors then coordinate in adding together the sub sums (Pn is the processor’s number, send(x,y) sends value y to processor x, and receive() receives a value)

half = 100;
limit = 100;
repeat
  half = (half+1)/2;            /* dividing line */
  if (Pn >= half && Pn < limit) send(Pn-half, sum);
  if (Pn < (limit/2)) sum = sum + receive();
  limit = half;
until (half == 1);              /* final sum in P0’s sum */
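A hedged MPI sketch of the same reduction, using MPI_Send/MPI_Recv as stand-ins for the slide’s abstract send()/receive() (the test data and the choice of MPI rather than the slide’s own primitives are assumptions); compile with mpicc and run with 100 ranks, e.g. mpirun -np 100:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int Pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);      /* this processor's number */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* 100 in the slide's example */

    double Al[1000], sum = 0.0, partial;
    for (int i = 0; i < 1000; i++) Al[i] = 1.0;     /* test data */
    for (int i = 0; i < 1000; i++) sum += Al[i];    /* sum local array subset */

    int half = nprocs, limit = nprocs;
    do {
        half = (half + 1) / 2;                      /* dividing line */
        if (Pn >= half && Pn < limit)               /* upper half sends its partial sum down */
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {                       /* lower half receives and accumulates */
            MPI_Recv(&partial, 1, MPI_DOUBLE, Pn + half, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += partial;
        }
        limit = half;
    } while (half > 1);

    if (Pn == 0) printf("total = %g\n", sum);       /* final sum in P0's sum */
    MPI_Finalize();
    return 0;
}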

CSE431 Chapter 7A.26Irwin, PSU, 2008 An Example with 10 Processors [Figure: ten processors P0–P9, each with its own local sum; half = 10]

CSE431 Chapter 7A.27Irwin, PSU, 2008 An Example with 10 Processors [Figure: the send/receive reduction tree – limit = 10, half = 5: P5–P9 send to P0–P4; limit = 5, half = 3: P3–P4 send to P0–P1; limit = 3, half = 2: P2 sends to P0; limit = 2, half = 1: P1 sends to P0, leaving the final sum in P0]

CSE431 Chapter 7A.28Irwin, PSU, 2008 Pros and Cons of Message Passing  Message sending and receiving is much slower than addition, for example  But message passing multiprocessors are much easier for hardware designers to design l Don’t have to worry about cache coherency, for example  The advantage for programmers is that communication is explicit, so there are fewer “performance surprises” than with the implicit communication in cache-coherent SMPs. l Message passing standard MPI-2 ( )  However, it’s harder to port a sequential program to a message passing multiprocessor since every communication must be identified in advance. l With cache-coherent shared memory the hardware figures out what data needs to be communicated

CSE431 Chapter 7A.29Irwin, PSU, 2008 Networks of Workstations (NOWs) Clusters  Clusters of off-the-shelf, whole computers with multiple private address spaces connected using the I/O bus of the computers l lower bandwidth than multiprocessors that use the processor-memory (front side) bus l lower speed network links l more conflicts with I/O traffic  Clusters of N processors have N copies of the OS, limiting the memory available for applications  Improved system availability and expandability l easier to replace a machine without bringing down the whole system l allows rapid, incremental expandability  Economy-of-scale advantages with respect to costs

CSE431 Chapter 7A.30Irwin, PSU, 2008 Commercial (NOW) Clusters

Cluster           Proc            Proc Speed  # Proc   Network
Dell PowerEdge    P4 Xeon         3.06 GHz    2,500    Myrinet
eServer IBM SP    Power4          1.7 GHz     2,944
VPI BigMac        Apple G5        2.3 GHz     2,200    Mellanox Infiniband
HP ASCI Q         Alpha           ? GHz       8,192    Quadrics
LLNL Thunder      Intel Itanium2  1.4 GHz     1,024*4  Quadrics
Barcelona         PowerPC         ? GHz       4,536    Myrinet

CSE431 Chapter 7A.31Irwin, PSU, 2008 Multithreading on A Chip  Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions  Hardware multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor l Processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread l The caches, TLBs, BHT, BTB, RUU can be shared (although the miss rates may increase if they are not sized accordingly) l The memory can be shared through virtual memory mechanisms l Hardware must support efficient thread context switching

CSE431 Chapter 7A.32Irwin, PSU, 2008 Types of Multithreading  Fine-grain – switch threads on every instruction issue l Round-robin thread interleaving (skipping stalled threads) l Processor must be able to switch threads on every clock cycle l Advantage – can hide throughput losses that come from both short and long stalls l Disadvantage – slows down the execution of an individual thread since a thread that is ready to execute without stalls is delayed by instructions from other threads  Coarse-grain – switches threads only on costly stalls (e.g., L2 cache misses) l Advantages – thread switching doesn’t have to be essentially free and is much less likely to slow down the execution of an individual thread l Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss -Pipeline must be flushed and refilled on thread switches

CSE431 Chapter 7A.33Irwin, PSU, 2008 Multithreaded Example: Sun’s Niagara (UltraSparc T2)  Eight fine grain multithreaded single-issue, in-order cores (no speculation, no dynamic branch prediction)

                 Niagara 2
Data width       64-b
Clock rate       1.4 GHz
Cache (I/D/L2)   16K/8K/4M
Issue rate       1 issue
Pipe stages      6 stages
BHT entries      None
TLB entries      64I/64D
Memory BW        60+ GB/s
Transistors      ??? million
Power (max)      <95 W

[Figure: die block diagram – 8-way MT SPARC pipes, crossbar, 8-way banked L2$, memory controllers, shared I/O and functional units]

CSE431 Chapter 7A.34Irwin, PSU, 2008 Niagara Integer Pipeline  Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient [Figure: six-stage pipeline – Fetch (I$, ITLB, inst buf x8, PC logic x8), Thread Select (thread select logic and mux, driven by instruction type, cache misses, traps & interrupts, and resource conflicts), Decode (decode, register file x8), Execute (ALU, Mul, Shft, Div), Memory (D$, DTLB, store buf x8), WB (crossbar interface). From MPR, Vol. 18, #9, Sept. 2004]

CSE431 Chapter 7A.35Irwin, PSU, 2008 Simultaneous Multithreading (SMT)  A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP) l Most SS processors have more machine level parallelism than most programs can effectively use (i.e., than have ILP) l With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them -Need separate rename tables (RUUs) for each thread or need to be able to indicate which thread the entry belongs to -Need the capability to commit from multiple threads in one cycle  Intel’s Pentium 4 SMT is called hyperthreading l Supports just two threads (doubles the architecture state)

CSE431 Chapter 7A.36Irwin, PSU, 2008 Threading on a 4-way SS Processor Example [Figure: issue slots over time for four independent threads A–D]

CSE431 Chapter 7A.37Irwin, PSU, 2008 Threading on a 4-way SS Processor Example [Figure: how the four threads’ instructions fill the 4-way issue slots under coarse MT, fine MT, and SMT]

CSE431 Chapter 7A.38Irwin, PSU, 2008 Review: Multiprocessor Basics

                                  # of Proc
Communication model
  Message passing                 8 to 2048
  Shared address    NUMA          8 to 256
                    UMA           2 to 64
Physical connection
  Network                         8 to 256
  Bus                             2 to 36

 Q1 – How do they share data?  Q2 – How do they coordinate?  Q3 – How scalable is the architecture? How many processors?

CSE431 Chapter 7A.39Irwin, PSU, 2008 Next Lecture and Reminders  Next lecture l Multiprocessor architectures -Reading assignment – PH, Chapter PH  Reminders l HW5 due November 13th l HW6 out November 13th and due December 11th l Check grade posting on-line (by your midterm exam number) for correctness l Second evening midterm exam scheduled -Tuesday, November 18, 20:15 to 22:15, Location 262 Willard -Please let me know ASAP (via ) if you have a conflict