Presentation is loading. Please wait.

Presentation is loading. Please wait.

CSE431 L28 CMP&SMT.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 28. CMPs & SMTs Mary Jane Irwin ( )www.cse.psu.edu/~mji.

Similar presentations


Presentation on theme: "CSE431 L28 CMP&SMT.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 28. CMPs & SMTs Mary Jane Irwin ( )www.cse.psu.edu/~mji."— Presentation transcript:

1 CSE431 L28 CMP&SMT.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 28. CMPs & SMTs Mary Jane Irwin ( www.cse.psu.edu/~mji )www.cse.psu.edu/~mji www.cse.psu.edu/~cg431 [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

2 CSE431 L28 CMP&SMT.2Irwin, PSU, 2005 Review: Multiprocessor Basics # of Proc Communication model Message passing8 to 2048 Shared address NUMA8 to 256 UMA2 to 64 Physical connection Network8 to 256 Bus2 to 36  Q1 – How do they share data?  Q2 – How do they coordinate?  Q3 – How scalable is the architecture? How many processors?

3 CSE431 L28 CMP&SMT.3Irwin, PSU, 2005 CMP: Multiprocessors On One Chip  By placing multiple processors, their memories and the IN all on one chip, the latencies of chip-to-chip communication are drastically reduced l ARM multi-chip core Snoop Control Unit CPU L1$s CPU L1$s CPU L1$s CPU L1$s Interrupt Distributor CPU Interface CPU Interface CPU Interface CPU Interface Per-CPU aliased peripherals Configurable between 1 & 4 symmetric CPUs Private peripheral bus Configurable # of hardware intr Primary AXI R/W 64-b bus Optional AXI R/W 64-b bus I & D 64-b bus CCB Private IRQ

4 CSE431 L28 CMP&SMT.4Irwin, PSU, 2005 Multithreading on A Chip  Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions  Multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor l Processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread l The caches, TLBs, BHT, BTB can be shared (although the miss rates may increase if they are not sized accordingly) l The memory can be shared through virtual memory mechanisms l Hardware must support efficient thread context switching

5 CSE431 L28 CMP&SMT.5Irwin, PSU, 2005 Types of Multithreading  Fine-grain – switch threads on every instruction issue l Round-robin thread interleaving (skipping stalled threads) l Processor must be able to switch threads on every clock cycle l Advantage – can hide throughput losses that come from both short and long stalls l Disadvantage – slows down the execution of an individual thread since a thread that is ready to execute without stalls is delayed by instructions from other threads  Coarse-grain – switches threads only on costly stalls (e.g., L2 cache misses) l Advantages – thread switching doesn’t have to be essentially free and much less likely to slow down the execution of an individual thread l Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss -Pipeline must be flushed and refilled on thread switches

6 CSE431 L28 CMP&SMT.6Irwin, PSU, 2005 Multithreaded Example: Sun’s Niagara (UltraSparc T1)  Eight fine grain multithreaded single-issue, in-order cores (no speculation, no dynamic branch prediction) Ultra IIINiagara Data width64-b Clock rate1.2 GHz1.0 GHz Cache (I/D/L2) 32K/64K/ (8M external) 16K/8K/3M Issue rate4 issue1 issue Pipe stages14 stages6 stages BHT entries16K x 2-bNone TLB entries128I/512D64I/64D Memory BW2.4 GB/s~20GB/s Transistors29 million200 million Power (max)53 W<60 W 4-way MT SPARC pipe Crossbar 4-way banked L2$ Memory controllers I/O shared funct’s

7 CSE431 L28 CMP&SMT.7Irwin, PSU, 2005 Niagara Integer Pipeline  Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient FetchThrd SelDecodeExecuteMemoryWB I$ ITLB Inst bufx4 PC logicx4 Decode RegFile x4 Thread Select Logic ALU Mul Shft Div D$ DTLB Stbufx4 Thrd Sel Mux Crossbar Interface Instr type Cache misses Traps & interrupts Resource conflicts From MPR, Vol. 18, #9, Sept. 2004

8 CSE431 L28 CMP&SMT.8Irwin, PSU, 2005 Simultaneous Multithreading (SMT)  A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread- level parallelism (TLP) l Most SS processors have more machine level parallelism than most programs can effectively use (i.e., than have ILP) l With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them -Need separate rename tables (ROBs) for each thread -Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle  Intel’s Pentium 4 SMT called hyperthreading l Supports just two threads (doubles the architecture state)

9 CSE431 L28 CMP&SMT.9Irwin, PSU, 2005 Threading on a 4-way SS Processor Example Thread AThread B Thread CThread D Time → Issue slots → SMTFine MTCoarse MT

10 CSE431 L28 CMP&SMT.10Irwin, PSU, 2005 Multicore Xbox360 – “Xenon” processor  To provide game developers with a balanced and powerful platform l Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache l 165M transistors total l 3.2 Ghz Near-POWER ISA l 2-issue, 21 stage pipeline, with 128 128-bit registers l Weak branch prediction – supported by software hinting l In order instructions l Narrow cores – 2 INT units, 2 128-bit VMX units, 1 of anything else  An ATI-designed 500MZ GPU w/ 512MB of DDR3DRAM l 337M transistors, 10MB framebuffer l 48 pixel shader cores, each with 4 ALUs

11 CSE431 L28 CMP&SMT.11Irwin, PSU, 2005 Xenon Diagram Core 0 L1D L1I Core 1 L1D L1I Core 2 L1D L1I 1MB UL2 512MB DRAM GPU BIU/IO Intf 3D Core 10MB EDRAM Video Out MC0 MC1 Analog Chip XMA Dec SMC DVD HDD Port Front USBs (2) Wireless MU ports (2 USBs) Rear USB (1) Ethernet IR Audio Out Flash Systems Control Video Out

12 CSE431 L28 CMP&SMT.12Irwin, PSU, 2005 The PS3 “Cell” Processor Architecture  Composed of a Non-SMP Architecture l 234M transistors @ 4Ghz l 1 Power Processing Element, 8 “Synergistic” (SIMD) PE’s l 512KB L2 $ - Massively high bandwidth (200GB/s) bus connects it to everything else l The PPE is strangely similar to one of the Xenon cores -Almost identical, really. Slight ISA differences, and fine-grained MT instead of real SMT l The real differences lie in the SPEs (21M transistors each) -An attempt to ‘fix’ the memory latency problem by giving each processor complete control over it’s own 256KB “scratchpad” – 14M transistors –Direct mapped for low latency -4 vector units per SPE, 1 of everything else – 7M trans.

13 CSE431 L28 CMP&SMT.13Irwin, PSU, 2005 How to make use of the SPEs

14 CSE431 L28 CMP&SMT.14Irwin, PSU, 2005 What about the Software?  Makes use of special IBM “Hypervisor” l Like an OS for OS’s l Runs both a real time OS (for sound) and non-real time (for things like AI)  Software must be specially coded to run well l The single PPE will be quickly bogged down l Must make use of SPEs wherever possible l This isn’t easy, by any standard  What about Microsoft? l Development suite identifies which 6 threads you’re expected to run l Four of them are DirectX based, and handled by the OS l Only need to write two threads, functionally

15 CSE431 L28 CMP&SMT.15Irwin, PSU, 2005 Next Lecture and Reminders  Next lecture -Reading assignment – none (or all)  Reminders l Check grade posting on-line (by your midterm exam number) for correctness l Final exam (tentatively) schedule -Tuesday, December 13th, 2:30-4:20, 22 Deike


Download ppt "CSE431 L28 CMP&SMT.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 28. CMPs & SMTs Mary Jane Irwin ( )www.cse.psu.edu/~mji."

Similar presentations


Ads by Google