Lecture 29 Computer Systems From: Supercomputing in Plain English Part VII: Multicore Madness Henry Neeman, Director OU Supercomputing Center for Education.

Presentation transcript:

Lecture 29 Computer Systems From: Supercomputing in Plain English Part VII: Multicore Madness Henry Neeman, Director OU Supercomputing Center for Education & Research University of Oklahoma Wednesday October 17 2007

Outline
  The March of Progress
  Multicore/Many-core Basics
  Software Strategies for Multicore/Many-core
  A Concrete Example: Weather Forecasting

The March of Progress

OU’s TeraFLOP Cluster, 2002
  10 racks @ 1000 lbs per rack
  270 Pentium4 Xeon CPUs, 2.0 GHz, 512 KB L2 cache
  270 GB RAM, 400 MHz FSB
  8 TB disk
  Myrinet2000 Interconnect
  100 Mbps Ethernet Interconnect
  OS: Red Hat Linux
  Peak speed: 1.08 TFLOP/s (1.08 trillion calculations per second)
  One of the first Pentium4 clusters!
  boomer.oscer.ou.edu
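The peak speed quoted for this cluster can be reproduced with simple arithmetic. The sketch below assumes 2 floating point operations per clock cycle per CPU; that figure is not stated on the slide, but it is consistent with the 1.08 TFLOP/s peak. The program and variable names are illustrative only.

```fortran
program peak_speed
  implicit none
  ! Values taken from the slide: 270 CPUs running at 2.0 GHz.
  integer, parameter :: num_cpus = 270
  real, parameter :: clock_ghz = 2.0
  ! ASSUMPTION (not on the slide): 2 floating point operations
  ! per clock cycle per CPU.
  real, parameter :: flops_per_cycle = 2.0
  real :: peak_tflops

  ! CPUs x GHz x FLOPs per cycle gives GFLOP/s; divide by 1000 for TFLOP/s.
  peak_tflops = real(num_cpus) * clock_ghz * flops_per_cycle / 1000.0
  print '(A, F5.2, A)', 'Peak speed: ', peak_tflops, ' TFLOP/s'
end program peak_speed
```

Under that assumption the result is 270 x 2.0 x 2 = 1080 GFLOP/s, which matches the 1.08 TFLOP/s on the slide.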

TeraFLOP, Prototype 2006, Sale 2011: 9 years from room to chip! http://news.com.com/2300-1006_3-6119652.html

Moore’s Law
In 1965, Gordon Moore was an engineer at Fairchild Semiconductor. He noticed that the number of transistors that could be squeezed onto a chip was doubling about every 18 months. It turns out that computer speed is roughly proportional to the number of transistors per unit area. Moore wrote a paper about this concept, which became known as “Moore’s Law.”

Moore’s Law in Practice
[Figure: log(Speed) versus Year, with curves added one at a time for CPU, Network Bandwidth, RAM, 1/Network Latency, and Software]

Fastest Supercomputer vs. Moore [Figure; source: www.top500.org]

The Tyranny of the Storage Hierarchy

The Storage Hierarchy [5] [6]
From fast, expensive, few (top) to slow, cheap, a lot (bottom):
  Registers
  Cache memory
  Main memory (RAM)
  Hard disk
  Removable media (e.g., DVD)
  Internet

RAM is Slow
The speed of data transfer between Main Memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.
  CPU: 351 GB/sec [7]
  Bottleneck (RAM): 10.66 GB/sec [9] (3%)

Why Have Cache?
Cache is nearly the same speed as the CPU, so the CPU doesn’t have to wait nearly as long for stuff that’s already in cache: it can do more operations per second!
  CPU: 351 GB/sec [7]
  Cache: 253 GB/sec [8] (72%)
  RAM: 10.66 GB/sec [9] (3%)

Storage Use Strategies
  Register reuse: Do a lot of work on the same data before working on new data.
  Cache reuse: The program is much more efficient if all of the data and instructions fit in cache; if not, try to use what’s in cache a lot before using anything that isn’t in cache.
  Data locality: Try to access data that are near each other in memory before data that are far.
  I/O efficiency: Do a bunch of I/O all at once rather than a little bit at a time; don’t mix calculations and I/O.

A Concrete Example
OSCER’s big cluster, topdawg, has Irwindale CPUs: single core, 3.2 GHz, 800 MHz Front Side Bus.
The theoretical peak CPU speed is 6.4 GFLOPs (double precision) per CPU, and in practice we’ve gotten as high as 94% of that. So, in theory each CPU could consume 143 GB/sec.
The theoretical peak RAM bandwidth is 6.4 GB/sec, but in practice we get about half that.
So, any code that does less than 45 calculations per byte transferred between RAM and cache has speed limited by RAM bandwidth.
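The threshold of 45 appears to come from dividing the data rate the CPU could consume by the RAM bandwidth actually achieved. The sketch below reproduces that arithmetic; it takes "about half" of the peak RAM bandwidth as exactly one half, which is an assumption, and the names are illustrative.

```fortran
program ram_bound
  implicit none
  ! Figures taken from the slide.
  real, parameter :: cpu_demand_gb_per_sec = 143.0
  real, parameter :: ram_peak_gb_per_sec   = 6.4
  ! ASSUMPTION: "about half that" is treated as exactly 0.5.
  real, parameter :: ram_fraction = 0.5
  real :: ram_actual_gb_per_sec, ratio

  ram_actual_gb_per_sec = ram_peak_gb_per_sec * ram_fraction
  ! How many times faster the CPU can consume data than RAM can supply it.
  ratio = cpu_demand_gb_per_sec / ram_actual_gb_per_sec

  print '(A, F5.2, A)', 'Achieved RAM bandwidth: ', ram_actual_gb_per_sec, ' GB/sec'
  print '(A, F6.2)',    'CPU demand / RAM supply: ', ratio
end program ram_bound
```

With these inputs the ratio is 143 / 3.2, or roughly 45, matching the figure quoted on the slide.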

Good Cache Reuse Example

A Sample Application: Matrix-Matrix Multiply
Let A, B and C be matrices of sizes nr × nc, nr × nk and nk × nc, respectively. The definition of A = B • C is
$$a_{r,c} = \sum_{k=1}^{nk} b_{r,k} \, c_{k,c}$$
for r ∈ {1, ..., nr}, c ∈ {1, ..., nc}.

Matrix Multiply: Naïve Version
    SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
     &                                   nr, nc, nq)
      IMPLICIT NONE
      INTEGER,INTENT(IN) :: nr, nc, nq
      REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
      REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
      REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
      INTEGER :: r, c, q
      ! Column-major traversal: c is the outermost loop, r the next.
      DO c = 1, nc
        DO r = 1, nr
          dst(r,c) = 0.0
          ! Dot product of row r of src1 with column c of src2.
          DO q = 1, nq
            dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
          END DO
        END DO
      END DO
    END SUBROUTINE matrix_matrix_mult_naive

Performance of Matrix Multiply [Figure: performance plot, with "Better" marking the direction of improvement]

Tiling
  Tile: A small rectangular subdomain of a problem domain. Sometimes called a block or a chunk.
  Tiling: Breaking the domain into tiles.
  Tiling strategy: Operate on each tile to completion, then move to the next tile.
  Tile size can be set at runtime, according to what’s best for the machine that you’re running on.

Tiling Code
    SUBROUTINE matrix_matrix_mult_by_tiling (dst, src1, src2, nr, nc, nq, &
     &                                       rtilesize, ctilesize, qtilesize)
      IMPLICIT NONE
      INTEGER,INTENT(IN) :: nr, nc, nq
      REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
      REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
      REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
      INTEGER,INTENT(IN) :: rtilesize, ctilesize, qtilesize
      INTEGER :: rstart, rend, cstart, cend, qstart, qend
      ! Step through the domain one tile at a time, clipping each
      ! tile's end index at the edge of the matrix.
      DO cstart = 1, nc, ctilesize
        cend = cstart + ctilesize - 1
        IF (cend > nc) cend = nc
        DO rstart = 1, nr, rtilesize
          rend = rstart + rtilesize - 1
          IF (rend > nr) rend = nr
          DO qstart = 1, nq, qtilesize
            qend = qstart + qtilesize - 1
            IF (qend > nq) qend = nq
            CALL matrix_matrix_mult_tile(dst, src1, src2, nr, nc, nq, &
     &                                   rstart, rend, cstart, cend, qstart, qend)
          END DO
        END DO
      END DO
    END SUBROUTINE matrix_matrix_mult_by_tiling

Multiplying Within a Tile
    SUBROUTINE matrix_matrix_mult_tile (dst, src1, src2, nr, nc, nq, &
     &                                  rstart, rend, cstart, cend, qstart, qend)
      IMPLICIT NONE
      INTEGER,INTENT(IN) :: nr, nc, nq
      ! INTENT(INOUT) rather than INTENT(OUT): partial sums from earlier
      ! q tiles must survive from one call to the next.
      REAL,DIMENSION(nr,nc),INTENT(INOUT) :: dst
      REAL,DIMENSION(nr,nq),INTENT(IN) :: src1
      REAL,DIMENSION(nq,nc),INTENT(IN) :: src2
      INTEGER,INTENT(IN) :: rstart, rend, cstart, cend, qstart, qend
      INTEGER :: r, c, q
      DO c = cstart, cend
        DO r = rstart, rend
          ! Zero the element only on the first q tile.
          IF (qstart == 1) dst(r,c) = 0.0
          DO q = qstart, qend
            dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
          END DO
        END DO
      END DO
    END SUBROUTINE matrix_matrix_mult_tile

Reminder: Naïve Version, Again (the same code as on the slide "Matrix Multiply: Naïve Version")

Performance with Tiling [Figure: performance plot, with "Better" marking the direction of improvement]

The Advantages of Tiling
  It allows your code to exploit data locality better, to get much more cache reuse: your code runs faster!
  It’s a relatively modest amount of extra coding (typically a few wrapper functions and some changes to loop bounds).
  If you don’t need tiling, because of the hardware, the compiler or the problem size, then you can turn it off by simply setting the tile size equal to the problem size.
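The claim that tiling can be switched off by setting the tile size equal to the problem size can be checked with a small test. The sketch below is a simplified stand-in for the routines on the earlier slides, not a copy of them: it assumes square matrices and a single tile size for all three dimensions, and the names tiling_check and mult_tiled are hypothetical. It compares both settings against the MATMUL intrinsic.

```fortran
program tiling_check
  implicit none
  integer, parameter :: n = 64
  real :: src1(n,n), src2(n,n), ref(n,n), tiled(n,n), untiled(n,n)

  call random_number(src1)
  call random_number(src2)
  ref = matmul(src1, src2)                    ! reference result

  call mult_tiled(tiled, src1, src2, n, 16)   ! 16 x 16 x 16 tiles
  call mult_tiled(untiled, src1, src2, n, n)  ! one tile: tiling "off"

  print *, 'max difference, tile size 16:', maxval(abs(tiled - ref))
  print *, 'max difference, tile size n: ', maxval(abs(untiled - ref))

contains

  ! Simplified tiled multiply: square m x m matrices, one tile size.
  subroutine mult_tiled(dst, a, b, m, tilesize)
    integer, intent(in) :: m, tilesize
    real, intent(out) :: dst(m,m)
    real, intent(in)  :: a(m,m), b(m,m)
    integer :: cs, rs, qs, r, c, q

    dst = 0.0
    ! Outer three loops walk over tiles; inner three work within a tile.
    do cs = 1, m, tilesize
      do rs = 1, m, tilesize
        do qs = 1, m, tilesize
          do c = cs, min(cs + tilesize - 1, m)
            do r = rs, min(rs + tilesize - 1, m)
              do q = qs, min(qs + tilesize - 1, m)
                dst(r,c) = dst(r,c) + a(r,q) * b(q,c)
              end do
            end do
          end do
        end do
      end do
    end do
  end subroutine mult_tiled

end program tiling_check
```

Both differences should be at the level of single precision rounding error, since the two settings compute the same sums in a different order.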

Why Does Tiling Work Here?
Cache optimization works best when the number of calculations per byte is large. For example, with matrix-matrix multiply on an n × n matrix, there are O(n^3) calculations (on the order of n^3), but only O(n^2) bytes of data. So, for large n, there are a huge number of calculations per byte transferred between RAM and cache.
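To make the growth in calculations per byte concrete, the sketch below counts 2n^3 floating point operations (one multiply and one add per inner iteration) against three n × n matrices of 4-byte REAL values. The 4-byte size and the exact operation count are assumptions for illustration; the slide states only the orders of magnitude.

```fortran
program calcs_per_byte
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! ASSUMPTION: default REAL occupies 4 bytes.
  integer, parameter :: bytes_per_real = 4
  integer :: i, n
  real(dp) :: flops, bytes

  n = 10
  do i = 1, 4
    flops = 2.0_dp * real(n, dp)**3                   ! n^3 multiplies + n^3 adds
    bytes = 3.0_dp * real(n, dp)**2 * bytes_per_real  ! three n x n matrices
    print '(A, I6, A, F10.2)', 'n = ', n, &
          '   calculations per byte = ', flops / bytes
    n = n * 10
  end do
end program calcs_per_byte
```

Under these assumptions the ratio is n/6, so it grows linearly with n.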

Multicore/Many-core Basics

What is Multicore?
In the olden days (i.e., the first half of 2005), each CPU chip had one “brain” in it.
More recently, each CPU chip has 2 cores (brains), and, starting in late 2006, 4 cores.
Jargon: Each CPU chip plugs into a socket, so these days, to avoid confusion, people refer to sockets and cores, rather than CPUs or processors.
Each core is just like a full blown CPU, except that it shares its socket with one or more other cores, and therefore shares its bandwidth to RAM.

Dual Core, Quad Core, Oct Core [Figures: chips with 2, 4, and 8 cores]

The Challenge of Multicore: RAM
Each socket has access to a certain amount of RAM, at a fixed RAM bandwidth per SOCKET.
As the number of cores per socket increases, the contention for RAM bandwidth increases too.
At 2 cores in a socket, this problem isn’t too bad. But at 16 or 32 or 80 cores, it’s a huge problem.
So, applications that are cache optimized will get big speedups.
But, applications whose performance is limited by RAM bandwidth are going to speed up only as fast as RAM bandwidth speeds up.
RAM bandwidth increases much more slowly than CPU speed does.
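A rough way to see the contention is to divide a fixed per-socket bandwidth evenly among the cores. The sketch below uses the 10.66 GB/sec RAM figure from the earlier "RAM is Slow" slide as the per-socket bandwidth, which is an assumption made only for illustration, and it assumes all cores demand bandwidth equally.

```fortran
program bandwidth_per_core
  implicit none
  ! ASSUMPTION: 10.66 GB/sec is used as the fixed per-socket bandwidth.
  real, parameter :: socket_bw_gb_per_sec = 10.66
  ! Core counts mentioned on the slide, plus a single core for comparison.
  integer, parameter :: core_counts(5) = (/ 1, 2, 16, 32, 80 /)
  integer :: i

  do i = 1, size(core_counts)
    ! Even split: every core is assumed to want the same share.
    print '(A, I3, A, F6.3, A)', 'cores: ', core_counts(i), &
          '   share per core: ', &
          socket_bw_gb_per_sec / real(core_counts(i)), ' GB/sec'
  end do
end program bandwidth_per_core
```

The per-core share falls in direct proportion to the core count, which is why codes limited by RAM bandwidth gain little from extra cores.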

The Challenge of Multicore: Network
Each node has access to a certain number of network ports, at a fixed number of network ports per NODE.
As the number of cores per node increases, the contention for network ports increases too.
At 2 cores in a socket, this problem isn’t too bad. But at 16 or 32 or 80 cores, it’s a huge problem.
So, applications that do minimal communication will get big speedups.
But, applications whose performance is limited by the number of MPI messages are going to speed up very very little, and may even crash the node.

Multicore/Many-core Problem
Most multicore chip families have relatively small cache per core (e.g., 2 MB), and this problem seems likely to remain.
Small TLBs make the problem worse: the TLB covers only 512 KB per core rather than the full 2 MB.
So, to get good cache reuse, you need to partition the algorithm so that each subproblem needs no more than 512 KB.

The T.L.B. on a Current Chip
On Intel Core Duo (“Yonah”):
  Cache size is 2 MB per core.
  Page size is 4 KB.
  A core’s data TLB size is 128 page table entries.
  Therefore, D-TLB only covers 512 KB of cache.
The cost of a TLB miss is 49 cycles, equivalent to as many as 196 calculations! (4 FLOPs per cycle)
http://www.digit-life.com/articles2/cpu/rmma-via-c7.html
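The numbers on this slide follow from two multiplications, sketched below using only the values given on the slide. The program and variable names are illustrative.

```fortran
program tlb_reach
  implicit none
  ! All values are taken from the slide.
  integer, parameter :: page_size_kb    = 4
  integer, parameter :: tlb_entries     = 128
  integer, parameter :: cache_size_kb   = 2048   ! 2 MB per core
  integer, parameter :: miss_cycles     = 49
  integer, parameter :: flops_per_cycle = 4
  integer :: reach_kb

  ! Memory covered by the data TLB: one page per entry.
  reach_kb = tlb_entries * page_size_kb

  print '(A, I0, A)', 'D-TLB reach: ', reach_kb, ' KB'
  print '(A, I0, A)', 'Share of cache covered: ', &
        100 * reach_kb / cache_size_kb, '%'
  print '(A, I0)', 'Calculations lost per TLB miss: ', &
        miss_cycles * flops_per_cycle
end program tlb_reach
```

This gives a reach of 512 KB (a quarter of the 2 MB cache) and 196 calculations forgone per miss, as stated on the slide.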

What Do We Need?
  We need much bigger caches!
  TLB must be big enough to cover the entire cache.
  It’d be nice to have RAM speed increase as fast as core counts increase, but let’s not kid ourselves.

To Learn More Supercomputing: http://www.oscer.ou.edu/education.php

Computer Systems Spring 2008 Lecture 29. CMPs & SMTs
Adapted from Mary Jane Irwin (www.cse.psu.edu/~mji) [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

Multithreading on A Chip
Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions.
Multithreading: increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor.
  The processor must duplicate the state hardware for each thread: a separate register file, PC, instruction buffer, and store buffer for each thread.
  The caches, TLBs, BHT, BTB can be shared (although the miss rates may increase if they are not sized accordingly).
  The memory can be shared through virtual memory mechanisms.
  Hardware must support efficient thread context switching.

Types of Multithreading on a Chip
Fine-grain: switch threads on every instruction issue.
  Round-robin thread interleaving (skipping stalled threads).
  The processor must be able to switch threads on every clock cycle.
  Advantage: can hide throughput losses that come from both short and long stalls.
  Disadvantage: slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads.
Coarse-grain: switch threads only on costly stalls (e.g., L2 cache misses).
  Advantages: thread switching doesn’t have to be essentially free, and it is much less likely to slow down the execution of an individual thread.
  Disadvantage: limited, due to pipeline start-up costs, in its ability to overcome throughput loss.
  The pipeline must be flushed and refilled on thread switches.
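The fine-grain policy (round-robin interleaving that skips stalled threads) can be illustrated with a toy model of the thread selection step. The sketch below is not a model of any real processor: the thread count, the cycle count, and the stall pattern are hypothetical values chosen only to show the selection logic.

```fortran
program fine_grain_mt
  implicit none
  integer, parameter :: nthreads = 4, ncycles = 12
  ! stall_left(t) > 0 means thread t stays stalled for that many cycles.
  integer :: stall_left(nthreads)
  integer :: clk, last, t, k, chosen

  ! HYPOTHETICAL stall pattern: thread 2 starts out stalled for 3 cycles.
  stall_left = (/ 0, 3, 0, 0 /)
  last = nthreads              ! so that the first search starts at thread 1

  do clk = 1, ncycles
    chosen = 0
    ! Round-robin search starting just after the last issuing thread,
    ! skipping any thread that is still stalled.
    do k = 1, nthreads
      t = mod(last + k - 1, nthreads) + 1
      if (stall_left(t) == 0) then
        chosen = t
        exit
      end if
    end do

    if (chosen > 0) then
      print '(A, I2, A, I1)', 'cycle ', clk, ': issue from thread ', chosen
      last = chosen
    else
      print '(A, I2, A)', 'cycle ', clk, ': all threads stalled'
    end if

    ! One cycle passes for every stalled thread.
    where (stall_left > 0) stall_left = stall_left - 1
  end do
end program fine_grain_mt
```

A coarse-grain policy would instead keep issuing from one thread and change the selection only when that thread hits a costly stall, paying the pipeline refill cost at each switch.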

Multithreaded Example: Sun’s Niagara (UltraSparc T1)
Eight fine grain multithreaded single-issue, in-order cores (no speculation, no dynamic branch prediction)

                   Ultra III               Niagara
Data width         64-b                    64-b
Clock rate         1.2 GHz                 1.0 GHz
Cache (I/D/L2)     32K/64K/(8M external)   16K/8K/3M
Issue rate         4 issue                 1 issue
Pipe stages        14 stages               6 stages
BHT entries        16K x 2-b               None
TLB entries        128I/512D               64I/64D
Memory BW          2.4 GB/s                ~20GB/s
Transistors        29 million              200 million
Power (max)        53 W                    <60 W

[Diagram: eight 4-way MT SPARC pipes, a crossbar, a 4-way banked L2$, memory controllers, and I/O / shared functions]
L1 caches support only two coherence states: valid and invalid lines. The L1 data cache is write-through, so there is no modified state. The L2 cache keeps a directory of all eight L1 caches and can invalidate lines that are modified (using the MESI protocol).
Other notes: the UltraSparc T1 has only one FPU, making this chip pretty bad at scientific codes.

Niagara Integer Pipeline
Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient.
Pipeline stages: Fetch, Thrd Sel, Decode, Execute, Memory, WB.
[Diagram components: I$, ITLB, Inst buf x4, Thrd Sel Mux (two), Decode, RegFile x4, ALU, Mul, Shft, Div, D$, DTLB, Stbuf x4, Crossbar Interface, PC logic x4. Thread Select Logic inputs: instruction type, cache misses, traps & interrupts, resource conflicts.]
No speculative execution. Since the pipeline is short and there are multiple threads per core, branch prediction is unnecessary. The core can hide the time required to fetch the new instruction stream on a taken branch by switching to another thread during the clock delay. The register file has eight register windows with three read ports and two write ports. Threads are issued round-robin, but stalled threads will get priority when they are ready to resume.
From MPR, Vol. 18, #9, Sept. 2004

Simultaneous Multithreading (SMT)
A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP).
  Most SS processors have more machine level parallelism than most programs can effectively use (i.e., than have ILP).
  With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them.
    Need separate rename tables (ROBs) for each thread.
    Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle.
Intel’s Pentium 4 SMT is called hyperthreading.
  Supports just two threads (doubles the architecture state).

Threading on a 4-way SS Processor Example
[Figure: issue slots (horizontal) versus time (vertical) for Coarse MT, Fine MT, and SMT, running Threads A, B, C, and D]
  Coarse MT takes 27 cycles to complete. (Assumes that coarse MT takes one cycle start-up time, which is optimistic.)
  Fine MT takes 25 cycles to complete.
  SMT takes 14 cycles to complete.

Multicore Xbox360 – “Xenon” processor
To provide game developers with a balanced and powerful platform.
Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
  165M transistors total
  3.2 GHz Near-POWER ISA
  2-issue, 21 stage pipeline, with 128 128-bit registers
  Weak branch prediction, supported by software hinting
  In order instructions
  Narrow cores: 2 INT units, 2 128-bit VMX units, 1 of anything else
An ATI-designed 500 MHz GPU w/ 512MB of DDR3 DRAM
  337M transistors, 10MB framebuffer
  48 pixel shader cores, each with 4 ALUs
Things to note: the 32-bit Power ISA supports 32 registers natively. Moving to 128 registers requires ‘cramming’ 7-bit register operands in. No one knows how they do it, but it’s quirky. The branch predictor is quite simple, and my guess is that it’s either a 1-bit predictor or a small 2-bit predictor. Microsoft has presented a number of papers on how software hinted and compiler supported branch prediction can help. A “VMX” unit is the colloquial term for the SIMD operations similar to AltiVec we see on board. This one is custom modified to support Direct3D data format packing and unpacking.
Other notes: the GPU is twice as big as the CPU. The 10MB framebuffer is an off-chip high-speed memory explicitly for full-screen anti-aliasing. In FSAA, you need to do 5 reads and 1 write per pixel, which quickly floods any memory subsystem. Instead, they build it into the framebuffer itself, which is a very fast little chip that does nothing but hold the image and smooth it out.

Xenon Diagram
[Diagram components: Core 0, Core 1, Core 2 (each with L1D and L1I), 1MB UL2, BIU/IO Intf, GPU (3D Core, 10MB EDRAM, Video Out, MC0, MC1), 512MB DRAM, Analog Chip, XMA Dec, SMC, DVD, HDD Port, Front USBs (2), Wireless, MU ports (2 USBs), Rear USB (1), Ethernet, IR, Audio Out, Flash, Systems Control, Video Out]
It is important to note the way that data can be streamed from the L2 cache to the GPU. In particular, the L2 can have banks ‘locked’ away from normal use and set aside for direct FIFO access to the GPU. This allows the processor to stream data into the GPU very efficiently, without clogging up the cache, and ensuring optimal bandwidth usage. This is especially useful in "procedural synthesis", where a template object (such as a tree) is programmatically modified slightly each time it is drawn, to make it look natural. The locked cache allows FIFO streaming of such objects to the GPU without reducing available bandwidth to the processor, and without thrashing the cache.
Also of note is that if you run two of the three processors at full-tilt, it's just enough to feed the GPU at full-rate. The system was meant for 6 threads, four of which are graphics threads doing procedural synthesis and the like.

The PS3 “Cell” Processor Architecture
Composed of a Non-SMP Architecture
  234M transistors @ 4 GHz
  1 Power Processing Element (PPE), 8 “Synergistic” (SIMD) PEs (SPEs)
  512KB L2 $
  Massively high bandwidth (200GB/s) bus connects it to everything else
The PPE is strangely similar to one of the Xenon cores
  Almost identical, really. Slight ISA differences, and fine-grained MT instead of real SMT
The real differences lie in the SPEs (21M transistors each)
  An attempt to ‘fix’ the memory latency problem by giving each processor complete control over its own 256KB “scratchpad” (14M transistors)
  Direct mapped for low latency
  4 vector units per SPE, 1 of everything else (7M transistors)
Marketing-related info: the PPE is /so/ similar to the Xenon that other than some specialized SIMD instructions, code is near compatible. (Instruction length also differs, but that's a 'minor' issue.) What really matters is that Microsoft has a real leg up on the 'mental pull' to developers. The reason is that code that's developed on the Xenon will compile and run, with very few modifications, on the PPE of the Cell. As such, Xenon has 3 "PPE-style" processors, allowing the primary development path to be MS-based. After all, once you get the game working with the much more comfortable Xenon architecture, you can then try to put some rough segments onto the SPEs, and hope for some speedup. The trick is that this way, most of the development time will be spent in Xenon-native development, rather than Cell-native. This gives the dev team more time to optimize the Xenon code, and more importantly tends to increase the amount of code that will eventually run on the PPE. A full Cell development process would start with the SPE sub-programs, but since that isn't a portable development process on either the Xbox or the Revolution, MS is hoping developers won't use it.
By short-circuiting the PS3 development process with such a compatible and comfortable platform, MS is hoping to reduce utilization of the SPEs and encourage over-reliance on the PPE, reducing the Cell's functional utilization.

How to make use of the SPEs
Note that this process requires 8 SPEs, and only 7 are enabled in the PS3's Cell. As such, some routines must be run on the same SPE, resulting in lower performance.
Also note that the memory subsystem on your average desktop machine is around 6.5 GB/s. The graphics memory on your high-end video card gives maybe 25GB/s. The bus transmitting all of that data gives 200GB/s, enough for the PPE and all 7 SPEs to run at 25GB/s each. That bus is the "EIB" (Element Interconnect Bus), which allows all of this performance to happen. It is a 3-segment 96B/cycle bus, and really is the backbone of the design. Without it, none of this would matter.

What about the Software?
Makes use of special IBM “Hypervisor”
  Like an OS for OS’s
  Runs both a real time OS (for sound) and non-real time (for things like AI)
Software must be specially coded to run well
  The single PPE will be quickly bogged down
  Must make use of SPEs wherever possible
  This isn’t easy, by any standard
What about Microsoft?
  Development suite identifies which 6 threads you’re expected to run
  Four of them are DirectX based, and handled by the OS
  Only need to write two threads, functionally
http://ps3forums.com/showthread.php?t=22858

Next Lecture and Reminders Final is Wednesday, May 7 from 1 to 2:50 PM in ITT 322