Comparison of Next Generation Gaming Architectures. Presented by Dela Tsiagbe.

Presentation transcript:

Comparison of Next Generation Gaming Architectures
Presented by Dela Tsiagbe

Introduction
* Brief History of Gaming Platforms
* Difference between consoles and personal computers
* Look at actual Architecture
* Comparison of Vendors
* Summary

History of gaming
* Video gaming itself dates back to the 1960s and 1970s
* Consoles such as the Magnavox Odyssey, Atari, and ColecoVision made gaming popular
* NES
* Storytelling

Difference between Consoles and PCs
* In the past, a PC had far more computing power than a console.
* Consoles today require much more computing power.
* In most cases a console delivers more computing power for its price: you get more for your money buying a gaming console than buying a PC at the same price.

Difference between Consoles and PCs (continued)
Xbox 360 Stats
Custom IBM PowerPC-based CPU
* 3 symmetrical cores running at 3.2 GHz each
* 2 hardware threads per core; 6 hardware threads total
* 1 VMX-128 vector unit per core; 3 total
* 128 VMX-128 registers per hardware thread
* 1 MB L2 cache
CPU Game Math Performance
* 9 billion dot product operations per second
Custom ATI Graphics Processor
* 500 MHz
* 10 MB embedded DRAM
* 48-way parallel floating-point dynamically-scheduled shader pipelines
* Unified shader architecture
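
The CPU figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: the variable names are made up for illustration, and the idea that each core's vector unit completes one dot product per clock cycle is an assumption, not something the slide states. Under that assumption the peak comes out at 9.6 billion per second, in the same range as the quoted "9 billion".

```python
# Back-of-the-envelope check of the Xbox 360 CPU figures quoted on the slide.
CORES = 3
CLOCK_HZ = 3_200_000_000        # 3.2 GHz per core
THREADS_PER_CORE = 2
VECTOR_UNITS_PER_CORE = 1

hardware_threads = CORES * THREADS_PER_CORE        # slide: 6 total
vector_units = CORES * VECTOR_UNITS_PER_CORE       # slide: 3 total

# ASSUMPTION (not from the slides): one dot product per vector unit per cycle.
DOT_PRODUCTS_PER_CYCLE = 1
peak_dot_products_per_sec = vector_units * CLOCK_HZ * DOT_PRODUCTS_PER_CYCLE

print("hardware threads:", hardware_threads)
print("vector units:", vector_units)
print("peak dot products/sec (billions):", peak_dot_products_per_sec / 1e9)
```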

Difference between Consoles and PCs (continued)
* PowerPC-based
* 1 VMX vector unit per core
* 512KB L2 cache
* 7 x
* 7 x 128b 128 SIMD GPRs
* 7 x 256KB SRAM for SPE
* 1 of 8 SPEs reserved for redundancy
* Total floating point performance: 218 GFLOPS

Difference between Consoles and PCs (continued)
Things to consider:
* Although there is less memory, only a minimal OS is running in the background
* Compatibility of hardware is never a problem
* There is very little overhead from the system itself

Types of processors
* Xbox 360 - Xenon
* PS3 - PowerPC Cell

PS3 Schematics

Xbox 360 Schematics

PowerPC Instruction Set
li REG, VALUE
    loads register REG with the number VALUE
add REGA, REGB, REGC
    adds REGB to REGC and stores the result in REGA
addi REGA, REGB, VALUE
    adds the number VALUE to REGB and stores the result in REGA
mr REGA, REGB
    copies the value in REGB into REGA
or REGA, REGB, REGC
    performs a logical "or" between REGB and REGC, and stores the result in REGA
ori REGA, REGB, VALUE
    performs a logical "or" between REGB and VALUE, and stores the result in REGA
and, andi, xor, xori, nand, and nor
    all of these follow the same pattern as "or" and "ori" for the other logical operations
ld REGA, 0(REGB)

PowerPC Instruction Set
    use the contents of REGB as the memory address of the value to load into REGA
lbz, lhz, and lwz
    all of these follow the same format, but operate on bytes, halfwords, and words, respectively (the "z" indicates that they also zero out the rest of the register)
b ADDRESS
    jump (or branch) to the instruction at address ADDRESS
bl ADDRESS
    subroutine call to address ADDRESS
cmpd REGA, REGB
    compare the contents of REGA and REGB, and set the bits of the condition register appropriately
beq ADDRESS
    branch to ADDRESS if the previously compared register contents were equal
bne, blt, bgt, ble, and bge
    all of these follow the same form, but check for inequality, less than, greater than, less than or equal to, and greater than or equal to, respectively
std REGA, 0(REGB)
    use the contents of REGB as the memory address to save the value of REGA into
stb, sth, and stw
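
To make the instruction descriptions concrete, here is a minimal sketch of an interpreter for a handful of them, written in Python rather than assembly so it can be run anywhere. It is an illustration only, not an emulator. The function `run`, the tuple-based program format, the use of list indices as branch targets, and the single `cmp_result` value standing in for the condition register are all assumptions made for this sketch. Registers are also not wrapped to 64 bits, and loads and stores always use a displacement of 0.

```python
def run(program, memory=None, max_steps=1000):
    """Interpret a tiny subset of the instructions listed above.

    program: list of tuples (mnemonic, operands...). Branch targets are
             indices into this list, not real addresses (a simplification).
    memory:  dict mapping byte address -> byte value.
    """
    regs = {}                      # register name -> integer value
    memory = {} if memory is None else memory
    cmp_result = 0                 # -1, 0 or 1; stands in for the condition register
    pc = 0                         # index of the next instruction to execute
    steps = 0
    while pc < len(program) and steps < max_steps:
        op, *args = program[pc]
        pc += 1
        steps += 1
        if op == "li":             # li REG, VALUE
            regs[args[0]] = args[1]
        elif op == "mr":           # mr REGA, REGB
            regs[args[0]] = regs[args[1]]
        elif op == "add":          # add REGA, REGB, REGC
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "addi":         # addi REGA, REGB, VALUE
            regs[args[0]] = regs[args[1]] + args[2]
        elif op == "or":           # or REGA, REGB, REGC
            regs[args[0]] = regs[args[1]] | regs[args[2]]
        elif op == "ori":          # ori REGA, REGB, VALUE
            regs[args[0]] = regs[args[1]] | args[2]
        elif op == "lbz":          # lbz REGA, 0(REGB): load byte, zero the rest
            regs[args[0]] = memory.get(regs[args[1]], 0) & 0xFF
        elif op == "stb":          # stb REGA, 0(REGB): store the low byte
            memory[regs[args[1]]] = regs[args[0]] & 0xFF
        elif op == "cmpd":         # cmpd REGA, REGB
            a, b = regs[args[0]], regs[args[1]]
            cmp_result = (a > b) - (a < b)
        elif op == "b":            # unconditional branch
            pc = args[0]
        elif op == "beq":          # branch if last comparison was equal
            if cmp_result == 0:
                pc = args[0]
        elif op == "bne":          # branch if last comparison was not equal
            if cmp_result != 0:
                pc = args[0]
        elif op == "blt":          # branch if REGA was less than REGB
            if cmp_result < 0:
                pc = args[0]
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return regs, memory


# Sum the integers 1..5, store the low byte of the sum, then load it back.
program = [
    ("li", "r3", 0),              # 0: sum = 0
    ("li", "r4", 1),              # 1: i = 1
    ("li", "r5", 6),              # 2: limit = 6
    ("cmpd", "r4", "r5"),         # 3: compare i with limit
    ("beq", 8),                   # 4: leave the loop when i == limit
    ("add", "r3", "r3", "r4"),    # 5: sum += i
    ("addi", "r4", "r4", 1),      # 6: i += 1
    ("b", 3),                     # 7: back to the comparison
    ("li", "r6", 100),            # 8: r6 = address 100
    ("stb", "r3", "r6"),          # 9: memory[100] = low byte of sum
    ("lbz", "r7", "r6"),          # 10: r7 = memory[100], zero-extended
]
regs, memory = run(program)
print(regs["r3"], memory[100], regs["r7"])
```

The loop shows the compare-then-branch pattern from the slide: `cmpd` records the result of a comparison, and the following `beq` reads that result to decide whether to jump.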

CPU Specs
Three 3.2 GHz PowerPC cores
* Shared 1MB L2 cache, 8-way set associative
* Per-Core Features
  – 2-issue per cycle, in-order, decoupled Vector/Scalar issue queue
  – 2 symmetric fine grain hardware threads
  – L1 Caches: 32K 2-way I$ / 32K 4-way D$
  – Execution pipelines
    * Branch Unit, Integer Unit, Load/Store Unit
    * VMX128 Units: Floating Point Unit, Permute Unit, Simple Unit
    * Scalar FPU
* VMX128 enhanced for game and graphics workloads
  – All execution units 4-way SIMD
  – bit vector registers per thread
  – Custom dot-product instruction
  – Native D3D compressed data formats
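
The slide mentions 4-way SIMD execution units and a custom dot-product instruction. As a rough illustration of why that matters for game math, the sketch below models a 4-element dot product as a single operation and uses it to transform a vertex by a 4x4 matrix, which takes four such operations. The names `dot4` and `transform` are hypothetical and this is plain Python, not VMX128 code; real hardware would do the four multiplies in parallel lanes.

```python
def dot4(a, b):
    """Dot product of two 4-element vectors.

    Stands in for one hardware dot-product instruction: four lane-wise
    multiplies followed by a sum across the lanes.
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("dot4 expects two 4-element vectors")
    products = [x * y for x, y in zip(a, b)]   # the 4-way SIMD multiply
    return sum(products)                       # the horizontal add


def transform(matrix, vertex):
    """Multiply a 4x4 row-major matrix by a 4-element column vector.

    Each output component is one dot product, so a full vertex transform
    costs four dot4 operations.
    """
    return [dot4(row, vertex) for row in matrix]


identity = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
# Translation by (10, 20, 30) in homogeneous coordinates.
translate = [
    [1, 0, 0, 10],
    [0, 1, 0, 20],
    [0, 0, 1, 30],
    [0, 0, 0, 1],
]
vertex = [1, 2, 3, 1]
print(dot4([1, 2, 3, 4], [5, 6, 7, 8]))
print(transform(translate, vertex))
```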

CPU Data Streams
High bandwidth data streaming support with minimal cache thrashing
– 128B cache line size (all caches)
– Flexible set locking in L2
– Write streaming:
  * L1s are write through, writes do not allocate in L1
  * 4 uncacheable write gathering buffers per core
  * 8 cacheable, non-sequential write gathering buffers per core
– Read streaming:
  * xDCBT data prefetch around L2, directly into L1
  * 8 outstanding load/prefetches per core
Tight GPU data streaming integration (XPS)
– XPS – “Xbox Procedural Synthesis”
– GPU 128B read from L2
– GPU low latency cacheable writebacks to CPU
– GPU shares D3D compressed data formats with CPU => at least 2x effective bus bandwidth for typical graphics data
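
Combining the 128B line size given here with the cache sizes and associativities from the CPU Specs slide gives the number of sets in each cache. The sketch below works that out and shows how an address would split into tag, set index, and line offset. It assumes a conventional set-associative layout and that "K" and "MB" mean 1024 and 1024 x 1024 bytes; the helper names `num_sets` and `split_address` are invented for the example.

```python
KIB = 1024
MIB = 1024 * KIB
LINE_BYTES = 128                       # 128B cache line size (all caches)


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets = capacity / (associativity * line size)."""
    return size_bytes // (ways * line_bytes)


def split_address(addr, sets, line_bytes=LINE_BYTES):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


l2_sets = num_sets(1 * MIB, ways=8)     # shared 1MB L2, 8-way
l1i_sets = num_sets(32 * KIB, ways=2)   # 32K 2-way instruction cache
l1d_sets = num_sets(32 * KIB, ways=4)   # 32K 4-way data cache

print("L2 sets:", l2_sets)
print("L1 I$ sets:", l1i_sets)
print("L1 D$ sets:", l1d_sets)
print("0x12345 in L1 D$:", split_address(0x12345, l1d_sets))
```

With lines this long, a streaming read touches a new line only once every 128 bytes, which is part of why the slide pairs the large line size with prefetching and write gathering.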

GPU
500 MHz graphics processor
– 48 parallel shader cores (ALUs); dynamically scheduled; 32-bit IEEE floating point
– 24 billion shader instructions per second
  * Superscalar design: vector, scalar and texture ops per instruction
– Pixel fillrate: 4 billion pixels/sec (8 per cycle); 2x for depth / stencil only
  * AA: 16 billion samples/sec; 2x for depth / stencil only
– Geometry rate: 500 million triangles/sec
– Texture rate: 8 billion bilinear filtered samples / sec
10 MB EDRAM: 256 GB/s fill
Direct3D 9.0-compatible
– High-Level Shader Language (HLSL) 3.0+ support
Custom features
– Memory export: Particle physics, Subdivision surfaces
– Tiling acceleration: Full resolution Hi-Z, Predicated Primitives
– XPS:
  * CPU cores can be slaved to GPU processing
  * GPU reads geometry data directly from L2
– Hardware scaling for display resolution matching
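
Most of the headline rates on this slide follow directly from the 500 MHz clock. The short sketch below reproduces two of them and derives the per-cycle figures implied by the others. The per-cycle values for AA samples, texture samples, triangles, and EDRAM bytes are computed here from the quoted totals and are not stated on the slide; the variable names are illustrative.

```python
CLOCK_HZ = 500_000_000                 # 500 MHz graphics processor

# Rates the slide states together with a per-cycle figure.
shader_alus = 48
pixels_per_cycle = 8
shader_instr_per_sec = shader_alus * CLOCK_HZ      # slide: 24 billion
pixel_fill_per_sec = pixels_per_cycle * CLOCK_HZ   # slide: 4 billion
depth_only_fill_per_sec = 2 * pixel_fill_per_sec   # "2x for depth / stencil only"

# Per-cycle figures implied by the quoted totals (derived, not from the slide).
aa_samples_per_pixel = 16_000_000_000 // pixel_fill_per_sec
texture_samples_per_cycle = 8_000_000_000 // CLOCK_HZ
triangles_per_cycle = 500_000_000 // CLOCK_HZ
edram_bytes_per_cycle = 256_000_000_000 // CLOCK_HZ   # taking 1 GB as 10**9 bytes

print("shader instructions/sec:", shader_instr_per_sec)
print("pixel fill/sec:", pixel_fill_per_sec)
print("AA samples per pixel:", aa_samples_per_pixel)
print("texture samples per cycle:", texture_samples_per_cycle)
print("triangles per cycle:", triangles_per_cycle)
print("EDRAM bytes per cycle:", edram_bytes_per_cycle)
```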

GPU Block Diagram

Software
SMP/SMT
– Mainstream techniques
– Everything is simplified by being symmetric
UMA
– No partitioning headaches
OS
– All 3 cores available for game developers
Standard APIs
– Win32, OpenMP
– Direct3D, HLSL
– Assembly (CPU & Shader) supported - direct hardware access
Standard tools
– XNA: PIX, XACT
– Visual C++, works with multiple threads...
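
The point about symmetry is that every hardware thread can run the same code over its own slice of the data, with no special partitioning. The sketch below shows that pattern in Python using a standard thread pool, splitting a position update evenly across six workers to mirror the 3 cores x 2 hardware threads. It only illustrates the work-splitting idea: the function names are hypothetical, and on the console itself this would be written against Win32 threads or OpenMP as the slide lists, not Python.

```python
from concurrent.futures import ThreadPoolExecutor

HARDWARE_THREADS = 6   # 3 cores x 2 hardware threads, per the slides


def update_chunk(positions, velocities, dt):
    """Advance one slice of positions by velocity * dt."""
    return [p + v * dt for p, v in zip(positions, velocities)]


def parallel_update(positions, velocities, dt, workers=HARDWARE_THREADS):
    """Split the arrays into equal slices and update each on its own worker."""
    n = len(positions)
    chunk = max(1, (n + workers - 1) // workers)    # ceiling division, never 0
    bounds = [(start, min(start + chunk, n)) for start in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(update_chunk, positions[a:b], velocities[a:b], dt)
            for a, b in bounds
        ]
        result = []
        for future in futures:          # collect in submission order
            result.extend(future.result())
    return result


positions = [float(i) for i in range(10)]
velocities = [2.0] * 10
new_positions = parallel_update(positions, velocities, dt=0.5)
print(new_positions)
```

Because every worker runs the same function on the same kind of data, the split is a simple division of the index range, which is the simplification the slide attributes to a symmetric design.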