Emotion Engine™ AKA the “Playstation 2” Architecture Or The progeny of a MIPS and a DSP By Idan Gazit – June 2002.


Overview
- Based around a modified and extended MIPS R3000 core.
- Designed from the ground up to run “media applications” (read: games) VERY fast – but can also function as a general-purpose CPU.
- Bears much resemblance to DSPs (Digital Signal Processors) – more on this later.

Basic Layout – Parallelism is Key!
- MIPS R3K CPU
- 1 FPU (floating-point) coprocessor
- 2 VUs (Vector Units) – more on this later
- Graphics Interface unit (GIF) – passes rendered data on to the Graphics Synth, which does the work of actually “drawing” it to the screen.
- 128b-wide main bus
- 10-channel DMA controller

Basic Layout

The Nitty-Gritty
- The main job of the EE is to render entire frames. The product is a “display list”, i.e. a list of geometry (points, polygons, textures) and where it needs to be placed on the screen.
- All of this needs to be done very fast, so note the very wide data paths (128b main bus, plus additional “private” links between certain units).
- Also the 10-channel DMA controller – the CPU shouldn’t waste time on I/O. Multiple connections between different units allow more than one I/O transaction at once, so long as they’re on different buses.
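To make the “display list” idea concrete, here is a minimal sketch in Python. The structure and field names are purely illustrative assumptions – the real GS packet format is different and not described here.

```python
# Hypothetical display-list sketch: the EE's output is a stream of geometry
# plus screen placement, which the Graphics Synth consumes. All names and
# fields here are illustrative, not the real hardware format.
from dataclasses import dataclass, field

@dataclass
class Polygon:
    vertices: list    # screen-space (x, y, z) tuples
    texture_id: int   # which texture to sample when drawing

@dataclass
class DisplayList:
    polygons: list = field(default_factory=list)

    def add(self, poly: Polygon) -> None:
        self.polygons.append(poly)

dl = DisplayList()
dl.add(Polygon(vertices=[(0, 0, 1), (10, 0, 1), (0, 10, 1)], texture_id=3))
assert len(dl.polygons) == 1
```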

The CPU
- Honest, it’s just a plain MIPS with some minor extensions.
- 32 x 128b general-purpose registers
- 2 x 64b ALUs (Arithmetic Logic Units)
- 1 x 128b load/store unit (parallelism again – load/store 4 words at once)
- 1 branch execution unit
- 2 coprocessors: FPU and VU0 – proper MIPS coprocessors controlled by COP instructions!

The CPU
- Able to do two 64b integer ops per cycle, or one 64b int op and one 128b load/store.
- The ALUs are interesting: they are pipelined, but can be used two ways:
  - Separately, as in normal CPUs (2 x 64b ops)
  - Locked, to perform one 128b instruction:
    - 16 x 8b ops in one cycle
    - 8 x 16b ops in one cycle
    - 4 x 32b ops in one cycle
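A rough way to picture the locked mode: one 128b operation is really N independent lane-wise operations, with no carries crossing lane boundaries. The following Python model illustrates that semantics only – it is not real EE code.

```python
# Illustrative model of locked-ALU packed arithmetic: treat a 128-bit value
# as lanes of 8, 16, or 32 bits and add lane by lane, discarding any carry
# out of each lane (so lanes never interfere with each other).
def packed_add(a: int, b: int, lane_bits: int) -> int:
    mask = (1 << lane_bits) - 1
    result = 0
    for i in range(128 // lane_bits):
        lane_a = (a >> (i * lane_bits)) & mask
        lane_b = (b >> (i * lane_bits)) & mask
        result |= ((lane_a + lane_b) & mask) << (i * lane_bits)
    return result

# 8-bit lanes: 0xFF + 0x01 wraps to 0x00 inside its lane, no carry out.
assert packed_add(0xFF, 0x01, 8) == 0x00
# Same bits with 16-bit lanes: the sum fits, so no wrap occurs.
assert packed_add(0xFF, 0x01, 16) == 0x100
```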

The CPU
Example supported instructions:
- MUL/DIV instructions
- 3-op MUL/MADD instructions
- Arithmetic ADD/SUB instructions
- Pack and extend instructions
- Min/Max instructions
- Absolute-value instructions
- Shift instructions
- Logical instructions
- Compare instructions
- Quadword load/store (remember, 128b L/S unit)

The CPU
- 8k data / 16k instruction cache, 2-way set-associative
- 6-stage pipeline (shallow, compared to modern PC architectures)
- Speculative execution is possible, but the penalty for a branch miss isn’t bad because it’s a short pipeline.
- Pipeline stages:
  1. PC select
  2. Instruction fetch
  3. Instruction decode and register read
  4. Execute
  5. Cache access
  6. Writeback

The CPU
- 16k of SPRAM – “scratchpad” RAM – VERY VERY FAST, in the CPU core.
- What is this stuff? It’s a very fast local data memory shared by the CPU and VU0.
- The 128b “private” link between the CPU and VU0 allows VU0 to use the SPRAM and the CPU to directly reference the VU’s registers.
- Which leads us nicely to the fact that the really difficult work is performed by…

Vector Units: The Heart of the EE
- FMAC: floating-point multiply-accumulate.
- As it turns out, this operation is critical to 3D rendering, and is performed many times in tight loops.
- An obvious candidate for parallelism and pipelining!
- Between the two VUs and the FPU there are a total of 10 FMAC units, each able to do 1 FMAC per cycle – plus other useful instructions.
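Why FMAC dominates 3D work: transforming one vertex by a 4x4 matrix is 16 multiply-accumulates, repeated for every vertex in a model. A plain-Python sketch of that inner loop – a VU would do each row’s four MACs in parallel rather than one at a time:

```python
# Transform a 4-element vertex by a 4x4 matrix using nothing but
# multiply-accumulate steps: acc += m[row][col] * v[col].
def transform(matrix, vertex):
    out = [0.0, 0.0, 0.0, 0.0]
    for row in range(4):
        acc = 0.0
        for col in range(4):
            acc += matrix[row][col] * vertex[col]  # one FMAC
        out[row] = acc
    return out

identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
assert transform(identity, [1.0, 2.0, 3.0, 1.0]) == [1.0, 2.0, 3.0, 1.0]
```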

Example VU “Useful Instructions”
- FMAC: 1 cycle
- Min/Max: 1 cycle
- FDIV – another logical unit, 1 per VU:
  - Floating-point divide: 7 cycles
  - Square root: 7 cycles
  - Inverse square root: 13 cycles

Vector Units
- However, the two VUs differ in how they are utilized.
- Both are VLIW – they take long instructions carrying multiple pieces of data.
- The processing units are split into two “working groups”:
  - Group 1: CPU + FPU + VU0 – “Emotion Synthesis” on the diagram
  - Group 2: VU1 + GIF – “Geometry Processing” on the diagram

Group 1
- Here, the FPU and VU0 act as proper MIPS coprocessors, linked to the CPU by a private 128b-wide bus to avoid crowding the main bus.
- The FPU is nothing special, just another FPU coprocessor: 1 FMAC unit and 1 FDIV unit, each identical to the VU FMAC/FDIV.
- VU0 does the real heavy lifting when it comes to the math; the CPU acts more as a traffic director, feeding data as fast as it can to the VU for processing.

Group 1
- Although group 1 does geometry processing, it is also responsible for more general-purpose calculations, such as enemy AI, game physics, etc.
- Therefore group 1 has the (more generalized) CPU, whereas group 2 focuses only on geometry (and has only VU1 and the GIF).
- There is a definite hierarchy of control in group 1 – the CPU controls the FPU and VU0.

Group 1 – Vector Unit 0

- 32 x 128b FP registers, each holding 4 x 32b single-precision FP numbers
- 16 x 16b integer registers for int math
- Instructions are just standard 32b “COP” (coprocessor) instructions
- Data is passed from the CPU in 128b bundles, which the VIF (VU Interface) “unpacks” into 4 x 32b data words
- 8k each of data cache / instruction cache
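The VIF’s “unpack” step can be pictured like this: a 16-byte (128b) bundle is split into four 32b single-precision floats destined for one VU register. The little-endian, tightly-packed layout here is an assumption for illustration, not a claim about the real VIF encoding.

```python
# Sketch of VIF-style unpacking: one 128-bit bundle -> four 32-bit floats.
import struct

def unpack_bundle(bundle: bytes):
    """Split a 16-byte bundle into four single-precision floats."""
    assert len(bundle) == 16
    return list(struct.unpack('<4f', bundle))

bundle = struct.pack('<4f', 1.0, 2.0, 3.0, 4.0)  # build a 128-bit bundle
assert unpack_bundle(bundle) == [1.0, 2.0, 3.0, 4.0]
```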

Group 2
- Consists of VU1 and the GIF (Graphics Interface).
- VU1 acts like a standalone VLIW processor, and is not directly controlled by the CPU.
- Perhaps a proper name for VU1 is the “geometry processor” for the GIF – this is pure data processing, and it has to happen quickly to keep the GIF saturated with graphics to draw out to your TV.

Group 2 – Vector Unit 1

Same general features as VU0, but with some differences according to VU1’s role:
- Addition of an “EFU” (Elementary Functional Unit) – basically one more FMAC and FDIV unit doing the more rudimentary geometry calculations. Note a striking resemblance to the FPU from group 1…
- 16k each of data & instruction cache, up from 8k – since VU1 must handle geometry independently of the CPU, it ends up handling much more data than VU0.

Group 2 – Vector Unit 1
- Special direct connection between the data cache and the GIF.
- Why is this special? VU1 can work on a display list in cache and have it sent over to the GIF by DMA. That’s quicker than using the main bus to shuttle data around, less dependent on the CPU, and leaves the main bus free for load instructions.

Vector Unit Comparison
- The designers opted for flexibility, and thus the architecture is slightly confusing:
- VU0 is a coprocessor; VU1 is a VLIW mini-processor.
- BUT… VU0 can be switched into VLIW mode, where the CPU then communicates with it like VU1 (e.g. receiving 64b instruction “bundles” and parsing them with the VIF).

Vector Unit Instructions
- We really should treat the VUs as limited processors.
- Each 64b VLIW word breaks down into two 32b COP instructions: an “upper” instruction and a “lower” instruction.
- The upper/lower distinction is important; the types of work they do are different.
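A toy decode of the 64b word into its two 32b halves – which half is “upper” in the real encoding is not something this sketch claims:

```python
# Split a 64-bit VLIW word into two 32-bit instruction slots.
# (The upper/lower assignment here is an illustrative assumption.)
def split_vliw(word64: int):
    upper = (word64 >> 32) & 0xFFFFFFFF  # SIMD arithmetic slot
    lower = word64 & 0xFFFFFFFF          # utility slot (load/store, branch...)
    return upper, lower

upper, lower = split_vliw(0xDEADBEEF_CAFEF00D)
assert upper == 0xDEADBEEF
assert lower == 0xCAFEF00D
```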

Vector Unit Instructions
- Upper instructions: SIMD (Single Instruction, Multiple Data) instructions.
- Aptly named – these are the “fast” multimedia instructions that do the same operation on lots and lots of data.
- Logically, these types of instructions are handled by the special VU units: FMAC, FDIV, etc.
- Note that these instructions ONLY use the “special” units in each VU.

Vector Unit Instructions
- Lower instructions: non-SIMD type.
- More “utility” than processing:
  - Load/store instructions
  - Jump/branch instructions
  - Random number generation
  - EFU instructions (only in VU1 – remember, 1 FMAC and 1 FDIV)
- Note that these instructions use units in the VUs that I didn’t mention (RNG unit, load/store unit, etc.) – they’re the more “mundane” units for the more “mundane” tasks.

Flow of Execution
- So with all of this confusing flexibility, what do we get?
- Two ways of doing work:
  1. Group 1 and group 2 both render in parallel, both passing display lists on to the GIF.
  2. Group 1 (CPU, VU0, FPU) prepares instructions for VU1 – load/store, branching, etc. – which VU1 renders and passes on to the GIF.

Flow of Execution
- Method 1: parallel
- Method 2: serial

DSPs, PS2s and PCs, oh my!
- Essentially, the PS2 (like a DSP) performs a small number of instructions on a large amount of “uniform” data.
- Exactly the opposite of PCs, which perform large numbers of instructions on varying data.
- Side-effect bonus: good locality of reference – instructions on the PS2 don’t jump around much like on PCs, so there is less chance of cache misses or branch mispredictions.
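The DSP-style workload pattern in miniature – one tiny, branch-free operation applied uniformly to a long stream, which is exactly the shape that keeps instruction fetch local and branches predictable:

```python
# One small operation applied uniformly to a stream of data: the loop body
# is tiny and branch-free, so it exhibits the locality described above.
def scale_stream(samples, gain):
    return [s * gain for s in samples]

assert scale_stream([1.0, 2.0, 3.0], 0.5) == [0.5, 1.0, 1.5]
```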

DSPs, PS2s and PCs, oh my!
- Note the design decisions that promote data-intensive computing:
  - Wide buses, and private connections between units that move a lot of data.
  - VLIW – instructions come packaged with lots and lots of data.
  - Large registers and load/store units, with instructions geared towards SIMD style (e.g. a 128b load moves 4 words of data at once).
  - MASSIVE ability to execute inner-loop instructions (FMAC) in ONE CYCLE – with 10 FMAC units, 10 of these can be done per cycle. Even FDIVs are fast (7 cycles).

Conclusion
- The entire EE design is centered around a specialized purpose: games! It can run generalized apps, but with a penalty.
- How much of a penalty? Interesting question. Perhaps not much, because there is a general-purpose MIPS at the core.
- More similar in design to a DSP – a fixed, small set of instructions to be performed on large amounts of uniform data.

The End & References