
1 Emotion Engine™ AKA the “Playstation 2” Architecture Or The progeny of a MIPS and a DSP By Idan Gazit – June 2002

2 Overview Based around a heavily modified and extended 64-bit MIPS core (the R5900). Designed from the ground up to run “media applications” (read: games) VERY fast, but it can also function as a general-purpose CPU. Bears much resemblance to DSPs (Digital Signal Processors) – more on this later.

3 Basic Layout – Parallelism is Key! MIPS CPU core, 1 FPU (floating-point) coprocessor, 2 VUs (Vector Units) – more on these later, the Graphics Interface unit (GIF) – which passes rendered data on to the Graphics Synthesizer, the chip that does the work of actually “drawing” it to the screen, a 128-bit-wide main bus, and a 10-channel DMA controller.

4 Basic Layout

5 The Nitty-Gritty The main job of the EE is to render entire frames; the product is a “display list”, i.e. a list of geometry (points, polygons, textures) and where it needs to be placed on the screen. All of this needs to be done very fast, so note the very wide data paths (the 128-bit main bus, plus additional “private” links between certain units). Also note the 10-channel DMA controller – the CPU shouldn’t waste time on I/O. Multiple connections between different units allow more than one I/O transaction at once, as long as they’re on different buses.
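The deck never pins down a concrete display-list format, so here is a minimal, purely illustrative sketch in C (the struct layout and field names are assumptions, not the Graphics Synthesizer’s real packet format) of what “a list of geometry plus where it goes on screen” might look like as data:

```c
/* Hedged sketch of a "display list": a flat buffer of geometry the EE hands
 * to the GIF/Graphics Synthesizer.  The struct layout and names here are
 * illustrative only -- the real GS consumes GIF packets, not this struct. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    float x, y, z, w;        /* screen-space vertex position */
    float u, v;              /* texture coordinates */
    uint32_t rgba;           /* packed vertex color */
} DisplayListVertex;

typedef struct {
    uint32_t prim_type;              /* e.g. triangle strip (hypothetical code) */
    uint32_t vertex_count;
    DisplayListVertex vertices[64];  /* small fixed batch for the sketch */
} DisplayListPacket;

/* The EE's job each frame, in caricature: transform geometry and append it. */
static size_t emit_triangle(DisplayListPacket *pkt, size_t n,
                            const DisplayListVertex v[3])
{
    for (int i = 0; i < 3 && n < 64; ++i)
        pkt->vertices[n++] = v[i];
    pkt->vertex_count = (uint32_t)n;
    return n;
}
```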

6 The CPU Honest, it’s just a plain MIPS with some minor extensions: 32 x 128-bit general-purpose registers, 2 x 64-bit ALUs (Arithmetic Logic Units), 1 x 128-bit load/store unit (parallelism again – load/store 4 words at once), 1 branch execution unit, and 2 coprocessors, the FPU and VU0 – proper MIPS coprocessors controlled by COP instructions!

7 The CPU Able to do two 64-bit integer ops per cycle, or one 64-bit integer op and one 128-bit load/store. The ALUs are interesting: they are pipelined, but can be used two ways – separately, as in normal CPUs (2 x 64-bit ops), or locked together to perform one 128-bit instruction: 16 x 8-bit ops, 8 x 16-bit ops, or 4 x 32-bit ops in one cycle.
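To make the “locked ALU” mode concrete, here is a scalar C model of a single 128-bit operation doing 16 x 8-bit additions at once. This is only an illustration of the lane idea, not the EE’s actual multimedia instruction encoding:

```c
#include <stdint.h>

/* Scalar model of one "locked ALU" 128-bit operation: sixteen independent
 * 8-bit additions carried out as a single instruction.  On the EE this is
 * one multimedia instruction; here it is spelled out lane by lane. */
typedef struct { uint8_t lane[16]; } q128_t;   /* a 128-bit quadword, 16 x 8b */

static q128_t padd_b(q128_t a, q128_t b)
{
    q128_t r;
    for (int i = 0; i < 16; ++i)
        r.lane[i] = (uint8_t)(a.lane[i] + b.lane[i]);   /* wraps mod 256 */
    return r;
}
```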

8 The CPU Example supported instructions: MUL/DIV instructions, 3-op MUL/MADD instructions, arithmetic ADD/SUB instructions, pack and extend instructions, min/max instructions, absolute-value instructions, shift instructions, logical instructions, compare instructions, and quadword load/store (remember, the 128-bit L/S unit).
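The “pack” entry above is the least self-explanatory, so here is a hedged scalar sketch of the general idea – narrowing wide lanes into narrow ones. The saturating behaviour shown is one common flavour of pack; the EE’s own pack instructions may simply keep the low bytes instead, so treat this as the concept, not the ISA:

```c
#include <stdint.h>

/* Generic "pack" illustration: narrow eight 16-bit lanes into eight 8-bit
 * lanes with unsigned saturation.  This is one common flavor of a pack
 * instruction; the EE's own pack ops may pick low bytes or saturate
 * differently -- a sketch of the concept, not the real instruction. */
static void pack_u16_to_u8_sat(const uint16_t src[8], uint8_t dst[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (src[i] > 255) ? 255 : (uint8_t)src[i];
}
```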

9 The CPU 8 KB data / 16 KB instruction cache, 2-way set associative. 6-stage pipeline (shallow compared to modern PC architectures). Speculative execution is possible, and the penalty for a branch miss isn’t bad because the pipeline is short. Pipeline stages: 1. PC select, 2. Instruction fetch, 3. Instruction decode and register read, 4. Execute, 5. Cache access, 6. Writeback.
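A toy model of why the shallow pipeline keeps the branch-miss penalty small. The stage names follow the slide; the assumption that branches resolve in the execute stage (and the 20-stage “PC-style” comparison point) is mine, not from the deck:

```c
#include <stdio.h>

/* Toy model: a mispredicted branch throws away the instructions fetched
 * after it, so the penalty grows with the number of stages between fetch
 * and the stage where the branch is resolved.  Stage numbers follow the
 * slide's 6-stage list; "resolved in execute" is an assumption. */
enum { PC_SELECT = 1, FETCH, DECODE_REGREAD, EXECUTE, CACHE_ACCESS, WRITEBACK };

int main(void)
{
    int resolve_stage = EXECUTE;                    /* assumption, not from deck */
    int ee_penalty    = resolve_stage - PC_SELECT;  /* ~3 bubble cycles          */
    int deep_penalty  = 20 - 1;                     /* a 20-stage PC-style pipe  */
    printf("EE-style miss penalty : ~%d cycles\n", ee_penalty);
    printf("deep-pipe miss penalty: ~%d cycles\n", deep_penalty);
    return 0;
}
```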

10 The CPU 16 KB of SPRAM – “scratch pad” RAM – VERY VERY FAST, sitting in the CPU core. What is this stuff? It is a very fast, software-managed data memory shared by the CPU and VU0. The 128-bit “private” link between the CPU and VU0 allows VU0 to use the SPRAM and the CPU to directly reference the VU’s registers. Which leads us nicely to the fact that the really difficult work is performed by…
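The usual scratch-pad usage pattern, sketched in C: stage a chunk of data into the small fast memory, process it there, and copy results back out. The 16 KB figure comes from the slide; the plain array and memcpy calls stand in for the EE’s memory-mapped SPRAM and DMA transfers:

```c
#include <string.h>
#include <stdint.h>

/* Sketch of the scratch-pad pattern: stage a working set into a small,
 * guaranteed-fast memory, process it there, then write results back.
 * The 16 KB size comes from the slide; the buffer is an ordinary array
 * here, standing in for the EE's memory-mapped SPRAM. */
#define SPRAM_SIZE (16 * 1024)
static uint8_t spram[SPRAM_SIZE];               /* stand-in for real SPRAM */

void process_with_scratchpad(const uint8_t *src, uint8_t *dst, size_t n)
{
    while (n > 0) {
        size_t chunk = n < SPRAM_SIZE ? n : SPRAM_SIZE;
        memcpy(spram, src, chunk);              /* on the EE: a DMA transfer */
        for (size_t i = 0; i < chunk; ++i)      /* chew on it at full speed  */
            spram[i] = (uint8_t)(spram[i] * 2);
        memcpy(dst, spram, chunk);              /* results back to main RAM  */
        src += chunk; dst += chunk; n -= chunk;
    }
}
```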

11 Vector Units: The Heart of the EE FMAC: floating-point multiply-accumulate. As it turns out, this operation is critical to 3D rendering and is performed many times in tight loops – an obvious candidate for parallelism and pipelining! Between the two VUs and the FPU there are a total of 10 FMAC units, each able to complete 1 FMAC per cycle, alongside other useful instructions.
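Why multiply-accumulate dominates 3D work: transforming one vertex by a 4x4 matrix is sixteen multiply-accumulates, which is exactly what the FMAC units grind through in parallel. A plain scalar C version for reference:

```c
/* Transforming one vertex by a 4x4 matrix = 16 multiply-accumulates.
 * Each inner-loop line is exactly the FMAC pattern: acc += a * b. */
typedef struct { float v[4]; }    vec4;
typedef struct { float m[4][4]; } mat4;

static vec4 transform(const mat4 *m, vec4 p)
{
    vec4 out = { { 0.0f, 0.0f, 0.0f, 0.0f } };
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            out.v[row] += m->m[row][col] * p.v[col];   /* multiply-accumulate */
    return out;
}
```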

12 Example VU “Useful Instructions” FMAC: 1 cycle. Min/Max: 1 cycle. FDIV – a separate functional unit, 1 per VU: floating-point divide, 7 cycles; square root, 7 cycles; inverse square root, 13 cycles.
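A quick cycle-budget example using only the latencies above (ignoring pipelining and dual issue, so it is a rough serial ceiling rather than a measurement): normalizing a 3D vector costs about 19 cycles.

```c
#include <math.h>

/* Rough latency tally for normalizing a 3D vector, using the slide's numbers
 * (FMAC/Min/Max: 1 cycle, inverse square root: 13 cycles).  Pipelining and
 * dual-issue are ignored, so this is a ceiling, not a measurement. */
typedef struct { float x, y, z; } vec3;

static vec3 normalize(vec3 v)
{
    float len2 = v.x * v.x;          /* FMAC    -> 1 cycle             */
    len2 += v.y * v.y;               /* FMAC    -> 1 cycle             */
    len2 += v.z * v.z;               /* FMAC    -> 1 cycle             */
    float inv = 1.0f / sqrtf(len2);  /* "RSQRT" -> 13 cycles on the VU */
    v.x *= inv;                      /* FMAC    -> 1 cycle             */
    v.y *= inv;                      /* FMAC    -> 1 cycle             */
    v.z *= inv;                      /* FMAC    -> 1 cycle             */
    return v;                        /* ~19 cycles of serial latency   */
}
```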

13 Vector Units However, there are differences between the two VUs and how they are utilized. Both are VLIW – they take long instruction words carrying multiple operations. The processing units are split into two “working groups”: Group 1: CPU + FPU + VU0 (“Emotion Synthesis” on the diagram); Group 2: VU1 + GIF (“Geometry Processing” on the diagram).

14 Group 1 Here, the FPU and VU0 act as proper MIPS coprocessors and are linked to the CPU by a private 128-bit-wide bus to avoid crowding the main bus. The FPU is nothing special, just another FPU coprocessor: 1 FMAC unit and 1 FDIV unit, each identical to the VU FMAC/FDIV. VU0 does the real heavy lifting when it comes to the math; the CPU acts more as a traffic director, feeding data as fast as it can to the VU for processing.

15 Group 1 Although group 1 does geometry processing, it is also responsible for more general-purpose calculations, such as enemy AI, game physics, etc. Therefore group 1 has the (more generalized) CPU, whereas group 2 focuses only on geometry (and has only VU1 and the GIF). There is a definite hierarchy of control in group 1 – the CPU controls the FPU and VU0.

16 Group 1 – Vector Unit 0

17 32 x 128-bit FP registers, each holding 4 x 32-bit single-precision FP numbers. 16 x 16-bit integer registers for integer math. Instructions are just standard 32-bit “COP” (coprocessor) instructions. Data is passed from the CPU in 128-bit bundles, which the VIF (VU Interface) “unpacks” into 4 x 32-bit data words. 8 KB each of data and instruction cache.
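Conceptually, the simplest thing the VIF’s “unpack” does is turn one 128-bit bundle into one VU register’s worth of four floats. The real VIF supports many more unpack modes (widening 8- and 16-bit data, for instance); the sketch below covers only this trivial case and uses made-up names:

```c
#include <stdint.h>
#include <string.h>

/* One 128-bit "bundle" from the CPU becomes one VU register's worth of
 * four single-precision floats.  Real VIF unpack modes also widen 8/16-bit
 * source data; this sketch shows only the trivial 32-bit case. */
typedef struct { uint32_t word[4]; } quadword;   /* raw 128-bit bundle  */
typedef struct { float    f[4];    } vu_reg;     /* 4 x 32b FP register */

static vu_reg vif_unpack_v4_32(quadword q)
{
    vu_reg r;
    memcpy(r.f, q.word, sizeof r.f);   /* reinterpret the 4 words as floats */
    return r;
}
```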

18 Group 2 Consists of VU1 and the GIF (Graphics Interface). VU1 acts like a standalone VLIW processor and is not directly controlled by the CPU. Perhaps a better name for VU1 would be the “geometry processor” for the GIF – this is pure data processing, and it has to happen quickly to keep the GIF saturated with graphics to draw out to your TV.

19 Group 2 – Vector Unit 1

20 Same general features as VU0, but with some differences to suit VU1’s role: the addition of an “EFU” (Elementary Functional Unit) – basically one FMAC and one FDIV unit handling the more rudimentary geometry calculations (note the striking resemblance to the FPU from group 1…), and 16 KB each of data and instruction cache, up from 8 KB – since VU1 must handle geometry independently of the CPU, it ends up handling much more data than VU0.

21 Group 2 – Vector Unit 1 Special direct connection between the data cache and the GIF. Why is this special? VU1 can work on a display list in its cache and have it sent over to the GIF by DMA. That is quicker than using the main bus to shuttle data around, less dependent on the CPU, and leaves the main bus free for load instructions.
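The natural way to exploit that path is double buffering: build one display list while the previously finished one streams to the GIF over DMA. A hedged C sketch – the helper functions are stubs standing in for real EE DMA and list-building code, not actual library calls:

```c
/* Double-buffering sketch: build one display-list buffer while the DMA unit
 * streams the previously finished one to the GIF.  The three helpers below
 * are stubs standing in for real EE DMA and list-building code. */
#include <stddef.h>
#include <stdint.h>

#define LIST_WORDS 1024
static uint32_t list_buf[2][LIST_WORDS];          /* two display-list buffers */

static size_t build_display_list(uint32_t *out, size_t cap)
{   /* stub: a real version would write geometry packets into out[] */
    (void)out; (void)cap; return 0;
}
static void dma_kick(const uint32_t *buf, size_t words)
{   /* stub: a real version would program a DMA channel toward the GIF */
    (void)buf; (void)words;
}
static void dma_wait(void)
{   /* stub: a real version would wait until that channel is idle */
}

void render_frames(int frames)
{
    int cur = 0;
    for (int f = 0; f < frames; ++f) {
        size_t n = build_display_list(list_buf[cur], LIST_WORDS);
        dma_wait();                 /* previous buffer fully consumed by GIF?   */
        dma_kick(list_buf[cur], n); /* stream this one while building the next  */
        cur ^= 1;
    }
}
```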

22 Vector Unit Comparison The designers opted for flexibility, and thus the architecture is slightly confusing: VU0 is a coprocessor, VU1 is a VLIW mini-processor. BUT… VU0 can be switched into VLIW mode, where the CPU then communicates with it like VU1 (e.g. it receives 64-bit instruction “bundles” and parses them with the VIF).

23 Vector Unit Instructions We really should treat the VUs as limited processors. Each 64-bit VLIW word breaks down into two 32-bit instructions, an “upper” instruction and a “lower” instruction. The upper/lower distinction is important; the types of work they do are different.
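Mechanically the split is just the two halves of a 64-bit word. Treating “upper” as the high 32 bits is an assumed convention for illustration; the deck does not specify the encoding:

```c
#include <stdint.h>

/* A 64-bit VU instruction word carries two operations.  "Upper" here means
 * the high 32 bits -- an assumed encoding convention for illustration only. */
typedef struct {
    uint32_t upper;   /* SIMD/FMAC-class operation */
    uint32_t lower;   /* load/store, branch, EFU, RNG, etc. */
} vliw_pair;

static vliw_pair split_vliw(uint64_t word)
{
    vliw_pair p;
    p.upper = (uint32_t)(word >> 32);
    p.lower = (uint32_t)(word & 0xFFFFFFFFu);
    return p;
}
```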

24 Vector Unit Instructions Upper instructions: SIMD (Single Instruction, Multiple Data) instructions. Aptly named – these are the “fast” multimedia instructions that do the same operation on lots and lots of data. Logically, these types of instructions are handled by the special VU units: FMAC, FDIV, etc. Note that these instructions ONLY use the “special” units in each VU.

25 Vector Unit Instructions Lower instructions: the non-SIMD type. More “utility” than processing: load/store instructions, jump/branch instructions, random-number generation, and EFU instructions (only in VU1 – remember, 1 FMAC and 1 FDIV). Note that these instructions use units in the VUs that I didn’t mention (the RNG unit, load/store unit, etc.) – the more “mundane” units for the more “mundane” tasks.

26 Flow of Execution So with all of this confusing flexibility, what do we get? Two ways of doing work: (1) Group 1 and Group 2 both render in parallel, each passing its display lists on to the GIF; or (2) Group 1 (CPU, VU0, FPU) prepares instructions and data for VU1 – load/store, branching, etc. – which VU1 then renders and passes on to the GIF.

27 Flow of Execution Method 1 (parallel). Method 2 (serial).

28 DSPs, PS2s and PCs, oh my! Essentially, the PS2 (like a DSP) performs a small set of instructions on a large amount of “uniform” data – exactly the opposite of PCs, which perform a large number of instructions on varying data. Side-effect bonus: good “locality of reference” – instructions on the PS2 don’t jump around as much as on PCs, so there is less chance of cache misses or branch mispredictions.
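The contrast in miniature: a DSP/PS2-style kernel is a tiny instruction footprint swept over a big uniform array, while PC-style code branches on every element. Both loops below are generic illustrations, not taken from any real game or benchmark:

```c
/* "Few instructions, lots of uniform data": the whole kernel is one short
 * loop body applied to every element -- great locality, easy to SIMD. */
static void dsp_style_scale(float *data, int n, float gain)
{
    for (int i = 0; i < n; ++i)
        data[i] *= gain;
}

/* "Lots of instructions, varying data": per-element branching and unrelated
 * work -- the pattern general-purpose CPUs are built to tolerate. */
static float pc_style(const float *data, const int *kind, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (kind[i] == 0)       acc += data[i];
        else if (kind[i] == 1)  acc -= data[i] * 0.5f;
        else                    acc += 1.0f;   /* yet another case */
    }
    return acc;
}
```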

29 DSPs, PS2s and PCs, oh my! Note the design decisions that promote data-intensive computing: wide buses, and private connections between units that move a lot of data; VLIW – each instruction word arrives packaged with multiple operations and the data streams to feed them; large registers and load/store units, with instructions geared towards SIMD (e.g. one 128-bit load moves 4 words of data at once); and a MASSIVE ability to execute the inner-loop instruction (FMAC) in ONE CYCLE – with 10 FMAC units, 10 of them can complete every cycle. Even FDIVs are fast (7 cycles).
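A quick peak-rate tally from the deck’s own numbers: each FMAC is a multiply plus an add, i.e. 2 floating-point operations, so 10 FMAC units give 20 FP ops per cycle at peak (before counting FDIV results). The clock frequency below is a placeholder, since the deck never states it:

```c
#include <stdio.h>

/* Peak-rate tally from the deck's own numbers: 10 FMAC units, each doing a
 * multiply and an add every cycle.  Clock frequency is left as a parameter
 * because the deck never states it. */
int main(void)
{
    int fmac_units    = 10;
    int ops_per_fmac  = 2;                       /* multiply + add             */
    int flops_per_cyc = fmac_units * ops_per_fmac;
    double clock_mhz  = 300.0;                   /* placeholder, not from deck */
    printf("%d FP ops/cycle -> %.1f GFLOPS at %.0f MHz (peak, FMACs only)\n",
           flops_per_cyc, flops_per_cyc * clock_mhz / 1000.0, clock_mhz);
    return 0;
}
```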

30 Conclusion The entire EE design is centered around a specialized purpose: games! It can run generalized apps, but with a penalty. How much of a penalty? Interesting question – perhaps not much, because there is a general-purpose MIPS at the core. Overall, the EE is more similar in design to a DSP: a small, fixed set of instructions applied to large amounts of uniform data.

31 The End & References
http://www.arstechnica.com/reviews/1q00/playstation2/ee-1.html
http://www.arstechnica.com/cpu/2q00/ps2/ps2vspc-1.html
http://www.scea.com/news/press_example.asp?ps2=ps2&ReleaseID=9
http://users.ece.gatech.edu/~scotty/7102/pres/5
http://www.eecg.toronto.edu/~stoodla/processors/Sony/EmotionEngine.html
http://ntsrv2000.educ.ualberta.ca/nethowto/examples/m_ho/ps2eengine.html
http://www.geocities.com/SiliconValley/Bay/6114/cpu2.html

