Presentation is loading. Please wait.

Presentation is loading. Please wait.

George Bain SCEE Technology Group

Similar presentations


Presentation on theme: "George Bain SCEE Technology Group"— Presentation transcript:

1 George Bain SCEE Technology Group
PS2 Programming Optimisations George Bain SCEE Technology Group Good afternoon, my name is George Bain. I’ve been working in the SCEE Technology Group in London for over 5 years and have been programming PS2 a couple of months after the March 1999 announcement of the console. During these past 2.5 years the developer support teams in all 3 territories have been working hard in finding new techniques to push the PS2 technology. Today, I’ll be discussing many optimizations that developers can use to increase performance out of their game titles. March 21-22, 2003 Moscow, Russia

2 Topics Performance Analyser DMA Transfers Vector Units
Graphics Synthesizer EE Core: CPU File loading I won’t be going into the basics, so I hope that many people in the audience have actually developed for PS2. The topics that I’ll be discussing are the PA, DMA transfers, Vector Units, GS, CPU and lastly file loading.

3 Performance Analyser Capture snapshot of 7 frames of bus activity
EE (Core, Bus, Vu0, and Vu1) GIF and GS 7 frames of bus activity Identify bottlenecks! Also used as a Dev Kit They are currently available to developers at SCE Developer Support offices in Tokyo, London and Foster City and they will be made available to developers next year. Many of the optimizations techniques that I will be discussing today were made possible because of the PA. This amazing piece of hardware has allowed developers to see at bus cycle level how their game engines perform. I won’t be going into detail about the PA because another member of the London support team will be presenting a full PA talk tomorrow afternoon. However, the main features of the PA is that it can easily identify bottlenecks and can be used as a devkit. The PA can capture 7 frames of bus activity from the EE and GS that is latter analysed by the PA Software.

4 PS2 Memory CPU 32MB RDRAM 4MB Embedded Graphics Synthesizer 8K Data
8K Frame 16K Instruction 16K Scratchpad 8K Texture 4K Data Vector Unit 0 4K Instruction 16K Data Vector Unit 1 N/A

5 DMA 128bit Main Data BUS running at 150 MHz 32MB of RDRAM
EE RDRAM to Device = 2.4GB/Sec 10 DMA Channels connected to EE devices DMAC controls data transfer to devices Data transferred in 16byte units (QuadWord) Data must be aligned on 128bit boundary

6 DMA Controller EE Memory 32MB GS 4MB SIF IPU cache VIF VIF GIF FPU
128bit Bus cache VIF VIF GIF FPU EE CORE VU0 VU1 Controls data transfers between main memory or SPR to EE devices Handles arbitration between different DMA channels Processes DMA Tags Stall control and MFIFO are available for DMA packets Section 5 of EE User Manual

7 Checking End of DMA Transfer
Main BUS DMA.STR Register polling CPU BC0F Polling

8 Cycle Stealing Cycle Stealing ON or OFF?
Release is time between two DMA slices Allow more time for CPU to access the main bus However it slows down overall DMA transfer Main Bus Activity Release Cycle VIF DMA Slice GIF DMA Slice Cycle Stealing

9 Memory FIFO MFIFO can buffer DMA packets if stall occurs on Drain DMA channel when VU1 or GS becomes the bottleneck Avoid Data Cache and perform memory writes to 16K SPR Scratchpad DMA provides maximum DMA transfer speed to Memory FIFO Reduce main memory consumption

10 GS FIFO What can cause the GS FIFO to become full?
Large primitives such as a full screen sprite Multiple texture passes VIF1 DMA VU1 Run GIF to GS FIFO (GS FIFO full) GS Pixel Engines Busy GS FIFO requests data from GIF

11 Draining MFIFO with VIF1
What can cause the MFIFO to become full? If GS FIFO is full, GIF doesn’t request any data XGKICK instruction will stall VU1 VIF1 stalls on sync related instructions such as MSCNT and FLUSHA SPR MFIFO VIF1 VU1 GIF GS

12 Geometry and Texture Syncing
1.2 GB/Sec Bandwidth to GS PATH1 for Geometry and PATH3 for Textures VU1 VU1 MEM PATH 1 VIF1 FIFO MAIN BUS MAIN RAM PATH 2 PATH 3 GIF FIFO GIF GS

13 Texture Transfer Paths
Advantages Easy to transfer textures and set other GS registers No geometry and texture data sync problems Disadvantages PATH1 will stall if PATH2 is still in progress PATH3 Parallel DMA transfers through VIF1 and GIF channels GIF can operate in 2 different modes when using IMAGE mode Avoids PATH1 stalls when operating GIF in IMT mode Sometimes difficult to synchronize geometry and texture data

14 GIF in Intermittent Mode
What are the benefits? Allows texture transfers via the GIF while VIF1 and VU1 continue to process data What are some things I should consider? IMT Mode is good when loading large texture blocks If GIF is constantly being occupied by PATH1 then texture transfer via PATH3 is reduced Can’t draw and transfer textures at same time! Batch textures together to limit overhead!

15 GIF IMT Mode OFF GIF DMA GIF DMA VIF1 DMA Complete Texture Geometry
VU1 Running VIF1 DMA VU1 Stalling GIF DMA Complete Geometry

16 GIF IMT Mode ON GIF DMA VIF1 DMA Texture VU1 Running No XGKICK Stall
Geometry

17 Packing Texture Data Pack 4-Bit and 8-Bit texture data
32-Bit textures provide maximum transfer speed 4/8-Bit textures must be converted by the GS Consider the transfer speed and block layouts 16 and 32-Bit pixel modes have very similar speeds Size W Size H PATH3 MB/S Format 32-Bit 16-Bit 8-Bit 256 1070 1050 785 4-Bit 380 PATH2 MB/S 1090 1075 800 385

18 VCL Tool Application that simplifies Vu1 Programming
Available for Linux and Windows Generates VSM source code Handles many tasks Dual Pipeline processing Loop unrolling Register allocation Instruction scheduling

19 Vu0 Usage Transferring Data to Vu0 Processing Data with Vu0
Cop2 connection you can transfer 1QW in 2Cycles DMA transfer you can transfer 1QW in 4Cycles Processing Data with Vu0 Vu0 running Micro code Triple Buffer Scratchpad memory Transfer data to Block A Process Block A and Transfer Block B Drain Block A, Process B, Transfer C

20 Geometry Data Transfer
Reduce memory consumption and bandwidth Remember Vector Unit register VF00.w = 1.0 4QW Per Vertex 3QW Per Vertex 1.0f Z Y X A B G R 1.0f 1.0f T S Ny Nx T S A B G R Nz Z Y X 1.0f Nz Ny Nx

21 Compress Geometry Data
use the VIF to convert integer to float use the VU to convert integer to float Vector Unpack Mode X,Y,Z 16 Bit S,T 8 Bit RGBA VU Instruction ITOF0 ITOF12 Nx,Ny,Nz ITOF15 Compress 4 QW to 1.25 QW the example dma_size.zip sample located on SCEE/SCEA websites demonstrates this technique when unpacking V4_8 be sure to set the compression to unsigned (USN=1 in UNPACK VIF code)

22 GS Frame Buffers Total of 4 MB of Embedded DRAM
Draw, Display, Z and Texture Buffers What are some recommended buffer sizes? PAL (512 x 512), NTSC (512 x 448) Progressive scan support with full height buffers 2-Circuits of the GS to reduce interlace flicker alpha blend odd/even fields at no cost

23 GS Capabilities Bandwidth Drawing Speed Massive total of 48 GB/Sec
Frame Buffer 38.4 GB/Sec Texture Buffer 9.6 GB/Sec Drawing Speed 16 Pixel for non-textured (2.4 Gpixels/Sec) 75M Flat shaded Triangles/Sec 8 Pixel for textured (1.2 Gpixels/Sec) 37.5M Textured and Gouraud shaded Triangles/Sec

24 Set-up and Rasterizing
GS Pipeline Emotion Engine Host IF Set-up and Rasterizing Pixel Pipeline x 16 PCRTC Memory IF 48 GB/Sec Frame Buffer Texture Buffer VRAM 4MB Video Out

25 GS Frame/Z Cache Quick Page refills! 8192bits per cycle
8K page buffer refilled in 8 GS cycles 4K 4K Frame 32x32 Z 32x32 The 8k frame/z buffer cache is shared so is just 4k cache for the frame buffer and 4k cache for the z buffer. These page buffers are very quickly filled in horizontal lines. The bandwidth of the embedded GS memory is massive at 150gigabyte a second to fill a page in just 8 cycles. The result is the penalty on page miss is small due to the fast refill

26 Reducing Frame Page Misses
Fill rate is roughly constant if varying height Wide Primitives will cause page misses Use 32 Pixel wide strips to reduce page misses Rarely drop below 1Gpixel/Sec if miss occurs Primitives using textures greater than a page size are usually more of a problem 8Bit texture page is 128x64 Since the cache is filled in horizontal lines the most efficient way to use the frame buffer is to draw tall thin primitives. Wide primitives will cause the page buffer cache to be refilled on every line so if you are drawing large sprites or clearing the screen, divide them in to page width (32pixel) vertical strips. In most case there’s not a great deal you can do about frame/z buffer page misses and the penalties rarely drop fill-rate much below the 1.2gigapixel textured fill-rate anyhow.

27 Texture Fill Rates Texture Page misses have biggest effect
Subdivide large texture co-ordinate ranges Keep mip-maps in the same page Texture reduction reduces the fill rate 32 pixel wide strips won’t increase performance Texel read becomes bottleneck Texture expansion doesn’t affect fill rate

28 Fill Rate VS Triangle Size
500 1000 1500 2x2 4x4 8x8 16x16 32x32 64x64 128x128 256x256 *Texture is on cache without reducing size Fill rate Untextured Textured* You can see from this graph that frame/z page buffer penalties for large primitives don’t tend to drop the fill-rate to below that of textured primitives. In terms of pixel fill-rate the GS is at it’s most optimal drawing 32x32 triangles. Size Untextured Textured 2x 4x 8x 16x 32x 64x 128x 256x

29 Level Of Detail Make better use of LOD! Mip Mapping
5000 polygon model may result in just 50 visible pixels once projected onto the screen there’s also no point having detailed textures that are going to be shrunk so much Mip Mapping Improve visual quality Mip maps in different pages can cause multiple texture cache reloads MIP maps are an obvious solution to avoiding texture reduction and also help visual quality. Ideally you can also use appropriate mip-map levels to cut down the amount of texture transfers for distant objects. It is however possible to incur an extra penalty with mip-maps. A large polygon with a large span of z values will cause many mip-maps to be loaded as the polygon is drawn. This can slow drawing where the mip-maps don’t all fit in a page. Since the cache is filled in horizontal lines the most efficient way to use the frame buffer is to draw tall thin primitives. Wide primitives will cause the page buffer cache to be refilled on every line so if you are drawing large sprites or clearing the screen, divide them in to page width (32pixel) vertical strips. In most case there’s not a great deal you can do about frame/z buffer page misses and the penalties rarely drop fill-rate much below the 1.2gigapixel textured fill-rate anyhow.

30 Multi-Pass Rendering GS Alpha Blend operation is free!
Maximum textured fill rate is 1.2G Pixels/Sec Limit number of passes (4 passes = 300M P/S) Fur rendering Reduce passes when object in distance Bump-mapping is possible Technique requires full screen passes Back face cull to reduce GS stalls

31 GS Fog 1200 1000 800 600 Textured* Fill rate 400 200 Texture*+Fog 2x2
2x2 4x4 8x8 16x16 32x32 64x64 Fog is fairly costly dropping the fill rate by almost 1/2 on larger textured primitives. For larger primitives it would in fact be faster to use a 2nd gouraud with alpha pass to simulate the fog. 128x128 256x256 *Texture is on cache without reducing size

32 Alternative Fog Technique 1 Technique 2
1st pass draw a textured polygon 2nd pass alpha blend gouraud shaded polygon Technique 2 Post-process and perspective correct fogging Move bits 8-15 of Z-Buffer into Alpha of Draw Buffer Alpha blend full screen gouraud shaded polygon onto Draw Buffer

33 CPU Optimisations Emotion Engine Core Instruction Set
FPU (Coprocessor 1) Vu0 (Coprocessor 2) 16K Instruction Cache 8K Data Cache 16K Scratch-Pad Memory Instruction Set 64Bit MIPS III and some MIPS IV 128Bit Multi-Media

34 Multi-Media Instructions
128-Bit Multi-Media Instructions Parallel Processing 64 bits x2, 32 bits x4, 16 bits x8, 8 bits x16 Image format conversions Sound decompressing Pack DMA packets Convert PACKED mode to REGLIST mode Smaller data, faster DMA transfers!

35 Use of Data Cache Data Suitable for the Data Cache
Data that is frequently read or written repeatedly Data with a high degree of locality Don’t use Data Cache for Data that gets used only once Big chunks of data larger than 8K

36 Reduce Cache Misses Prefetch instruction to load data beforehand
Reduce the size of your code for I$ Use Uncached memory for data r/w only once Performance Counter Lib to measure misses

37 Scratchpad Memory 16K of high-speed memory (access directly)
2 dedicated DMA Channels (toSPR/fromSPR) SPR DMA provides best throughput 100% Occupy and 85% Send Data Suitable for the SPR Frequently used data where speed is a priority Big chunks of data can be Double Buffered on SPR memory CPU will not be granted access to the main bus when SPR DMA transfer is in progress CPU can work on data in Data Cache and SPR use SPR DMA instead of memcpy()

38 CD/DVD Optimisations Align destination buffer on 64 Bytes
Increase performance by 25%! Combine files into a PAK file to reduce files Avoid seeking when you could be reading Load the most data you can per read Combine IOP modules and load into EE Specs: Head Pos Theoretical Speed CD inner track 10x 1500Kb/sec CD outer track 24x 3600Kb/sec DVD inner track 1.6x 2000Kb/sec DVD outer track 4x 5000Kb/sec but..... We usually say CD outer 20x 3000Kb due to read errors, so drop all the numbers down as this is max they will do and if they plump for a constant speed across the disc then DVDs are around Kb/sec and CD Kb/sec Optimisations: Memory must be aligned to 64bytes, can speed up transfer by upto 25%. Combine files into a PAK file. This improves speed if you are loading in a number of files. This is a suggestion for improving IRX module loading etc. this can speed up your loading by about 80%. Sometimes, multiple PAK files are put on the disc, these will have repeating data in certain sections, this idea means that the head movement is kept to a minimum. When using sceopen(); or other PC style loading commands make sure that you are loading in multiples of 512 as performance can be halved.

39 Summary PA will push developers to the limit!
Parallel Texture and Geometry Transfer DMA is flexible and very powerful! Take into consideration GS page sizes Vector Unit 0 and Scratchpad memory Check assembler output of generated code

40 Contact Information Website for Licensed Developers SCEE DevStation 2003


Download ppt "George Bain SCEE Technology Group"

Similar presentations


Ads by Google