George Bain SCEE Technology Group

Slides:

Advertisements

Similar presentations

Memory Interleaving.

Advertisements

COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.

Multi-Level Caches Vittorio Zaccaria. Preview What you have seen: Data organization, Associativity, Cache size Policies -- how to manage the data once.

Lecture 12 Reduce Miss Penalty and Hit Time

CS 352: Computer Graphics Chapter 7: The Rendering Pipeline.

RealityEngine Graphics Kurt Akeley Silicon Graphics Computer Systems.

G30™ A 3D graphics accelerator for mobile devices Petri Nordlund CTO, Bitboys Oy.

Week 7 - Monday.  What did we talk about last time?  Specular shading  Aliasing and antialiasing.

Higher Computing: Unit 1: Topic 3 – Computer Performance St Andrew’s High School, Computing Department Higher Computing Topic 3 Computer Performance.

Chris Foster Brian Moore Scott Thibaudeau Overview I/O EE – Emotion Engine!! Graphics Synthesizer Comparison.

Computer Graphics Hardware Acceleration for Embedded Level Systems Brian Murray

X86 and 3D graphics. Quick Intro to 3D Graphics Glossary: –Vertex – point in 3D space –Triangle – 3 connected vertices –Object – list of triangles that.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.

Double buffer SDRAM Memory Controller Presented by: Yael Dresner Andre Steiner Instructed by: Michael Levilov Project Number: D0713.

Video on DSP and FPGA John Johansson April 12, 2004.

Chapter 7 Interupts DMA Channels Context Switching.

1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Nov 9, 2005 Topic: Caches (contd.)

architectural overview

PlayStation 2 Architecture Irin Jose Farid Momin Quy Ngo Olivia Wong.

University of Texas at Austin CS 378 – Game Technology Don Fussell CS 378: Computer Game Technology Beyond Meshes Spring 2012.

Inside The CPU. Buses There are 3 Types of Buses There are 3 Types of Buses Address bus Address bus –between CPU and Main Memory –Carries address of where.

COOL Chips IV A High Performance 3D Graphics Rasterizer with Effective Memory Structure Woo-Chan Park, Kil-Whan Lee*, Seung-Gi Lee, Moon-Hee Choi, Won-Jong.

* Definition of -RAM (random access memory) :- -RAM is the place in a computer where the operating system, application programs & data in current use.

Emotion Engine A look at the microprocessor at the center of the PlayStation2 gaming console Charles Aldrich.

Input/Output. Input/Output Problems Wide variety of peripherals —Delivering different amounts of data —At different speeds —In different formats All slower.

© Copyright Khronos Group, Page 1 Harnessing the Horsepower of OpenGL ES Hardware Acceleration Rob Simpson, Bitboys Oy.

David Carter SCEE Technology Group

© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Principles of I/0 hardware.

1-1 Embedded Network Interface (ENI) API Concepts Shared RAM vs. FIFO modes ENI API’s.

CS 450: COMPUTER GRAPHICS REVIEW: INTRODUCTION TO COMPUTER GRAPHICS – PART 2 SPRING 2015 DR. MICHAEL J. REALE.

Copyright © 2013, SAS Institute Inc. All rights reserved. MEMORY CACHE – PERFORMANCE CONSIDERATIONS CLAIRE CATES DISTINGUISHED DEVELOPER

Output Devices. Printers Factors affecting choice Volume of output High volume require fast, heavy-duty printer Quality of print required Location of.

Main Memory CS448.

I/O management is a major component of operating system design and operation Important aspect of computer operation I/O devices vary greatly Various methods.

Computer Graphics The Rendering Pipeline - Review CO2409 Computer Graphics Week 15.

Advanced programming for 3D Graphics Applications Console developement.

Xbox MB system memory IBM 3-way symmetric core processor ATI GPU with embedded EDRAM 12x DVD Optional Hard disk.

Caches Where is a block placed in a cache? –Three possible answers  three different types AnywhereFully associativeOnly into one block Direct mappedInto.

Advanced Computer Graphics Spring 2014 K. H. Ko School of Mechatronics Gwangju Institute of Science and Technology.

1)Leverage raw computational power of GPU  Magnitude performance gains possible.

Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.

Playstation2 Architecture Architecture Hardware Design.

Maths & Technologies for Games Graphics Optimisation - Batching CO3303 Week 5.

Basic Elements of Processor ALU Registers Internal data pahs External data paths Control Unit.

Processor Memory Processor-memory bus I/O Device Bus Adapter I/O Device I/O Device Bus Adapter I/O Device I/O Device Expansion bus I/O Bus.

What are shaders? In the field of computer graphics, a shader is a computer program that runs on the graphics processing unit(GPU) and is used to do shading.

Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.

The Graphics Pipeline Revisited Real Time Rendering Instructor: David Luebke.

Chapter 3 System Buses.  Hardwired systems are inflexible  General purpose hardware can do different tasks, given correct control signals  Instead.

Lecture Overview Shift Register Buffering Direct Memory Access.

© Copyright 3Dlabs, Page 1 - PROPRIETARY & CONFIDENTIAL Virtual Textures Texture Management in Silicon Chris Hall Director, Product Marketing 3Dlabs.

Types of RAM (Random Access Memory) Information Technology.

Memory Hierarchy— Five Ways to Reduce Miss Penalty.

Image Fusion In Real-time, on a PC. Goals Interactive display of volume data in 3D –Allow more than one data set –Allow fusion of different modalities.

Buffering Techniques Greg Stitt ECE Department University of Florida.

1 load [2], [9] Transfer contents of memory location 9 to memory location 2. Illegal instruction.

COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE

Types of RAM (Random Access Memory)

Sarah Ewen & Lionel Lemarié SCEE Technology Group

The Graphics Rendering Pipeline

Lecture 17: Case Studies Topics: case studies for virtual memory and cache hierarchies (Sections )

Direct Memory Access Disk and Network transfers: awkward timing:

Graphics Hardware: Specialty Memories, Simple Framebuffers

Lecture 13 Clipping & Scan Conversion

Direct Rambus DRAM (aka SyncLink DRAM)

Presentation transcript:

George Bain SCEE Technology Group PS2 Programming Optimisations George Bain SCEE Technology Group Good afternoon, my name is George Bain. I’ve been working in the SCEE Technology Group in London for over 5 years and have been programming PS2 a couple of months after the March 1999 announcement of the console. During these past 2.5 years the developer support teams in all 3 territories have been working hard in finding new techniques to push the PS2 technology. Today, I’ll be discussing many optimizations that developers can use to increase performance out of their game titles. March 21-22, 2003 Moscow, Russia

Topics Performance Analyser DMA Transfers Vector Units Graphics Synthesizer EE Core: CPU File loading I won’t be going into the basics, so I hope that many people in the audience have actually developed for PS2. The topics that I’ll be discussing are the PA, DMA transfers, Vector Units, GS, CPU and lastly file loading.

Performance Analyser Capture snapshot of 7 frames of bus activity EE (Core, Bus, Vu0, and Vu1) GIF and GS 7 frames of bus activity Identify bottlenecks! Also used as a Dev Kit They are currently available to developers at SCE Developer Support offices in Tokyo, London and Foster City and they will be made available to developers next year. Many of the optimizations techniques that I will be discussing today were made possible because of the PA. This amazing piece of hardware has allowed developers to see at bus cycle level how their game engines perform. I won’t be going into detail about the PA because another member of the London support team will be presenting a full PA talk tomorrow afternoon. However, the main features of the PA is that it can easily identify bottlenecks and can be used as a devkit. The PA can capture 7 frames of bus activity from the EE and GS that is latter analysed by the PA Software.

PS2 Memory CPU 32MB RDRAM 4MB Embedded Graphics Synthesizer 8K Data 8K Frame 16K Instruction 16K Scratchpad 8K Texture 4K Data Vector Unit 0 4K Instruction 16K Data Vector Unit 1 N/A

DMA 128bit Main Data BUS running at 150 MHz 32MB of RDRAM EE RDRAM to Device = 2.4GB/Sec 10 DMA Channels connected to EE devices DMAC controls data transfer to devices Data transferred in 16byte units (QuadWord) Data must be aligned on 128bit boundary

DMA Controller EE Memory 32MB GS 4MB SIF IPU cache VIF VIF GIF FPU 128bit Bus cache VIF VIF GIF FPU EE CORE VU0 VU1 Controls data transfers between main memory or SPR to EE devices Handles arbitration between different DMA channels Processes DMA Tags Stall control and MFIFO are available for DMA packets Section 5 of EE User Manual

Checking End of DMA Transfer Main BUS DMA.STR Register polling CPU BC0F Polling

Cycle Stealing Cycle Stealing ON or OFF? Release is time between two DMA slices Allow more time for CPU to access the main bus However it slows down overall DMA transfer Main Bus Activity Release Cycle VIF DMA Slice GIF DMA Slice Cycle Stealing

Memory FIFO MFIFO can buffer DMA packets if stall occurs on Drain DMA channel when VU1 or GS becomes the bottleneck Avoid Data Cache and perform memory writes to 16K SPR Scratchpad DMA provides maximum DMA transfer speed to Memory FIFO Reduce main memory consumption

GS FIFO What can cause the GS FIFO to become full? Large primitives such as a full screen sprite Multiple texture passes VIF1 DMA VU1 Run GIF to GS FIFO (GS FIFO full) GS Pixel Engines Busy GS FIFO requests data from GIF

Draining MFIFO with VIF1 What can cause the MFIFO to become full? If GS FIFO is full, GIF doesn’t request any data XGKICK instruction will stall VU1 VIF1 stalls on sync related instructions such as MSCNT and FLUSHA SPR MFIFO VIF1 VU1 GIF GS

Geometry and Texture Syncing 1.2 GB/Sec Bandwidth to GS PATH1 for Geometry and PATH3 for Textures VU1 VU1 MEM PATH 1 VIF1 FIFO MAIN BUS MAIN RAM PATH 2 PATH 3 GIF FIFO GIF GS

Texture Transfer Paths Advantages Easy to transfer textures and set other GS registers No geometry and texture data sync problems Disadvantages PATH1 will stall if PATH2 is still in progress PATH3 Parallel DMA transfers through VIF1 and GIF channels GIF can operate in 2 different modes when using IMAGE mode Avoids PATH1 stalls when operating GIF in IMT mode Sometimes difficult to synchronize geometry and texture data

GIF in Intermittent Mode What are the benefits? Allows texture transfers via the GIF while VIF1 and VU1 continue to process data What are some things I should consider? IMT Mode is good when loading large texture blocks If GIF is constantly being occupied by PATH1 then texture transfer via PATH3 is reduced Can’t draw and transfer textures at same time! Batch textures together to limit overhead!

GIF IMT Mode OFF GIF DMA GIF DMA VIF1 DMA Complete Texture Geometry VU1 Running VIF1 DMA VU1 Stalling GIF DMA Complete Geometry

GIF IMT Mode ON GIF DMA VIF1 DMA Texture VU1 Running No XGKICK Stall Geometry

Packing Texture Data Pack 4-Bit and 8-Bit texture data 32-Bit textures provide maximum transfer speed 4/8-Bit textures must be converted by the GS Consider the transfer speed and block layouts 16 and 32-Bit pixel modes have very similar speeds Size W Size H PATH3 MB/S Format 32-Bit 16-Bit 8-Bit 256 1070 1050 785 4-Bit 380 PATH2 MB/S 1090 1075 800 385

VCL Tool Application that simplifies Vu1 Programming Available for Linux and Windows Generates VSM source code Handles many tasks Dual Pipeline processing Loop unrolling Register allocation Instruction scheduling

Vu0 Usage Transferring Data to Vu0 Processing Data with Vu0 Cop2 connection you can transfer 1QW in 2Cycles DMA transfer you can transfer 1QW in 4Cycles Processing Data with Vu0 Vu0 running Micro code Triple Buffer Scratchpad memory Transfer data to Block A Process Block A and Transfer Block B Drain Block A, Process B, Transfer C

Geometry Data Transfer Reduce memory consumption and bandwidth Remember Vector Unit register VF00.w = 1.0 4QW Per Vertex 3QW Per Vertex 1.0f Z Y X A B G R 1.0f 1.0f T S Ny Nx T S A B G R Nz Z Y X 1.0f Nz Ny Nx

Compress Geometry Data use the VIF to convert integer to float use the VU to convert integer to float Vector Unpack Mode X,Y,Z 16 Bit S,T 8 Bit RGBA VU Instruction ITOF0 ITOF12 Nx,Ny,Nz ITOF15 Compress 4 QW to 1.25 QW the example dma_size.zip sample located on SCEE/SCEA websites demonstrates this technique when unpacking V4_8 be sure to set the compression to unsigned (USN=1 in UNPACK VIF code)

GS Frame Buffers Total of 4 MB of Embedded DRAM Draw, Display, Z and Texture Buffers What are some recommended buffer sizes? PAL (512 x 512), NTSC (512 x 448) Progressive scan support with full height buffers 2-Circuits of the GS to reduce interlace flicker alpha blend odd/even fields at no cost

GS Capabilities Bandwidth Drawing Speed Massive total of 48 GB/Sec Frame Buffer 38.4 GB/Sec Texture Buffer 9.6 GB/Sec Drawing Speed 16 Pixel for non-textured (2.4 Gpixels/Sec) 75M Flat shaded Triangles/Sec 8 Pixel for textured (1.2 Gpixels/Sec) 37.5M Textured and Gouraud shaded Triangles/Sec

Set-up and Rasterizing GS Pipeline Emotion Engine Host IF Set-up and Rasterizing Pixel Pipeline x 16 PCRTC Memory IF 48 GB/Sec Frame Buffer Texture Buffer VRAM 4MB Video Out

GS Frame/Z Cache Quick Page refills! 8192bits per cycle 8K page buffer refilled in 8 GS cycles 4K 4K Frame 32x32 Z 32x32 The 8k frame/z buffer cache is shared so is just 4k cache for the frame buffer and 4k cache for the z buffer. These page buffers are very quickly filled in horizontal lines. The bandwidth of the embedded GS memory is massive at 150gigabyte a second to fill a page in just 8 cycles. The result is the penalty on page miss is small due to the fast refill

Reducing Frame Page Misses Fill rate is roughly constant if varying height Wide Primitives will cause page misses Use 32 Pixel wide strips to reduce page misses Rarely drop below 1Gpixel/Sec if miss occurs Primitives using textures greater than a page size are usually more of a problem 8Bit texture page is 128x64 Since the cache is filled in horizontal lines the most efficient way to use the frame buffer is to draw tall thin primitives. Wide primitives will cause the page buffer cache to be refilled on every line so if you are drawing large sprites or clearing the screen, divide them in to page width (32pixel) vertical strips. In most case there’s not a great deal you can do about frame/z buffer page misses and the penalties rarely drop fill-rate much below the 1.2gigapixel textured fill-rate anyhow.

Texture Fill Rates Texture Page misses have biggest effect Subdivide large texture co-ordinate ranges Keep mip-maps in the same page Texture reduction reduces the fill rate 32 pixel wide strips won’t increase performance Texel read becomes bottleneck Texture expansion doesn’t affect fill rate

Fill Rate VS Triangle Size 500 1000 1500 2x2 4x4 8x8 16x16 32x32 64x64 128x128 256x256 *Texture is on cache without reducing size Fill rate Untextured Textured* You can see from this graph that frame/z page buffer penalties for large primitives don’t tend to drop the fill-rate to below that of textured primitives. In terms of pixel fill-rate the GS is at it’s most optimal drawing 32x32 triangles. Size Untextured Textured 2x2 147 74 4x4 589 299 8x8 900 627 16x16 1137 919 32x32 1295 967 64x64 1130 923 128x128 960 888 256x256 940 922

Level Of Detail Make better use of LOD! Mip Mapping 5000 polygon model may result in just 50 visible pixels once projected onto the screen there’s also no point having detailed textures that are going to be shrunk so much Mip Mapping Improve visual quality Mip maps in different pages can cause multiple texture cache reloads MIP maps are an obvious solution to avoiding texture reduction and also help visual quality. Ideally you can also use appropriate mip-map levels to cut down the amount of texture transfers for distant objects. It is however possible to incur an extra penalty with mip-maps. A large polygon with a large span of z values will cause many mip-maps to be loaded as the polygon is drawn. This can slow drawing where the mip-maps don’t all fit in a page. Since the cache is filled in horizontal lines the most efficient way to use the frame buffer is to draw tall thin primitives. Wide primitives will cause the page buffer cache to be refilled on every line so if you are drawing large sprites or clearing the screen, divide them in to page width (32pixel) vertical strips. In most case there’s not a great deal you can do about frame/z buffer page misses and the penalties rarely drop fill-rate much below the 1.2gigapixel textured fill-rate anyhow.

Multi-Pass Rendering GS Alpha Blend operation is free! Maximum textured fill rate is 1.2G Pixels/Sec Limit number of passes (4 passes = 300M P/S) Fur rendering Reduce passes when object in distance Bump-mapping is possible Technique requires full screen passes Back face cull to reduce GS stalls

GS Fog 1200 1000 800 600 Textured* Fill rate 400 200 Texture*+Fog 2x2 2x2 4x4 8x8 16x16 32x32 64x64 Fog is fairly costly dropping the fill rate by almost 1/2 on larger textured primitives. For larger primitives it would in fact be faster to use a 2nd gouraud with alpha pass to simulate the fog. 128x128 256x256 *Texture is on cache without reducing size

Alternative Fog Technique 1 Technique 2 1st pass draw a textured polygon 2nd pass alpha blend gouraud shaded polygon Technique 2 Post-process and perspective correct fogging Move bits 8-15 of Z-Buffer into Alpha of Draw Buffer Alpha blend full screen gouraud shaded polygon onto Draw Buffer

CPU Optimisations Emotion Engine Core Instruction Set FPU (Coprocessor 1) Vu0 (Coprocessor 2) 16K Instruction Cache 8K Data Cache 16K Scratch-Pad Memory Instruction Set 64Bit MIPS III and some MIPS IV 128Bit Multi-Media

Multi-Media Instructions 128-Bit Multi-Media Instructions Parallel Processing 64 bits x2, 32 bits x4, 16 bits x8, 8 bits x16 Image format conversions Sound decompressing Pack DMA packets Convert PACKED mode to REGLIST mode Smaller data, faster DMA transfers!

Use of Data Cache Data Suitable for the Data Cache Data that is frequently read or written repeatedly Data with a high degree of locality Don’t use Data Cache for Data that gets used only once Big chunks of data larger than 8K

Reduce Cache Misses Prefetch instruction to load data beforehand Reduce the size of your code for I$ Use Uncached memory for data r/w only once Performance Counter Lib to measure misses

Scratchpad Memory 16K of high-speed memory (access directly) 2 dedicated DMA Channels (toSPR/fromSPR) SPR DMA provides best throughput 100% Occupy and 85% Send Data Suitable for the SPR Frequently used data where speed is a priority Big chunks of data can be Double Buffered on SPR memory CPU will not be granted access to the main bus when SPR DMA transfer is in progress CPU can work on data in Data Cache and SPR use SPR DMA instead of memcpy()

CD/DVD Optimisations Align destination buffer on 64 Bytes Increase performance by 25%! Combine files into a PAK file to reduce files Avoid seeking when you could be reading Load the most data you can per read Combine IOP modules and load into EE Specs: Head Pos Theoretical Speed CD inner track 10x 1500Kb/sec CD outer track 24x 3600Kb/sec DVD inner track 1.6x 2000Kb/sec DVD outer track 4x 5000Kb/sec but..... We usually say CD outer 20x 3000Kb due to read errors, so drop all the numbers down as this is max they will do and if they plump for a constant speed across the disc then DVDs are around 2000- 2400Kb/sec and CD 1500-1900Kb/sec Optimisations: Memory must be aligned to 64bytes, can speed up transfer by upto 25%. Combine files into a PAK file. This improves speed if you are loading in a number of files. This is a suggestion for improving IRX module loading etc. this can speed up your loading by about 80%. Sometimes, multiple PAK files are put on the disc, these will have repeating data in certain sections, this idea means that the head movement is kept to a minimum. When using sceopen(); or other PC style loading commands make sure that you are loading in multiples of 512 as performance can be halved.

Summary PA will push developers to the limit! Parallel Texture and Geometry Transfer DMA is flexible and very powerful! Take into consideration GS page sizes Vector Unit 0 and Scratchpad memory Check assembler output of generated code

Contact Information george_bain@scee.net Website for Licensed Developers www.ps2-pro.com SCEE DevStation 2003 www.devstation.scee.com