Optimizing Ray Tracing on the Cell Microprocessor David Oguns
Agenda Goals and Motivation What is SIMD? Cell Architectural Overview Implementation & Challenges Demo Performance Results Future Improvements Questions?
Goals and Motivation <3 Performance Cell is a radically new architecture Not just multicore programming Get my hands dirty with SIMD Learning
Single Instruction Multiple Data Normal processors execute a single instruction for a single result. (Single Instruction Single Data) SIMD processors execute a single instruction across multiple data for multiple results. SIMD processing sometimes called vector/stream/dsp processors Useful in graphics acceleration, signal processing, and simulation applications.
SISD vs SIMD Approach A = 1 + 2; B = 3 + 4; C = 5 + 6; D = 7 + 8; V = { 1, 3, 5, 7} + { 2, 4, 6, 8};
SIMD Continued... Desktop CPUs usually have SIMD units MMX, SSE, 3DNow!, AltiVec Hardware likely to make heavy use of SIMD GPUs Ageia PhysX Super computers Signal processors in multimedia devices or sensors Potential to be order(s) of magnitude faster in pure math...
It's not all peachy Previous example was construed to show data and instruction parallelism. Sum of 1+2 was not dependent on 2+3 or vice versa. SIMDizing A + B + C + D is less efficient SIMD can’t help with A + B + C. Data must be aligned properly in vectors.
Cell Overview Asymmetric multicore processor designed by STI group(Sony/Toshiba/IBM) for high performance computing applications. 1 Power Processing Element (PPE) Much like Intel/AMD desktop CPUs but cheaper. 8 Synergistic Processing Elements (SPE) SIMD processors Element Interconnect Bus (EIB) 4 ring 16b wide unidirectional bus Connects PPE, SPEs, main memory, FlexIO. Many ways to accomplish inter-element communication
PPE 64-bit 3.2Ghz Dual threaded 512KB L2 cache In order execution Transparent access to main memory SIMD unit called AltiVec
SPE SIMD core clocked at 3.2GHz Even pipeline for most execution instructions Odd pipeline for load/store/DMA/branch hints 128 x 128b register file 256KB Local Store Very very low latency (7 cycles) Memory Flow Controller No direct access to main memory! No branch prediction hardware
EIB Arbitrates all communication between elements in the Cell. Runs at half the clock speed of the PPE and SPEs ~300Gb/s theoretical bandwidth 200Gb/s observed Main memory is connected to EIB – not the PPE. FlexIO
Intercore Communication DMA (Direct Memory Access) Like memcpy on crack. Used primary to keep data moving Latency ~ hundreds of cycles Non blocking calls Mailboxes Short messages to and from SPEs Signals and Interrupts
Implementation Can use stock ray tracer and simply run on PPE Lets make it run faster! Multithreading (partitioning, synchronization) Pthreads, libspe2 SIMDzing (both on PPE AltiVec and SPUs) IBM Toolchain Language intrinsics General C multiplatform development
Multithreading or Partitioning? How do we divide up work between N spes? Goals are to load balance and minimize synchronization Data driven design
Data Partitioning Each SPE is given basic information Scene address, frame buffer address, ray buffer address, number of pixels to process, sqrt(samples per pixel), numSpes, depth Entire scene copied over to LS via DMA. Very small. SPE(N) processes every Nth pixel Load balances very well No synchronization with PPE necessary*
First Approach PPE Setup scene Generate primary rays in ray buffers Initialize work loads for N SPEs Launch each SPE thread Wait for SPEs to finish running. When SPE are done, it means frame buffer is ready SPE Receive workload Use multibuffered DMA to transfer and process primary rays Output pixels using DMA list to main memory. Terminate when DMA for outgoing pixels is complete.
First Approach = FAIL DMA is hard to use for streaming data... Must implement multibuffered solution for streaming rays Outputting scattered pixels is a pain Generating all primary rays at once uses a lot of memory. Ray size: 48bytes (36bytes->48bytes with padding) Pixel size: 4bytes. 3MB frame buffer -> 36MB ray buffer (1024x768) Super sampling makes it even worse! 3MB frame buffer -> 144MB ray buffer at 4x super sampling Removed ray buffer from normal ray tracer as well.
Second Approach PPE Setup scene Extra view plane information in scene Initialize work loads for N spes Launch each SPE thread Poll each SPE's outbound mailbox for outgoing pixels. Write pixel to frame buffer. Done when last pixel received. SPE Receive info about workload Generate and process primary rays on the fly Output pixel using mailbox. (blocking call)
SIMDizing Implemented my own vector library Modified using conditional compilation Later removed entire calls to my own functions IBM Toolchain allows for some clean syntax. Some of the compiler intrinsics work for both PPE AltiVec and SPEs.
General C Multiplatform Development Loading scene was identical (PPE) Re-wrote start up code for PPE and SPE Ray tracing / shading algorithms Identical on SPE C math not available. SIMD Math instead.
Challenges First time doing something like this. IBM SDK does not install easily. Using IBM SDK libraries was a pain. Indirect nesting in data structures Keeping data structures the same size on PPE and SPE. PPE pointer: 64bits; SPE pointer: 32bits Aligning memory...
This explains most of it... “Another common error in SPU programs is a DMA that specifies an invalid combination of Local Store address, effective address, and transfer size. The alignment rules for DMAs specify that transfers for less than 16 bytes must be "naturally aligned," meaning that the address must be divisible by the size. Transfers of 16 bytes or more must be 16-byte aligned. The size can have a value of 1, 2, 4, 8, 16, or a multiple of 16 bytes to a maximum of 16KB. In addition, the low-order four bits of the Local Store address must match the low-order four bits of effective address (in other words, they must have the same alignment within a quadword). Any DMA that violates one of these rules will generate an alignment exception which is presented to the user as a bus error.”
Demo (sort of...)
Performance Results Recursion depth of 4 PS3 Cell implies use of 6 SPEs No PPE multithreading Pentium 4 trials used no hyperthreading.
Performance Results
Test run at 1600x1200x4 Run Time versus Number of SPEs
Future Improvements Partition data differently Eliminate all synchronization More efficient use of SIMD Colors were essentially vectors. Eliminate branches, add branch hints Learn to use IBM's profiling tools Real time rendering is possible!
Questions?