Comparison of Cell and POWER5 Architectures for a Flocking Algorithm
A Performance and Usability Study
CS267 Final Project
Jonathan Ellithorpe, Mark Howison, Daniel Killebrew
Agent-Based Model of Flocking
Flocking agents follow simple rules:
- Don't crowd other agents.
- Align your velocity with your neighbors' average velocity.
- Move toward the center of gravity of your neighbors.
- Move stochastically.
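The four rules above can be sketched as a single acceleration update per agent. This is a minimal illustration, not the project's interact_fish() code: the function name `flock_acceleration`, the dict-based agent layout, the crowding radius, and all rule weights are assumptions chosen for the example.

```python
import math
import random


def flock_acceleration(agent, neighbors, rng, crowd_radius=1.0,
                       w_sep=1.0, w_align=0.5, w_cohere=0.3, w_noise=0.1):
    """Return (ax, ay) for one agent; agents are dicts with x, y, vx, vy.

    All weights and the crowding radius are hypothetical values.
    """
    # Rule 4: move stochastically (applies with or without neighbors).
    ax = w_noise * rng.uniform(-1.0, 1.0)
    ay = w_noise * rng.uniform(-1.0, 1.0)
    if not neighbors:
        return ax, ay

    n = len(neighbors)

    # Rule 1: don't crowd. Push away from neighbors inside crowd_radius.
    for other in neighbors:
        dx = agent["x"] - other["x"]
        dy = agent["y"] - other["y"]
        dist = math.hypot(dx, dy)
        if 0.0 < dist < crowd_radius:
            ax += w_sep * dx / dist
            ay += w_sep * dy / dist

    # Rule 2: align with the neighbors' average velocity.
    avg_vx = sum(o["vx"] for o in neighbors) / n
    avg_vy = sum(o["vy"] for o in neighbors) / n
    ax += w_align * (avg_vx - agent["vx"])
    ay += w_align * (avg_vy - agent["vy"])

    # Rule 3: move toward the neighbors' center of gravity.
    cx = sum(o["x"] for o in neighbors) / n
    cy = sum(o["y"] for o in neighbors) / n
    ax += w_cohere * (cx - agent["x"])
    ay += w_cohere * (cy - agent["y"])
    return ax, ay
```

Passing the random generator in explicitly keeps the stochastic term reproducible, which makes serial and parallel runs easier to compare.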
Serial Implementation: The Grid
- Spatial decomposition into a moving grid that follows the agents’ center of gravity
- Performs better than the naïve implementation during flock formation
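One way to picture the moving grid is to re-center a fixed-size grid on the agents' center of gravity each step and bin agents into cells, so that interactions only need to check nearby cells. The sketch below assumes a square grid, clamps stragglers into border cells, and uses a 3x3 cell neighborhood; the names `build_moving_grid` and `neighbor_candidates` and those choices are illustrative, not taken from the project.

```python
from collections import defaultdict


def build_moving_grid(positions, cell_size, grid_dim):
    """Bin agents into a grid_dim x grid_dim grid centered on their center of gravity.

    positions is a list of (x, y). Returns the grid origin and a mapping
    (row, col) -> list of agent indices.
    """
    n = len(positions)
    cx = sum(x for x, _ in positions) / n
    cy = sum(y for _, y in positions) / n

    # Place the grid so that its center sits on the center of gravity.
    half = grid_dim * cell_size / 2.0
    origin_x, origin_y = cx - half, cy - half

    cells = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        col = int((x - origin_x) // cell_size)
        row = int((y - origin_y) // cell_size)
        # Assumption: agents outside the grid are clamped into border cells.
        col = min(max(col, 0), grid_dim - 1)
        row = min(max(row, 0), grid_dim - 1)
        cells[(row, col)].append(idx)
    return (origin_x, origin_y), cells


def neighbor_candidates(cells, row, col):
    """Agent indices in the 3x3 block of cells around (row, col)."""
    found = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            found.extend(cells.get((row + dr, col + dc), []))
    return found
```

Compared with checking every pair of agents, only the agents returned by `neighbor_candidates` need a distance test, provided the cell size is at least the interaction radius.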
OpenMP on POWER5: Spatial Layout
- The parallel for construct divides the grid among threads by rows by default
- A Hilbert curve provides better load-balancing
- Figure: Hilbert curve layouts for 8x8, 16x16, and 32x32 grid sizes.
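A Hilbert-curve layout can be sketched by converting a position along the curve into grid coordinates and then handing each thread a contiguous stretch of the curve. The conversion below is the standard iterative distance-to-coordinate mapping for power-of-two grids; splitting the curve into equal-length chunks per thread (`assign_cells`) is an assumption for illustration and may differ from how the project distributed cells.

```python
def hilbert_d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two (for example 8, 16, or 32).
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def assign_cells(n, num_threads):
    """Split the Hilbert ordering of an n x n grid into contiguous chunks."""
    order = [hilbert_d2xy(n, d) for d in range(n * n)]
    base, extra = divmod(len(order), num_threads)
    chunks = []
    start = 0
    for t in range(num_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(order[start:start + size])
        start += size
    return chunks
```

Because consecutive cells on the curve are always adjacent in the grid, each thread's chunk forms a compact region rather than a long thin row.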
OpenMP on POWER5: Performance
OpenMP on POWER5: Profile
QuadTree
- Two-dimensional dynamic spatial decomposition
- When a square reaches capacity, split it up
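The split-on-capacity rule can be sketched with a small point quadtree. The class name `Quad`, the default capacity of 4, the choice to split once the count exceeds capacity, and the `min_size` guard against endless splitting of coincident points are all assumptions for this example.

```python
class Quad:
    """A square region that splits into four children when it overfills."""

    def __init__(self, x, y, size, capacity=4, min_size=1e-6):
        self.x, self.y, self.size = x, y, size  # lower-left corner, side length
        self.capacity = capacity
        self.min_size = min_size
        self.points = []
        self.children = None  # None while this quad is a leaf

    def insert(self, px, py):
        if self.children is not None:
            self._child_for(px, py).insert(px, py)
            return
        self.points.append((px, py))
        # Split once the square holds more than `capacity` occupants,
        # unless it is already too small to subdivide further.
        if len(self.points) > self.capacity and self.size > self.min_size:
            self._split()

    def _child_for(self, px, py):
        half = self.size / 2.0
        ix = 1 if px >= self.x + half else 0
        iy = 1 if py >= self.y + half else 0
        return self.children[2 * iy + ix]

    def _split(self):
        half = self.size / 2.0
        self.children = [
            Quad(self.x + ix * half, self.y + iy * half, half,
                 self.capacity, self.min_size)
            for iy in (0, 1) for ix in (0, 1)
        ]
        pending, self.points = self.points, []
        for px, py in pending:
            self._child_for(px, py).insert(px, py)

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]
```

Dense parts of the flock end up in many small leaves while empty space stays in a few large ones, which is what makes the decomposition dynamic.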
QuadTree balancing
- Unbalanced code still has some speedup because the total simulation space is divided among more processors
- Mass flock movement requires balancing the quadtree among threads by reassigning areas of the simulation space
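Rebalancing can be pictured as redistributing quads so that each thread handles a similar number of agents. The sketch below uses a simple greedy heuristic (largest quad first, always to the least-loaded thread) and ignores spatial contiguity; the function name `balance_quads` and the heuristic itself are assumptions, not the scheme used in the project.

```python
import heapq


def balance_quads(quad_loads, num_threads):
    """Assign quads to threads, heaviest first, to the least-loaded thread.

    quad_loads maps a quad id to its occupant count. Returns the
    per-thread quad lists and the per-thread total loads.
    """
    heap = [(0, t) for t in range(num_threads)]  # (current load, thread id)
    heapq.heapify(heap)
    assignment = {t: [] for t in range(num_threads)}

    # Sort by descending load; ties are broken by quad id for determinism.
    for quad_id, load in sorted(quad_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, t = heapq.heappop(heap)
        assignment[t].append(quad_id)
        heapq.heappush(heap, (total + load, t))

    totals = {t: sum(quad_loads[q] for q in qs) for t, qs in assignment.items()}
    return assignment, totals
```

A real implementation would likely also try to keep each thread's quads adjacent, since neighboring quads share interaction data.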
QuadTree optimizations
- Can adjust the maximum number of occupants before splitting a cell, as well as the minimum number before recombining a cell
- A lower max prevents spurious inter-boid computation
- A higher minimum prevents checking more quads for interaction than necessary
- Min and max that are too close mean too much quad splitting/recombining
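The cost of placing the two thresholds too close together can be shown with a toy model of one region whose occupant count oscillates from step to step. The function name `count_restructures` and the specific counts and thresholds are hypothetical values picked to make the effect visible.

```python
def count_restructures(counts, max_occupants, min_occupants):
    """Count split/recombine events for one region over a series of steps.

    The region splits when it is a leaf holding more than max_occupants,
    and recombines when it is split and holds fewer than min_occupants.
    """
    is_leaf = True
    events = 0
    for count in counts:
        if is_leaf and count > max_occupants:
            is_leaf = False  # split
            events += 1
        elif not is_leaf and count < min_occupants:
            is_leaf = True   # recombine
            events += 1
    return events


# A region hovering around 10 occupants.
oscillating = [9, 11] * 5
tight = count_restructures(oscillating, max_occupants=10, min_occupants=10)
loose = count_restructures(oscillating, max_occupants=12, min_occupants=6)
```

With the tight setting the region restructures on almost every step, while the wider gap leaves it untouched for the same input.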
Cell Broadband Engine Architecture
- Developed by Sony, Toshiba, IBM
- 8 SPEs, 1 PPE
- PS3 has 7 SPEs (annoying)
- High bandwidth interconnect (205 GB/s peak)
Hardware support for Communication
SPEs to PPE, or PPE to SPEs:
- SPE Mailboxes (32-bit messages)
  - 4 inbound, 2 outbound (total)
  - Can use mailboxes to talk SPE-SPE, but must set up memory mapping
- DMA Transfers
  - Must be 16B aligned
  - Transfer from main memory to local store
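The 16-byte alignment rule means an arbitrary byte range in main memory generally has to be widened before it can be transferred. The sketch below only does the address arithmetic in Python; it issues no real DMA. The helper name `plan_dma` is hypothetical, and the 16 KB per-transfer cap is the commonly documented Cell limit, stated here as an assumption since the slides do not mention it.

```python
DMA_ALIGN = 16             # alignment requirement from the slide
DMA_MAX_BYTES = 16 * 1024  # assumed per-transfer limit (not from the slides)


def plan_dma(addr, nbytes):
    """Return (aligned_addr, size) chunks covering [addr, addr + nbytes).

    The start is rounded down and the end rounded up to DMA_ALIGN, then
    the range is cut into chunks of at most DMA_MAX_BYTES.
    """
    if nbytes <= 0:
        return []
    start = addr - (addr % DMA_ALIGN)
    end = addr + nbytes
    end += (-end) % DMA_ALIGN

    chunks = []
    while start < end:
        size = min(DMA_MAX_BYTES, end - start)
        chunks.append((start, size))
        start += size
    return chunks
```

In practice it is simpler to allocate agent arrays on 16-byte boundaries and pad them to a multiple of 16 bytes, so that no widening is needed.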
Flocking on SPEs, first go
- Used Function-Offload parallel programming model
- Shipped off call to interact_fish() to 4 SPEs (must use pthreads to do this)
- Each gets pointers to data in main memory
- DMA in the data, calculate ax and ay, write back
Performance
Had 5 more goes at it
1) Function-Offload interact_fish()
2) Parallelize move(), need to reduce dt
3) 4 -> 6 SPEs (8 not good)
4) Double-Precision -> Single
5) Remove dt, precalculate
6) Move to Streaming Model
Performance
1) Function-Offload interact_fish()
2) Parallelize move(), need to reduce dt
3) 4 -> 6 SPEs (8 not good)
4) Double-Precision -> Single
5) Remove dt, precalculate
6) Move to Streaming Model
Still a lot of performance enhancing options
1) SIMDization of code: need structure of arrays (SOA), not array of structures (AOS)
2) Reducing branch penalty on SPE: branch hint statements
3) Minimize agent transfers
4) QuadTree on SPEs
5) SPE->SPE communication
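The SOA-versus-AOS point is about data layout: SIMD units want each field stored contiguously so that several agents' x values (or vx values) can be loaded in one vector. The sketch below shows only the layout change, using single-precision arrays from the standard library; Python itself does not vectorize this loop, and the names `aos_to_soa` and `move_soa` and the field set (x, y, vx, vy) are assumptions.

```python
from array import array

FIELDS = ("x", "y", "vx", "vy")


def aos_to_soa(agents):
    """Convert a list of (x, y, vx, vy) tuples into one float array per field."""
    soa = {name: array("f") for name in FIELDS}  # "f" = single precision
    for agent in agents:
        for name, value in zip(FIELDS, agent):
            soa[name].append(value)
    return soa


def move_soa(soa, dt):
    """Advance positions in place; each loop touches contiguous field data."""
    xs, ys, vxs, vys = (soa[name] for name in FIELDS)
    for i in range(len(xs)):
        xs[i] += vxs[i] * dt
    for i in range(len(ys)):
        ys[i] += vys[i] * dt
```

In the AOS form the same update would stride across whole agent records, which is the access pattern that makes SIMD loads awkward.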
Arch Nemesis: Mailbox Waiting
Defeating Mailbox Waiting
Lastly, Usability
- 256KB LS = BAD
- Mostly low level “generic” C functions
- Weird context swapping
- Programmer intimate w/ hardware :(
- High memory bandwidth
- Code overlay (demand paging)
- Virtual Caches
- SPEs can run different code
- Programmer intimate w/ hardware :)
Questions?
Mark Howison, Jonathan Ellithorpe, Daniel Killebrew