1
Comparison of Cell and POWER5 Architectures for a Flocking Algorithm
A Performance and Usability Study
CS267 Final Project
Jonathan Ellithorpe, Mark Howison, Daniel Killebrew
2
Agent-Based Model of Flocking
Flocking agents follow simple rules (sketched below):
- Don't crowd other agents.
- Align your velocity with your neighbors' average velocity.
- Move toward the center of gravity of your neighbors.
- Move stochastically.
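A minimal sketch of one agent's update under these rules, assuming a simple all-pairs neighbor loop; the Agent fields, weights, and interaction radius are illustrative, not the project's actual values:

    #include <stdlib.h>

    typedef struct { float x, y, vx, vy; } Agent;

    /* One flocking update for agent i: separation, alignment, cohesion,
     * plus a small random kick. Weights and radius r are illustrative. */
    void update_agent(Agent *a, int n, int i, float r, float dt) {
        float sep_x = 0, sep_y = 0, avg_vx = 0, avg_vy = 0, cx = 0, cy = 0;
        int neighbors = 0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            float dx = a[j].x - a[i].x, dy = a[j].y - a[i].y;
            if (dx * dx + dy * dy < r * r) {
                sep_x -= dx; sep_y -= dy;             /* don't crowd: push away    */
                avg_vx += a[j].vx; avg_vy += a[j].vy; /* align with neighbors      */
                cx += a[j].x; cy += a[j].y;           /* move toward their center  */
                neighbors++;
            }
        }
        if (neighbors > 0) {
            a[i].vx += 0.05f * sep_x + 0.05f * (avg_vx / neighbors - a[i].vx)
                     + 0.01f * (cx / neighbors - a[i].x);
            a[i].vy += 0.05f * sep_y + 0.05f * (avg_vy / neighbors - a[i].vy)
                     + 0.01f * (cy / neighbors - a[i].y);
        }
        a[i].vx += 0.01f * ((float)rand() / RAND_MAX - 0.5f); /* stochastic move */
        a[i].vy += 0.01f * ((float)rand() / RAND_MAX - 0.5f);
        a[i].x += a[i].vx * dt;
        a[i].y += a[i].vy * dt;
    }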
3
Serial Implementation: The Grid
- Spatial decomposition into a moving grid that follows the agents' center of gravity (binning sketch below)
- Performs better than the naïve all-pairs implementation during flock formation
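A minimal sketch of binning an agent into the moving grid; the grid "moves" because its origin is recomputed each step from the flock's center of gravity. The names (Cell, half_width, cells) are hypothetical:

    /* Map an agent into a cells x cells grid centered on the flock's
     * center of gravity (cgx, cgy). */
    typedef struct { int col, row; } Cell;

    Cell grid_cell(float x, float y, float cgx, float cgy,
                   float half_width, int cells) {
        float cell_size = 2.0f * half_width / cells;
        int col = (int)((x - (cgx - half_width)) / cell_size);
        int row = (int)((y - (cgy - half_width)) / cell_size);
        /* Clamp outliers to the border cells. */
        if (col < 0) col = 0; if (col >= cells) col = cells - 1;
        if (row < 0) row = 0; if (row >= cells) row = cells - 1;
        return (Cell){ col, row };
    }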
4
OpenMP on POWER5: Spatial Layout
- The parallel for construct defaults to assigning contiguous rows of grid cells to threads
- A Hilbert curve ordering provides better load balancing (sketch below)
[Figure: Hilbert curve layouts for 8x8, 16x16, and 32x32 grid sizes.]
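A sketch of iterating the grid in Hilbert order under OpenMP, using the standard Hilbert index-to-coordinate conversion; process_cell is a hypothetical per-cell kernel, and n must be a power of two:

    /* Rotate/flip a quadrant (standard Hilbert curve construction). */
    static void rot(int n, int *x, int *y, int rx, int ry) {
        if (ry == 0) {
            if (rx == 1) { *x = n - 1 - *x; *y = n - 1 - *y; }
            int t = *x; *x = *y; *y = t;
        }
    }

    /* Convert a 1-D Hilbert index d into (x, y) on an n x n grid. */
    static void d2xy(int n, int d, int *x, int *y) {
        int rx, ry, t = d;
        *x = *y = 0;
        for (int s = 1; s < n; s *= 2) {
            rx = 1 & (t / 2);
            ry = 1 & (t ^ rx);
            rot(s, x, y, rx, ry);
            *x += s * rx;
            *y += s * ry;
            t /= 4;
        }
    }

    /* Threads walk cells in Hilbert order, so consecutive iterations
     * touch spatially nearby cells and per-thread work is better balanced. */
    #pragma omp parallel for schedule(dynamic)
    for (int d = 0; d < n * n; d++) {
        int cx, cy;
        d2xy(n, d, &cx, &cy);
        process_cell(grid, cx, cy);
    }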
5
OpenMP on POWER5: Performance
6
OpenMP on POWER5: Profile
7
QuadTree
- Two-dimensional dynamic spatial decomposition
- When a square reaches capacity, split it into four children (see the sketch below)
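A minimal sketch of capacity-triggered splitting; MAX_OCCUPANTS and the node layout are illustrative (the split/recombine thresholds are the tunables discussed two slides on):

    #include <stdlib.h>

    #define MAX_OCCUPANTS 8          /* split threshold; tunable */

    typedef struct { float x, y; } Point;

    typedef struct Quad {
        float cx, cy, half;          /* square center and half-width */
        Point pts[MAX_OCCUPANTS];
        int count;                   /* -1 once split into children */
        struct Quad *child[4];       /* NW, NE, SW, SE */
    } Quad;

    static Quad *quad_new(float cx, float cy, float half) {
        Quad *q = calloc(1, sizeof(Quad));
        q->cx = cx; q->cy = cy; q->half = half;
        return q;
    }

    void quad_insert(Quad *q, Point p) {
        if (q->count >= 0 && q->count < MAX_OCCUPANTS) {
            q->pts[q->count++] = p;  /* room left in this square */
            return;
        }
        if (q->count == MAX_OCCUPANTS) {
            /* At capacity: split into four quadrants, push points down. */
            float h = q->half / 2;
            q->child[0] = quad_new(q->cx - h, q->cy + h, h);
            q->child[1] = quad_new(q->cx + h, q->cy + h, h);
            q->child[2] = quad_new(q->cx - h, q->cy - h, h);
            q->child[3] = quad_new(q->cx + h, q->cy - h, h);
            int n = q->count;
            q->count = -1;           /* mark as interior node */
            for (int i = 0; i < n; i++)
                quad_insert(q, q->pts[i]);
        }
        /* Interior node: route the point to the matching child. */
        int east  = p.x >= q->cx;
        int south = p.y <  q->cy;
        quad_insert(q->child[2 * south + east], p);
    }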
8
QuadTree Balancing
- The unbalanced code still shows some speedup, because the total simulation space is divided among more processors
- Mass flock movement requires rebalancing the quadtree across threads by reassigning areas of the simulation space
9
QuadTree Optimizations
- Tunable parameters: the maximum number of occupants before splitting a cell, and the minimum number before recombining children
- A lower maximum prevents spurious agent-agent computation within a cell
- A higher minimum prevents checking more quads for interactions than necessary
- A minimum and maximum that are too close together cause excessive quad splitting/recombining
10
Cell Broadband Engine Architecture
- Developed by Sony, Toshiba, and IBM
- 8 SPEs, 1 PPE
- The PS3 has 7 SPEs (annoying)
- High-bandwidth interconnect (205 GB/s peak)
11
Hardware Support for Communication (SPEs to PPE, or PPE to SPEs)
- SPE mailboxes (32-bit messages)
  - 4 inbound, 2 outbound (total) per SPE
  - Mailboxes can be used SPE-to-SPE, but this requires setting up memory mapping
- DMA transfers (sketch below)
  - Must be 16-byte aligned
  - Move data between main memory and an SPE's local store
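A minimal SPE-side sketch using the Cell SDK's spu_mfcio.h: mfc_get pulls data from main memory into the local store, and the outbound mailbox carries a 32-bit message back to the PPE. The agent buffer and byte count are hypothetical:

    #include <spu_mfcio.h>

    #define TAG 1

    /* Local-store buffer; DMA addresses must be 16-byte aligned. */
    volatile float agents[1024] __attribute__((aligned(16)));

    void fetch_agents(unsigned long long ea, unsigned int bytes) {
        /* DMA from main memory (effective address ea) into local store. */
        mfc_get(agents, ea, bytes, TAG, 0, 0);
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();   /* block until the transfer completes */
    }

    void signal_done(unsigned int msg) {
        spu_write_out_mbox(msg);     /* 32-bit message; blocks if the mailbox is full */
    }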
12
Flocking on SPEs, First Go
- Used the Function-Offload parallel programming model
- Shipped the interact_fish() call off to 4 SPEs (must use pthreads to do this; sketch below)
- Each SPE gets pointers to its data in main memory
- DMA in the data, calculate ax and ay, write the results back
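A PPE-side sketch of the offload, assuming libspe2 and a hypothetical embedded SPE binary interact_fish_spu; spe_context_run blocks until the SPE program exits, which is why each SPE context needs its own pthread:

    #include <libspe2.h>
    #include <pthread.h>

    #define NUM_SPES 4

    extern spe_program_handle_t interact_fish_spu;  /* hypothetical embedded SPE binary */

    typedef struct { spe_context_ptr_t ctx; void *argp; } SpeArg;

    static void *run_spe(void *arg) {
        SpeArg *a = arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        /* Blocks until the SPE program exits, hence one pthread per context. */
        spe_context_run(a->ctx, &entry, 0, a->argp, NULL, NULL);
        return NULL;
    }

    void offload_interact_fish(void *work[NUM_SPES]) {
        pthread_t tid[NUM_SPES];
        SpeArg args[NUM_SPES];
        for (int i = 0; i < NUM_SPES; i++) {
            args[i].ctx  = spe_context_create(0, NULL);
            args[i].argp = work[i];  /* pointer to this SPE's slice in main memory */
            spe_program_load(args[i].ctx, &interact_fish_spu);
            pthread_create(&tid[i], NULL, run_spe, &args[i]);
        }
        for (int i = 0; i < NUM_SPES; i++) {
            pthread_join(tid[i], NULL);
            spe_context_destroy(args[i].ctx);
        }
    }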
13
Performance
14
Had 5 More Goes at It
1) Function-Offload of interact_fish()
2) Parallelize move(); need to reduce dt
3) 4 -> 6 SPEs (8 not good)
4) Double precision -> single precision
5) Remove dt; precalculate
6) Move to the Streaming Model
15
Performance
Versions:
1) Function-Offload of interact_fish()
2) Parallelize move(); need to reduce dt
3) 4 -> 6 SPEs (8 not good)
4) Double precision -> single precision
5) Remove dt; precalculate
6) Move to the Streaming Model
Still a lot of performance-enhancing options:
1) SIMDization of the code: needs structure-of-arrays (SOA), not array-of-structures (AOS) (sketch below)
2) Reducing the branch penalty on the SPE with branch hint statements
3) Minimizing agent transfers
4) QuadTree on the SPEs
5) SPE-to-SPE communication
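A minimal sketch of the AOS-to-SOA change that SIMDization needs: with SOA, four consecutive agents' coordinates sit contiguously and map directly onto the SPU's 128-bit vector registers. The types and field names are illustrative, and n is assumed to be a multiple of 4:

    #include <spu_intrinsics.h>

    /* AOS: one struct per agent; x values are strided, awkward for SIMD. */
    typedef struct { float x, y, vx, vy; } AgentAOS;

    /* SOA: one array per field; four consecutive x values fill one vector. */
    typedef struct {
        float *x, *y, *vx, *vy;      /* each 16-byte aligned, length n */
    } AgentsSOA;

    /* Advance positions four agents at a time: x += vx * dt. */
    void move_soa(AgentsSOA *a, int n, float dt) {
        vector float vdt = spu_splats(dt);
        vector float *x  = (vector float *)a->x;
        vector float *vx = (vector float *)a->vx;
        for (int i = 0; i < n / 4; i++)
            x[i] = spu_madd(vx[i], vdt, x[i]);   /* fused multiply-add */
    }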
16
Arch Nemesis: Mailbox Waiting
17
Defeating Mailbox Waiting
18
Lastly, Usability
Cons:
- 256KB local store = BAD
- Mostly low-level "generic" C functions
- Weird context swapping
- Programmer must be intimate with the hardware :(
Pros:
- High memory bandwidth
- Code overlay (demand paging)
- Virtual caches
- SPEs can run different code
- Programmer gets to be intimate with the hardware :)
19
Questions?
Mark Howison
Jonathan Ellithorpe
Daniel Killebrew