Comparison of Cell and POWER5 Architectures for a Flocking Algorithm A Performance and Usability Study CS267 Final Project Jonathan Ellithorpe Mark Howison.

Presentation transcript:

Comparison of Cell and POWER5 Architectures for a Flocking Algorithm A Performance and Usability Study CS267 Final Project Jonathan Ellithorpe Mark Howison Daniel Killebrew

Agent-Based Model of Flocking
Flocking agents follow simple rules:
- Don't crowd other agents.
- Align your velocity with your neighbors' average velocity.
- Move toward the center of gravity of your neighbors.
- Move stochastically.
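The four rules can be sketched as one naive O(n^2) update step. This is only an illustration, not the project's code: the function name step, the tuple layout (x, y, vx, vy), and every weight and radius below are assumptions chosen for readability.

```python
import math
import random


def step(agents, radius=2.0, crowd=0.5, w_sep=1.0, w_align=0.1,
         w_coh=0.1, noise=0.0, dt=0.1, rng=random):
    """Advance all agents one time step; agents is a list of (x, y, vx, vy)."""
    new_agents = []
    for i, (x, y, vx, vy) in enumerate(agents):
        ax = ay = 0.0
        sum_x = sum_y = sum_vx = sum_vy = 0.0
        n = 0
        for j, (ox, oy, ovx, ovy) in enumerate(agents):
            if i == j:
                continue
            dx, dy = ox - x, oy - y
            dist = math.hypot(dx, dy)
            if dist > radius:
                continue  # outside the interaction radius
            n += 1
            sum_x += ox
            sum_y += oy
            sum_vx += ovx
            sum_vy += ovy
            if 0.0 < dist < crowd:
                # Rule 1: don't crowd, push directly away from the neighbor.
                ax -= w_sep * dx / dist
                ay -= w_sep * dy / dist
        if n:
            # Rule 2: steer toward the neighbors' average velocity.
            ax += w_align * (sum_vx / n - vx)
            ay += w_align * (sum_vy / n - vy)
            # Rule 3: steer toward the neighbors' center of gravity.
            ax += w_coh * (sum_x / n - x)
            ay += w_coh * (sum_y / n - y)
        # Rule 4: stochastic component (disabled when noise == 0).
        ax += noise * rng.uniform(-1.0, 1.0)
        ay += noise * rng.uniform(-1.0, 1.0)
        vx, vy = vx + ax * dt, vy + ay * dt
        new_agents.append((x + vx * dt, y + vy * dt, vx, vy))
    return new_agents
```

With noise left at zero, two distant neighbors drift together (cohesion) while two agents closer than crowd are pushed apart.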

Serial Implementation: The Grid
- Spatial decomposition into a moving grid that follows the agents’ center of gravity
- Performs better than the naïve implementation during flock formation
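A minimal sketch of such a moving grid, assuming the cell size is at least the interaction radius so that only the 3x3 block of cells around an agent needs checking. The names build_grid and neighbour_candidates, and the choice to clamp stragglers into border cells, are assumptions made for this example.

```python
def build_grid(positions, cell_size, n_cells):
    """Bin agent indices into an n_cells x n_cells grid centered on the
    agents' center of gravity. positions is a list of (x, y)."""
    cx = sum(p[0] for p in positions) / len(positions)
    cy = sum(p[1] for p in positions) / len(positions)
    half = cell_size * n_cells / 2.0
    grid = {}
    for idx, (x, y) in enumerate(positions):
        col = int((x - cx + half) // cell_size)
        row = int((y - cy + half) // cell_size)
        # Agents that strayed outside the grid are clamped into border cells.
        col = min(max(col, 0), n_cells - 1)
        row = min(max(row, 0), n_cells - 1)
        grid.setdefault((row, col), []).append(idx)
    return grid


def neighbour_candidates(grid, row, col):
    """Indices of agents in the 3x3 block of cells around (row, col)."""
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out.extend(grid.get((row + dr, col + dc), []))
    return out
```

Because the grid is rebuilt around the current center of gravity each step, it follows the flock instead of covering the whole simulation space.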

OpenMP on POWER5: Spatial Layout
- By default, the parallel for construct divides the grid among threads by rows
- A Hilbert curve provides better load-balancing
- Figure: Hilbert curve layouts for 8x8, 16x16, and 32x32 grid sizes.
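One way to get a Hilbert layout is to sort grid cells by their index along the curve and hand each thread a contiguous run of cells. The sketch below uses the standard iterative coordinate-to-index conversion for a power-of-two grid; the helper names (xy2d, hilbert_order, partition) are assumptions, and the slides do not show how the authors computed their layout.

```python
def xy2d(n, x, y):
    """Index of cell (x, y) along the Hilbert curve on an n x n grid
    (n must be a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(n):
    """All cells of the n x n grid, sorted along the Hilbert curve."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: xy2d(n, c[0], c[1]))


def partition(cells, n_threads):
    """Split the ordered cells into contiguous chunks, one per thread."""
    chunk = (len(cells) + n_threads - 1) // n_threads
    return [cells[i:i + chunk] for i in range(0, len(cells), chunk)]
```

Consecutive cells on the curve are always adjacent in the grid, so each thread's chunk is a compact region rather than a thin strip of rows.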

OpenMP on POWER5: Performance

OpenMP on POWER5: Profile

QuadTree
- Two-dimensional dynamic spatial decomposition
- When a square reaches capacity, split it up
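The split rule can be sketched as follows. The class name Quad, the default capacity, and the min_size guard (which stops coincident points from splitting forever) are assumptions for illustration; here a square splits as soon as it holds more than capacity points.

```python
class Quad:
    """A square region that splits into four children when it overflows."""

    def __init__(self, x, y, size, capacity=4, min_size=1e-6):
        self.x, self.y, self.size = x, y, size  # lower-left corner, side length
        self.capacity = capacity
        self.min_size = min_size
        self.points = []
        self.children = None  # None while this quad is a leaf

    def insert(self, px, py):
        if self.children is not None:
            self._child_for(px, py).insert(px, py)
            return
        self.points.append((px, py))
        if len(self.points) > self.capacity and self.size > self.min_size:
            self._split()

    def _child_for(self, px, py):
        half = self.size / 2.0
        col = 1 if px >= self.x + half else 0
        row = 1 if py >= self.y + half else 0
        return self.children[row * 2 + col]

    def _split(self):
        half = self.size / 2.0
        self.children = [
            Quad(self.x + c * half, self.y + r * half, half,
                 self.capacity, self.min_size)
            for r in (0, 1) for c in (0, 1)
        ]
        pts, self.points = self.points, []
        for px, py in pts:
            self._child_for(px, py).insert(px, py)

    def count(self):
        """Total number of points stored in this subtree."""
        if self.children is None:
            return len(self.points)
        return sum(child.count() for child in self.children)
```

Dense regions of the flock end up in small squares and empty regions stay as large leaves, which is what makes the decomposition dynamic.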

QuadTree balancing
- Unbalanced code still has some speedup because the total simulation space is divided among more processors
- Mass flock movement requires balancing the quadtree among threads by reassigning areas of the simulation space

QuadTree optimizations
- Can adjust the maximum number of occupants before splitting a cell, as well as the minimum number before recombining a cell
- A lower max prevents spurious inter-boid computation
- A higher minimum prevents checking more quads for interaction than necessary
- A min and max that are too close together cause too much quad splitting/recombining

Cell Broadband Engine Architecture
- Developed by Sony, Toshiba, IBM
- 8 SPEs, 1 PPE
- PS3 has 7 SPEs (annoying)
- High bandwidth interconnect (205GB/s peak)

Hardware support for Communication
SPEs to PPE, or PPE to SPEs:
- SPE Mailboxes (32-bit messages)
  - 4 inbound, 2 outbound (total)
  - Can use mailboxes to talk SPE-SPE, but must set up memory mapping
- DMA Transfers
  - Must be 16B aligned
  - Transfer from main memory to local store
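The two constraints on this slide (16-byte alignment for DMA, 32-bit mailbox messages) translate into some simple bookkeeping. The sketch below only models that arithmetic and does not call any Cell SDK function; the helper names and the 8-bit command / 24-bit count message layout are hypothetical.

```python
DMA_ALIGN = 16  # bytes; alignment requirement stated on the slide


def align_up(n, alignment=DMA_ALIGN):
    """Round n up to the next multiple of alignment (a power of two)."""
    return (n + alignment - 1) & ~(alignment - 1)


def is_aligned(address, alignment=DMA_ALIGN):
    """True if address is a legal start address for an aligned transfer."""
    return address % alignment == 0


def pack_mailbox(command, count):
    """Pack a hypothetical 8-bit command and 24-bit count into one
    32-bit mailbox word."""
    return ((command & 0xFF) << 24) | (count & 0xFFFFFF)


def unpack_mailbox(word):
    """Inverse of pack_mailbox: returns (command, count)."""
    return (word >> 24) & 0xFF, word & 0xFFFFFF
```

For example, a 20-byte agent record would be padded to align_up(20) bytes so that consecutive records in an array all start on a 16-byte boundary.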

Flocking on SPEs, first go
- Used the Function-Offload parallel programming model
- Shipped off the call to interact_fish() to 4 SPEs (must use pthreads to do this)
- Each SPE gets pointers to data in main memory
- DMA in the data, calculate ax and ay, write back

Performance

Had 5 more goes at it
1) Function-Offload interact_fish()
2) || move(), need to reduce dt
3) 4 -> 6 SPEs (8 not good)
4) Double-Precision -> Single
5) Remove dt, precalculate
6) Move to Streaming Model

Performance
1) Function-Offload interact_fish()
2) || move(), need to reduce dt
3) 4 -> 6 SPEs (8 not good)
4) Double-Precision -> Single
5) Remove dt, precalculate
6) Move to Streaming Model
Still a lot of performance enhancing options
1) SIMDization of code: need SOA, not AOS
2) Reducing branch penalty on SPE: branch hint statements
3) Minimize agents transfer
4) QuadTree on SPEs
5) SPE->SPE communication
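The "SOA, not AOS" point refers to storing each field in its own contiguous array (structure of arrays) rather than one record per agent (array of structures), so that four single-precision values of the same field fill one 128-bit SIMD register. A sketch of the layout change, with assumed names aos_to_soa and move_soa; the inner loop over a block of four stands in for what would be a single SIMD operation on an SPE.

```python
from array import array

SIMD_WIDTH = 4  # four 32-bit floats per 128-bit register


def aos_to_soa(agents):
    """Convert a list of (x, y, vx, vy) records (AOS) into one
    single-precision array per field (SOA)."""
    return {
        "x": array("f", (a[0] for a in agents)),
        "y": array("f", (a[1] for a in agents)),
        "vx": array("f", (a[2] for a in agents)),
        "vy": array("f", (a[3] for a in agents)),
    }


def move_soa(soa, dt):
    """Advance positions in blocks of SIMD_WIDTH agents."""
    n = len(soa["x"])
    for start in range(0, n, SIMD_WIDTH):
        end = min(start + SIMD_WIDTH, n)
        # On SIMD hardware each of these two updates would be one
        # vector multiply-add over the whole block.
        for k in range(start, end):
            soa["x"][k] += soa["vx"][k] * dt
            soa["y"][k] += soa["vy"][k] * dt
    return soa
```

In the AOS layout the x values of neighboring agents are separated by the other fields, so loading four of them into one register would need extra shuffling.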

Arch Nemesis: Mailbox Waiting

Defeating Mailbox Waiting

Lastly, Usability
- 256KB LS = BAD
- Mostly low level “generic” C functions
- Weird context swapping
- Programmer intimate w/ hardware :(

- High memory bandwidth
- Code overlay (demand paging)
- Virtual Caches
- SPEs can run different code
- Programmer intimate w/ hardware :)

Questions? Mark Howison Jonathan Ellithorpe Daniel Killebrew