Development of a Ray Casting Application for the Cell Broadband Engine Architecture Shuo Wang University of Minnesota Twin Cities Matthew Broten Institute.


Development of a Ray Casting Application for the Cell Broadband Engine Architecture
- Shuo Wang, University of Minnesota Twin Cities
- Matthew Broten, Institute of Technology, University of Minnesota Twin Cities
- Professor David A. Yuen

Overview
- General Overview
- Programming for the Cell Architecture
- Ray Casting Theory and Mathematics
- Ray Casting Application Development

Overview: Why the Cell?
- Novel: Linux support became available in 2005
- Affordable: a PlayStation 3 costs $650
- Fast: 25.6 GFLOPS

Overview: What To Do
- Computationally challenging even if it is mathematically simple
- Accuracy is less crucial than speed
- Easy to visualize

Overview – Results of Internship
- IEEE 2007 in Sacramento, CA
- SuperComputing 2007 in Reno, NV

Programming for the Cell Architecture

Programming for the Cell Architecture: Challenges
- Cooperation between PPE and SPEs
- SPE memory limitations
- SPE side code vectorization

Programming for the Cell Architecture: Introductory Knowledge
- Application Organization
- Division of Computational Labor
- SPE Program Initialization
- Communication Between PPE and SPEs
- Data Transfer with DMA
- Manual Optimization
- Automatic Optimization

Programming for the Cell Architecture: Application Organization
- Top Level
- ./spu

Programming for the Cell Architecture: Parallelism Models
- Task Parallelism
- Data Parallelism

Programming for the Cell Architecture: SPE Thread Creation
The PPE program uses the interface provided by libspe, an SPE runtime management library:
    extern spe_program_handle_t MyProgram_spu;
    int main(int argc, char **argv) {
        ...
        speid_t spe_id;
        spe_id = spe_create_thread(threadGroup, &MyProgram_spu, &controlBlock, ... );
        ...
    }

Programming for the Cell Architecture: Communication Using Mailboxes
Mailboxes provide a method of communication between the PPE and the SPEs.
Three mailbox queues are provided in the Memory Flow Controller of each SPE:
- PPE Mailbox Queue
- PPE Interrupt Mailbox Queue
- SPU Mailbox Queue

Programming for the Cell Architecture: Communication Using Mailboxes
PPE Mailbox Queue: SPE writes message, PPE reads message
SPE side code to write a message:
    unsigned int value;
    spu_write_out_mbox(value);
PPE side code to receive a message:
    unsigned int value;
    value = spe_read_out_mbox(spe_id);
- The PPE Interrupt Mailbox Queue works similarly

Programming for the Cell Architecture: Communication Using Mailboxes
SPU Mailbox Queue: PPE writes message, SPE reads message
PPE side code to write a message:
    unsigned int value;
    spe_write_in_mbox(spe_id, value);
SPE side code to receive a message:
    unsigned int value;
    value = spu_read_in_mbox();

Programming for the Cell Architecture: Data Transfer with DMA
(1) SPU tells the DMA engine that it needs data located in main memory
(2) DMA engine requests the data from main memory
(3) DMA engine copies the data from main memory to the local store

Programming for the Cell Architecture: Data Transfer with DMA
- Each MFC can process a queue of 24 DMA commands
- Each transfer size must be a multiple of 16 bytes (transfers of 1, 2, 4, or 8 bytes are also allowed)
- Maximum of 16 KB per transfer
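
The size limits above determine how a large buffer has to be split into individual DMA commands. The following is a minimal sketch in portable C, not code from the presentation: the helper names (`dma_padded_size`, `dma_transfer_count`, `dma_chunk_size`) are hypothetical, and it only does the size arithmetic, without issuing any real DMA.

```c
#include <assert.h>
#include <stddef.h>

#define DMA_MAX_BYTES 16384u  /* 16 KB limit per transfer (from the slide) */
#define DMA_ALIGN     16u     /* transfer sizes are multiples of 16 bytes */

/* Round a byte count up to the next multiple of 16 (hypothetical helper). */
size_t dma_padded_size(size_t bytes)
{
    return (bytes + DMA_ALIGN - 1) / DMA_ALIGN * DMA_ALIGN;
}

/* Number of separate transfers needed to move `bytes` bytes. */
size_t dma_transfer_count(size_t bytes)
{
    size_t padded = dma_padded_size(bytes);
    return (padded + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}

/* Size in bytes of transfer number `index` (0-based) for a `bytes`-byte buffer. */
size_t dma_chunk_size(size_t bytes, size_t index)
{
    size_t padded = dma_padded_size(bytes);
    size_t count = dma_transfer_count(bytes);
    assert(index < count);  /* caller must ask for an existing chunk */
    size_t remaining = padded - index * DMA_MAX_BYTES;
    return remaining < DMA_MAX_BYTES ? remaining : DMA_MAX_BYTES;
}
```

For example, a 40,000-byte buffer would need three transfers: two of 16,384 bytes and one of 7,232 bytes.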

Programming for the Cell Architecture: Data Transfer with DMA
Primary Operations:
- The GET command copies data from main memory to local store
- The PUT command copies data from local store to main memory

            GET           PUT
SPE SIDE    mfc_get       mfc_put
PPE SIDE    spe_mfc_get   spe_mfc_put

Programming for the Cell Architecture: Control Blocks
What is a control block?
Example control block:
    typedef struct _control_block {
        uintptr32_t arrayAddress1;
        unsigned int value1;
        uintptr32_t arrayAddress2;
        unsigned int value2;
    } control_block;

Programming for the Cell Architecture: Data Transfer with DMA
General Approach (main memory to local store):
(1) PPE: define and initialize control block in main memory
(2) PPE: pass reference to control block when creating SPE thread
(3) SPE: allocate memory in local store for control block and other data to be transferred
(4) SPE: copy control block from main memory to local store
(5) SPE: use address in control block to copy other data from main memory to local store
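
The five steps above can be walked through in a small simulation. This is only a sketch under stated assumptions: it runs on an ordinary CPU, `fake_dma_get` is a hypothetical stand-in built on `memcpy` (on real hardware the SPE would issue `mfc_get` and wait for completion), and the `control_block_t` layout, the 64-float local buffer, and the function names are invented for illustration rather than taken from the application.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical control block: address and size of one array in main memory. */
typedef struct {
    uint64_t array_address;  /* where the data lives in "main memory" */
    uint32_t array_bytes;    /* size of the data, a multiple of 16 */
    uint32_t pad;            /* pads the struct to 16 bytes */
} control_block_t;

/* Stand-in for a DMA GET; enforces the size rules, then copies with memcpy. */
static void fake_dma_get(void *local, uint64_t main_addr, uint32_t bytes)
{
    assert(bytes % 16 == 0 && bytes <= 16384);
    memcpy(local, (const void *)(uintptr_t)main_addr, bytes);
}

/* "SPE side": steps (3) to (5). Returns the sum of the fetched floats. */
float spe_fetch_and_sum(uint64_t control_block_addr)
{
    control_block_t cb;  /* (3) local space for the control block */
    float data[64];      /* (3) local space for the array */

    fake_dma_get(&cb, control_block_addr, sizeof cb);      /* (4) */
    assert(cb.array_bytes <= sizeof data);
    fake_dma_get(data, cb.array_address, cb.array_bytes);  /* (5) */

    float sum = 0.0f;
    for (uint32_t i = 0; i < cb.array_bytes / sizeof(float); i++)
        sum += data[i];
    return sum;
}

/* "PPE side": steps (1) and (2). */
float ppe_run_example(void)
{
    static float values[8] = {1, 2, 3, 4, 5, 6, 7, 8};  /* 32 bytes */
    control_block_t cb = { (uint64_t)(uintptr_t)values, sizeof values, 0 };  /* (1) */
    return spe_fetch_and_sum((uint64_t)(uintptr_t)&cb);  /* (2) pass a reference */
}
```

The control block is kept at exactly 16 bytes so that it can itself be moved in a single transfer that satisfies the size rule.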

Programming for the Cell Architecture: Pipelining
Figures: optimization not in place; pipeline optimization in place

Programming for the Cell Architecture: Compilers
GCC vs IBM XLC, with data from Eric Rollins' “Ray Tracing” application (graph provided by Eric Rollins)

Ray Casting Theory and Mathematics - Overview

Ray Casting Theory and Mathematics: Math
Triangles are defined by three vertex points $A$, $B$, and $C$ in $\mathbb{R}^3$.
If there is an intersection between the ray and the triangle, then $P = E + tV$, where
- $P$ = point of intersection between ray and triangle
- $E$ = location of eye
- $V$ = directional ray from the eye to the pixel of interest
- $t$ represents the distance from $E$ to the point of intersection along $V$

Ray Casting Theory and Mathematics: Math
- If $N \cdot V = 0$, where $N$ is the normal of the triangle, then the ray is parallel to the plane of the triangle and there is no intersection; try the next pixel.
- Else, compute $P$:
  $D = -(N \cdot A)$ ($A$ is a point of the triangle)
  $t = -(N \cdot E + D) / (N \cdot V)$
  $P = E + tV$
- Check that $P$ lies in the triangle defined by $A$, $B$, $C$: if $P$ is in the triangle $ABC$, the three quantities inA, inB, and inC all have the same sign.
- Calculate diffusions
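
The intersection test above can be sketched in plain C as follows. The plane step follows the slide ($D$, $t$, $P$). The same-sign quantities `inA`, `inB`, and `inC` are written here as edge cross products projected onto the normal, which is one common way to do the check and is an assumption, not necessarily the formula the authors used. The `vec3` type and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { float x, y, z; } vec3;

static vec3 sub(vec3 a, vec3 b) { vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static float dot(vec3 a, vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static vec3 cross(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
    return r;
}

/* Returns true and stores t if the ray E + tV hits triangle ABC. */
bool ray_hits_triangle(vec3 E, vec3 V, vec3 A, vec3 B, vec3 C, float *t_out)
{
    assert(t_out != NULL);

    vec3 N = cross(sub(B, A), sub(C, A));   /* triangle normal (not normalized) */
    float n_dot_v = dot(N, V);
    if (n_dot_v > -1e-8f && n_dot_v < 1e-8f)
        return false;                        /* ray parallel to the plane */

    float D = -dot(N, A);                    /* plane: N . P + D = 0 */
    float t = -(dot(N, E) + D) / n_dot_v;
    if (t < 0.0f)
        return false;                        /* intersection is behind the eye */

    vec3 P = { E.x + t * V.x, E.y + t * V.y, E.z + t * V.z };

    /* Assumed form of the same-sign test: one value per triangle edge. */
    float inA = dot(N, cross(sub(B, A), sub(P, A)));
    float inB = dot(N, cross(sub(C, B), sub(P, B)));
    float inC = dot(N, cross(sub(A, C), sub(P, C)));
    bool same_sign = (inA >= 0 && inB >= 0 && inC >= 0) ||
                     (inA <= 0 && inB <= 0 && inC <= 0);
    if (!same_sign)
        return false;

    *t_out = t;
    return true;
}
```

A ray from an eye at (0, 0, -1) looking along +Z toward a triangle lying in the plane z = 0 would report a hit with t = 1 when the ray passes through the triangle's interior.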

Ray Casting Theory and Mathematics – Pseudo Code (spu/trace.h)
    For (width of screen) {
        For (height of screen) {
            For (all objects in screen) {
                Find edges of objects
                if (ray crosses object) {
                    Calculate Reflections
                }
            }
        }
    }

Ray Casting Application Development

Overview
- Development Roadmap
- Current Capabilities
- Implementation Details
- Future Goals

Ray Casting Application: Overview
Created an enhanced version of Eric Rollins' open source “Real-Time Ray Tracing” application

Ray Casting Application: Development Roadmap
(1) Learning and exploration of Eric Rollins' “Ray Tracing” package
(2) Enhancement of “trace algorithm” for rendering of triangles
(3) Implementation of translation and rotation functionality
(4) Implementation of triangle initialization and transfer mechanism

Ray Casting Application: Development Roadmap (1) Exploration of Eric Rollins' open source application

Ray Casting Application: Development Roadmap (2) Enhancement of “trace algorithm” for rendering of triangles

Ray Casting Application: Development Roadmap (3) Implementation of translation and rotation functionality

Ray Casting Application: Development Roadmap (4) Implementation of triangle initialization and transfer mechanisms
Each triangle structure:
- contains 3 float vectors; each float vector contains three coordinates (X, Y, Z) and represents a point of the triangle
- consumes 48 bytes of memory since each float vector requires 16 bytes
DMA transfers must be 16 KB or less, with a size that is a multiple of 16 bytes
- This amounts to a max of 336 triangles per transfer
About 189 KB free in local store
- Enough room for 11 transfers of 336 triangles, which is a total of 3696 triangles
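
The sizes on this slide can be checked with a few lines of C. This is a sketch, not the application's actual data layout: `float4` with an unused fourth component is an assumption about how a three-coordinate point comes to occupy 16 bytes, and the type and function names are hypothetical. The figure of 336 triangles per transfer is taken from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: X, Y, Z plus one unused float, giving 16 bytes per point. */
typedef struct { float x, y, z, w; } float4;

/* One triangle = three points = 48 bytes. */
typedef struct { float4 a, b, c; } triangle_t;

enum {
    DMA_MAX_BYTES = 16 * 1024,       /* 16 KB limit per DMA transfer */
    TRIANGLES_PER_TRANSFER = 336     /* figure stated on the slide */
};

/* Bytes moved by one transfer carrying `triangles` triangles. */
size_t transfer_bytes(size_t triangles)
{
    assert(triangles <= TRIANGLES_PER_TRANSFER);
    return triangles * sizeof(triangle_t);
}

/* Triangles held in local store after `transfers` full transfers. */
size_t total_triangles(size_t transfers)
{
    return transfers * TRIANGLES_PER_TRANSFER;
}
```

With these definitions a full transfer is 336 × 48 = 16,128 bytes, which is a multiple of 16 and below the 16 KB limit, and 11 full transfers hold 11 × 336 = 3,696 triangles (177,408 bytes).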

Ray Casting Application: Current Capabilities

Ray Casting Application: Implementation Details
Application Organization: two programs:
- one executes on the PPU
- one runs on each SPU
Division of Labor: task parallelism where each SPE:
- holds identical data in its local store
- is responsible for doing computations for 1/6 of lines rendered to screen
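
One way to give each of the six SPEs its share of screen lines is sketched below. The slide does not say whether lines are assigned in contiguous bands or interleaved, so the contiguous split here is an assumption, and `line_range` and `lines_for_spe` are hypothetical names.

```c
#include <assert.h>

#define NUM_SPES 6  /* each SPE renders 1/6 of the lines (from the slide) */

typedef struct {
    int first_line;  /* first screen line this SPE renders */
    int line_count;  /* number of consecutive lines it renders */
} line_range;

/* Contiguous split; any remainder lines go to the lowest-numbered SPEs. */
line_range lines_for_spe(int spe_index, int screen_height)
{
    assert(spe_index >= 0 && spe_index < NUM_SPES);
    assert(screen_height >= 0);

    int base = screen_height / NUM_SPES;
    int extra = screen_height % NUM_SPES;

    line_range r;
    r.line_count = base + (spe_index < extra ? 1 : 0);
    r.first_line = spe_index * base + (spe_index < extra ? spe_index : extra);
    return r;
}
```

For a 480-line screen each SPE would get 80 lines. When the height is not divisible by six, the first few SPEs take one extra line each so that every line is covered exactly once.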

Ray Casting Application: PPE Program Life Cycle (figure: five numbered steps)

Ray Casting Application: SPE Program Life Cycle (figure: five numbered steps)

Ray Casting Application: Future Goals
Visualize larger datasets
- Now: limited to the rendering of about 4000 triangles
- Goal: develop mechanisms to render hundreds of thousands of triangles
Distribute computation over several PS3s
- Now: all computation performed on a single PS3
- Goal: build a cluster of PS3s and increase application performance by dividing the workload among the PS3s in the cluster

For more information: PS3 Wiki URL