CISC 879 : Software Support for Multicore Architectures John Cavazos Dept of Computer & Information Sciences University of Delaware www.cis.udel.edu/~cavazos/cisc879.

Presentation transcript:

CISC 879: Software Support for Multicore Architectures
Lecture 9: Cell Programming Tutorial
John Cavazos, Dept of Computer & Information Sciences, University of Delaware

Lecture 9: Overview
- Cell Basics
- Programming Models
- Programming Details
- Example Code

Cell Architecture Recap
- Heterogeneous architecture with 9 cores on chip
  - 1 PPE (general-purpose processor)
  - 8 SPEs (SIMD processors)
- The PPE runs control-plane code
  - Code with lots of branches (e.g., the OS)
- The SPEs run data-plane code
  - Computational code with few branches

Program Structure
- Multiple programs in one: PPU and SPU programs cooperate
- PPE code
  - Regular Linux process (main thread)
  - The process can spawn SPE threads
- SPE code
  - SPE executables are packaged inside PPE executables

SPE Details
- Register file: large (128 entries), 128-bit, and unified
- All instructions are SIMD instructions
- Local Store (LS, 256 KB)
  - Loads/stores access the LS
  - Contains all instructions and data used by the SPU
- DMA transfers data between the LS and main storage
  - High bandwidth (128 bytes per cycle)
- Non-deterministic features are eliminated
  - Out-of-order execution
  - Hardware-managed caches
  - Hardware branch prediction

SPE Register Layout
- The left-most word (bytes 0, 1, 2, and 3) of a register is called the preferred slot

SPE SIMD Example (add)
- The example is a 4-wide add
  - Each of the 4 elements in reg VA is added to the corresponding element in reg VB
  - The 4 results are placed in the appropriate slots in reg VC
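
The 4-wide add can be sketched in portable C. This is only a scalar model of the operation, not SPU code: the `vec4_u32` type and the `vec4_add` function are hypothetical names introduced for illustration, and on a real SPU the same effect would normally come from a single SIMD instruction (for example via the `spu_add` intrinsic).

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register holding four 32-bit words. */
typedef struct {
    uint32_t w[4];
} vec4_u32;

/* Element-wise add: slot i of the result is va.w[i] + vb.w[i].
   Unsigned arithmetic wraps modulo 2^32, so overflow is well defined. */
vec4_u32 vec4_add(vec4_u32 va, vec4_u32 vb)
{
    vec4_u32 vc;
    for (int i = 0; i < 4; i++) {
        vc.w[i] = va.w[i] + vb.w[i];
    }
    return vc;
}
```

Calling `vec4_add` on {1, 2, 3, 4} and {10, 20, 30, 40} yields {11, 22, 33, 44}, one result per slot.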

SPE SIMD Example (shuffle)
- Bytes are selected from regs VA and VB based on control vector VC
- Control vector entries are indices into VA and VB
- The operation is purely byte oriented
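
A byte-level model of the shuffle may make the indexing concrete. The sketch below is portable C, not SPU code, and rests on assumptions: `vec16_u8` and `vec16_shuffle` are hypothetical names, indices 0 to 15 are taken to select from VA and 16 to 31 from VB, and only the low 5 bits of each control byte are interpreted (the special constant-producing encodings of the real instruction are not modeled).

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register viewed as 16 bytes. */
typedef struct {
    uint8_t b[16];
} vec16_u8;

/*
 * Byte shuffle: each control byte in vc selects one byte from the 32-byte
 * concatenation VA || VB. Indices 0..15 pick from va, 16..31 from vb.
 */
vec16_u8 vec16_shuffle(vec16_u8 va, vec16_u8 vb, vec16_u8 vc)
{
    vec16_u8 out;
    for (int i = 0; i < 16; i++) {
        uint8_t sel = vc.b[i] & 0x1F;  /* keep the low 5 bits only */
        out.b[i] = (sel < 16) ? va.b[sel] : vb.b[sel - 16];
    }
    return out;
}
```

With a control vector of 0, 1, ..., 15 the result is a copy of VA; with 31, 30, ..., 16 it is VB with its bytes reversed.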

SPE Model
- Code and data must fit into the 256 KB local store
- Explicit input/output of the SPE program
  - Program arguments and return code
  - DMA
  - Mailboxes
  - Signals
- The PPE maps system memory for SPE DMA
[Figure: the SPE program in the local store exchanges data with system memory through DMA transactions]

SPE Model (cont'd)
- Streaming model for large input/output data
  - System memory holds int g_ip[512*1024] and int g_op[512*1024]
  - Local store holds int ip[32] and int op[32]
  - SPE program: op = func(ip)
  - DMA moves each chunk between system memory and the local store
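
The streaming pattern can be sketched in portable C as a loop over fixed-size chunks. This is a model under stated assumptions, not SPU code: `memcpy` stands in for the DMA get and put, the kernel `func` (which simply doubles each element) is hypothetical, and `stream_process` is a name introduced here.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 32  /* elements per local buffer, matching ip[32] and op[32] */

/* Hypothetical per-chunk kernel: doubles every element. */
static void func(const int *ip, int *op, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        op[i] = 2 * ip[i];
    }
}

/*
 * Stream g_ip through a small local buffer and write the results to g_op.
 * memcpy stands in for the DMA transfers; a short final chunk is handled.
 */
void stream_process(const int *g_ip, int *g_op, size_t total)
{
    int ip[CHUNK];
    int op[CHUNK];

    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = (total - off < CHUNK) ? (total - off) : CHUNK;
        memcpy(ip, g_ip + off, n * sizeof(int));  /* "get": system memory -> local */
        func(ip, op, n);                          /* compute on local data only */
        memcpy(g_op + off, op, n * sizeof(int));  /* "put": local -> system memory */
    }
}
```

The kernel never touches the large arrays directly, which mirrors the constraint that an SPE program only loads and stores within its local store.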

Programming Models
- How the application and its data are partitioned among the PPE and SPEs
- Partitioning involves considering
  - Program structure
  - Data structures
  - Moving data and code via DMA
- Several models
  - Data-parallel
  - Task-parallel
  - Job queue

Job Queue
- Code and data are packaged together
[Figure: a job queue in system memory holds code/data n, code/data n+1, code/data n+2, ...; the SPE kernel uses DMA to bring code n and data n into the local store]
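
A job queue of this kind might be modeled as follows. The sketch is single-threaded portable C under loud assumptions: a function pointer stands in for the code image, an `int` for the data block, and all type and function names (`job_t`, `job_queue_t`, `job_queue_pop`, `run_worker`, `square`, `negate`) are hypothetical. With several SPEs pulling from one queue, claiming a job would additionally need to be atomic.

```c
#include <stddef.h>

/* A job bundles the code to run with the data it operates on. */
typedef struct {
    int (*code)(int);  /* stand-in for the code image */
    int data;          /* stand-in for the data block */
    int result;
} job_t;

typedef struct {
    job_t *jobs;
    size_t count;
    size_t next;       /* index of the next unclaimed job */
} job_queue_t;

/* Claim the next job, or return NULL when the queue is drained. */
job_t *job_queue_pop(job_queue_t *q)
{
    if (q->next >= q->count) {
        return NULL;
    }
    return &q->jobs[q->next++];
}

/* Worker loop: fetch a job, run its code on its data, repeat. */
size_t run_worker(job_queue_t *q)
{
    size_t done = 0;
    job_t *j;
    while ((j = job_queue_pop(q)) != NULL) {
        j->result = j->code(j->data);
        done++;
    }
    return done;
}

/* Two hypothetical job kernels. */
int square(int x) { return x * x; }
int negate(int x) { return -x; }
```

Because each job carries its own code, different entries in the same queue can run different kernels.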

Data Parallel
- SPE-initiated DMA
- Large array of data fed through the SPEs
- Special case of Job Queue
[Figure: data-parallel model. The PPE and SPE0 through SPE7, each SPE running Kernel(); input blocks I0 ... In and output blocks O0 ... On reside in system memory]
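
One way to feed a large array through several workers is to give each a contiguous slice. The helper below is a portable C sketch of that index arithmetic only; `range_t` and `slice_for_worker` are hypothetical names, and the policy of handing leftover elements to the lowest-numbered workers is an assumption, not something the slides prescribe.

```c
#include <stddef.h>

typedef struct {
    size_t begin;  /* first element index (inclusive) */
    size_t end;    /* one past the last element index */
} range_t;

/*
 * Split n elements into nworkers contiguous slices whose sizes differ by
 * at most one; the first (n % nworkers) workers get one extra element.
 */
range_t slice_for_worker(size_t n, size_t nworkers, size_t id)
{
    size_t base = n / nworkers;
    size_t extra = n % nworkers;
    range_t r;
    r.begin = id * base + (id < extra ? id : extra);
    r.end = r.begin + base + (id < extra ? 1 : 0);
    return r;
}
```

For 10 elements and 8 workers, workers 0 and 1 receive two elements each and the remaining six workers receive one each.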

Task Parallel
- LS-to-LS DMA
- Flexible in pipelining functions
- Load balancing is harder
[Figure: task-parallel model. The PPE and SPE0 through SPE7, each SPE running Kernel() and passing data to the next via DMA; input blocks I0 ... In and output blocks O0 ... On reside in system memory]

Cell Terminology
- SPE Context
  - Holds information about a logical SPE
  - Used by the application
- SPE Gang Context
  - Group of threads with the same properties
- SPE Event
  - Events caused by (asynchronously) running SPE threads
  - Examples: SPE execution stopped, mailbox messages written/read, DMA operations completed

LibSPE Version 2.0
[Figure: conceptual diagram relating the SPE context (spe_context_ptr_t), SPE program (spe_program_handle_t), SPE gang context, SPE events, SPE stop info, the PPE thread (pthread_t) with its thread function, policy, and priority, the program arguments (argp) and environment (envp), and application data (application_data_t)]

Single SPE Thread
- A simple application uses a single PPE thread
- Basic scheme for a simple application using an SPE:
  1. Create an SPE context
  2. Load the executable object into the SPE context's local store
  3. Run the SPE context
     - Transfers control to the OS, requesting that the context be scheduled onto a physical SPE in the system
  4. Destroy the SPE context
- Note: Step 3 is a synchronous call. The calling application thread blocks until the SPE stops and returns.

Single Thread (Hello World)

PPU:

    #include <stdlib.h>
    #include <libspe2.h>

    int main()
    {
        spe_context_ptr_t spe;
        unsigned int createflags = 0;
        unsigned int runflags = 0;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        void *argp = NULL;
        void *envp = NULL;
        spe_program_handle_t *program;

        program = spe_image_open("hello");            /* open the SPE executable */
        spe = spe_context_create(createflags, NULL);  /* 1. create SPE context */
        spe_program_load(spe, program);               /* 2. load program into the context */
        spe_context_run(spe, &entry, runflags, argp, envp, NULL);  /* 3. run (blocks) */
        spe_image_close(program);
        spe_context_destroy(spe);                     /* 4. destroy SPE context */
        return 0;
    }

SPU:

    #include <stdio.h>

    int main()
    {
        printf("hello world\n");
        return 0;
    }

Multiple SPE Threads
- An application may want to use multiple SPEs concurrently
- Create N PPE threads for N concurrent SPE contexts
  - Each PPE thread runs a single SPE context
- Basic scheme for a simple application running N SPE contexts:
  1. Create N SPE contexts
  2. Load the SPE executable into each SPE context's local store
  3. Create N PPE threads
     - In each PPE thread, run one SPE context
     - Terminate the PPE thread
  4. Wait for all N PPE threads to terminate
  5. Destroy all N SPE contexts

Multi-threaded (Hello World)

    #include <stdlib.h>
    #include <pthread.h>
    #include <libspe2.h>

    #define N 4

    struct thread_args {
        struct spe_context *spe;
        void *argp;
        void *envp;
    };

    /* Thread entry point; pthread_create expects the signature void *(*)(void *). */
    void *my_spe_thread(void *p)
    {
        struct thread_args *arg = (struct thread_args *)p;
        unsigned int runflags = 0;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        // run SPE context (blocks until the SPE program stops)
        spe_context_run(arg->spe, &entry, runflags, arg->argp, arg->envp, NULL);

        // done - now exit thread
        pthread_exit(NULL);
    }

    int main()
    {
        pthread_t pts[N];
        spe_context_ptr_t spe[N];
        struct thread_args t_args[N];
        int value[N];
        int i;
        spe_program_handle_t *program;

        // open SPE program
        program = spe_image_open("hello");

        for (i = 0; i < N; i++) {
            // create SPE context
            spe[i] = spe_context_create(0, NULL);
            // load SPE program
            spe_program_load(spe[i], program);
            // create pthread, passing a pointer to this thread's arguments
            t_args[i].spe = spe[i];
            t_args[i].argp = &value[i];
            t_args[i].envp = NULL;
            pthread_create(&pts[i], NULL, &my_spe_thread, &t_args[i]);
        }

        // wait for all threads to finish
        for (i = 0; i < N; i++) {
            pthread_join(pts[i], NULL);
        }

        // close SPE program
        spe_image_close(program);

        // destroy SPE contexts
        for (i = 0; i < N; i++) {
            spe_context_destroy(spe[i]);
        }
        return 0;
    }

Communication Mechanisms
- DMA transfers
  - Move data and instructions between main storage and the LS
- Mailboxes
  - Communication between an SPE and the PPE or other devices
  - Hold 32-bit messages
  - Each SPE has 2 mailboxes for sending (1 entry each)
  - Each SPE has 1 mailbox for receiving (4 entries)
- Signal notification
  - 32-bit registers

DMA Get/Put Commands
- Data is moved between an effective address and the local store
- The effective address is typically in main memory, but can be in another SPE's LS
- mfc_put(lsaddr, ea, size, tag, tid, rid): local store to effective address
- mfc_get(lsaddr, ea, size, tag, tid, rid): effective address to local store
  - lsaddr: address in the SPU local store (source for a put, target for a get)
  - ea: effective address, i.e., main memory address (64 bits)
  - size: transfer size in bytes
  - tag: tag identifying this transfer; 32 different tags (0 to 31) are available
  - tid: transfer-class ID
  - rid: replacement-class ID
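
Before issuing a get or put it can help to sanity-check the arguments. The portable C sketch below encodes the size and alignment rules as I understand them from the Cell documentation: a transfer is 1, 2, 4, or 8 bytes, or a multiple of 16 bytes up to 16 KB; small transfers are naturally aligned with matching low 4 address bits; larger ones are 16-byte aligned. These rules are not stated in the slides, and `dma_size_ok`, `dma_args_ok`, and `DMA_MAX_BYTES` are hypothetical names, so the limits should be verified against the SDK in use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_BYTES 16384u  /* assumed 16 KB limit per DMA command */

/* True if size is 1, 2, 4, 8, or a nonzero multiple of 16 up to the limit. */
bool dma_size_ok(size_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        return true;
    }
    return size != 0 && size % 16 == 0 && size <= DMA_MAX_BYTES;
}

/*
 * True if the addresses suit the size: transfers under 16 bytes must be
 * naturally aligned with identical low 4 address bits on both sides;
 * larger transfers must be 16-byte aligned on both sides.
 */
bool dma_args_ok(uint32_t lsaddr, uint64_t ea, size_t size)
{
    if (!dma_size_ok(size)) {
        return false;
    }
    if (size < 16) {
        return (lsaddr % size == 0) && (ea % size == 0) &&
               ((lsaddr & 0xF) == (ea & 0xF));
    }
    return (lsaddr % 16 == 0) && (ea % 16 == 0);
}
```

A check like this is cheap to run in debug builds and can catch misaligned buffers before they turn into bus errors on hardware.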

DMA Read into Local Store

    inline void dma_mem_to_ls(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
    {
        unsigned int tag = 0;
        unsigned int mask = 1;                        /* bit 0 selects tag 0 */
        mfc_get(ls_addr, mem_addr, size, tag, 0, 0);  /* read contents of mem_addr into ls_addr */
        mfc_write_tag_mask(mask);                     /* set tag mask */
        mfc_read_tag_status_all();                    /* wait until all DMAs with this tag complete */
    }

DMA Write to Main Memory

    inline void dma_ls_to_mem(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
    {
        unsigned int tag = 0;
        unsigned int mask = 1;                        /* bit 0 selects tag 0 */
        mfc_put(ls_addr, mem_addr, size, tag, 0, 0);  /* write contents of ls_addr to mem_addr */
        mfc_write_tag_mask(mask);                     /* set tag mask */
        mfc_read_tag_status_all();                    /* wait until all DMAs with this tag complete */
    }

Double Buffer Example
- Handling DMA latency is critical to overall performance
- Data prefetching is a key technique to hide DMA latency
[Figure: double-buffering timeline. Two input buffers (I Buf 1, I Buf 2) and two output buffers (O Buf 1, O Buf 2) alternate, so that while the SPE executes Func(input n), DMAs write output n-1 and fetch input n+1]

Double Buffer Example

    /* Example C code demonstrating double buffering using buffers B[0] and B[1].
       In this example, an array of data starting at the effective address ea is
       DMAed into the SPU's local store in 4 KB chunks and processed by the
       use_data subroutine. */
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define BUFFER_SIZE 4096

    volatile unsigned char B[2][BUFFER_SIZE] __attribute__ ((aligned(128)));

    /* Assumed to be defined elsewhere: consumes one filled buffer. */
    void use_data(volatile unsigned char *buf);

    void double_buffer_example(unsigned int ea, int buffers)
    {
        int next_idx, idx = 0;

        // Initiate first DMA transfer (the tag equals the buffer index)
        spu_mfcdma32(B[idx], ea, BUFFER_SIZE, idx, MFC_GET_CMD);
        ea += BUFFER_SIZE;

        while (--buffers) {
            next_idx = idx ^ 1;  // toggle buffer index
            // Prefetch the next chunk under the next buffer's tag
            spu_mfcdma32(B[next_idx], ea, BUFFER_SIZE, next_idx, MFC_GET_CMD);
            ea += BUFFER_SIZE;

            spu_writech(MFC_WrTagMask, 1 << idx);
            (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);  // Wait for previous transfer done
            use_data(B[idx]);                       // Use the previous data
            idx = next_idx;
        }

        spu_writech(MFC_WrTagMask, 1 << idx);
        (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);      // Wait for last transfer done
        use_data(B[idx]);                           // Use the last data
    }

Mailboxes
- Communicate messages up to 32 bits in length
  - E.g., buffer completion flags or program status
  - E.g., when an SPE places results in main storage via DMA, it can wait until the DMA transfer completes and then write to its outbound mailbox to notify the PPE
- Short data transfers
  - Storage addresses, function parameters
- Can be used to communicate between SPEs, the PPE, or other devices
  - Privileged software needs to allow one SPE to access the mailbox register in another SPE
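
The behaviour of a small mailbox can be modeled as a bounded FIFO of 32-bit messages. The sketch below is portable C, not the SPE mailbox interface: `mailbox_t`, `mbox_write`, and `mbox_read` are hypothetical names, and the depth of 4 is chosen to match the 4-entry receiving mailbox. Real mailbox accesses can stall or be polled; here a full or empty mailbox simply reports failure.

```c
#include <stdbool.h>
#include <stdint.h>

#define MBOX_DEPTH 4  /* models the 4-entry inbound mailbox */

typedef struct {
    uint32_t slot[MBOX_DEPTH];
    unsigned head;   /* index of the oldest message */
    unsigned count;  /* number of queued messages */
} mailbox_t;

/* Queue a 32-bit message; returns false (caller must retry) when full. */
bool mbox_write(mailbox_t *m, uint32_t msg)
{
    if (m->count == MBOX_DEPTH) {
        return false;
    }
    m->slot[(m->head + m->count) % MBOX_DEPTH] = msg;
    m->count++;
    return true;
}

/* Dequeue the oldest message; returns false when the mailbox is empty. */
bool mbox_read(mailbox_t *m, uint32_t *msg)
{
    if (m->count == 0) {
        return false;
    }
    *msg = m->slot[m->head];
    m->head = (m->head + 1) % MBOX_DEPTH;
    m->count--;
    return true;
}
```

A writer that finds the mailbox full has to wait for the reader to drain an entry, which is why mailboxes suit short flags and addresses rather than bulk data.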