CISC 879 : Software Support for Multicore Architectures John Cavazos Dept of Computer & Information Sciences University of Delaware Lecture 9 Cell Programming Tutorial
Lecture 9: Overview
  - Cell Basics
  - Programming Models
  - Programming Details
  - Example Code
Cell Architecture Recap
  - Heterogeneous architecture: 9 cores on chip
      - 1 PPE (general-purpose processor)
      - 8 SPEs (SIMD processors)
  - The PPE runs control-plane code
      - Code with many branches (e.g., the OS)
  - The SPEs run data-plane code
      - Computational code with few branches
Program Structure
  - Multiple programs in one: PPU and SPU programs cooperate
  - PPE code
      - Regular Linux process (main thread)
      - The process can spawn SPE threads
  - SPE code
      - SPE executables are packaged inside PPE executables
SPE Details
  - Register file: large (128 entries), 128 bits wide, and unified
  - All instructions are SIMD instructions
  - Local Store (LS, 256 KB)
      - Loads and stores access the LS
      - Contains all instructions and data used by the SPU
  - DMA transfers data between the LS and main storage
      - High bandwidth (128 bytes per cycle)
  - Non-deterministic features are eliminated; the SPE has no
      - Out-of-order execution
      - Hardware-managed caches
      - Hardware branch prediction
SPE Register Layout
  - The left-most word (bytes 0, 1, 2, and 3) of a register is called the preferred slot
SPE SIMD Example (add)
  - The example is a 4-wide add
      - Each of the 4 elements in register VA is added to the corresponding element in register VB
      - The 4 results are placed in the corresponding slots of register VC
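The 4-wide add can be modeled in portable C as a rough sketch. This is not SPU code: the type `vec4_u32` and the functions `vec4_add` and `vec4_extract` are hypothetical names introduced here to stand in for a 128-bit register and its four word slots.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit SPE register viewed as four 32-bit words. */
typedef struct { uint32_t w[4]; } vec4_u32;

/* Slot-wise add: slot i of the result is va.w[i] + vb.w[i] (modulo 2^32). */
static vec4_u32 vec4_add(vec4_u32 va, vec4_u32 vb)
{
    vec4_u32 vc;
    for (int i = 0; i < 4; i++) {
        vc.w[i] = va.w[i] + vb.w[i];
    }
    return vc;
}

/* Read one word slot; slot 0 models the preferred slot (bytes 0 to 3). */
static uint32_t vec4_extract(vec4_u32 v, unsigned slot)
{
    assert(slot < 4);
    return v.w[slot];
}
```

On a real SPE the whole loop corresponds to a single instruction; the loop here only spells out which slots are combined.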
SPE SIMD Example (shuffle)
  - Bytes are selected from registers VA and VB based on control vector VC
  - Each control vector entry is a byte index into VA or VB
  - The operation is purely byte oriented
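A minimal portable C model of the byte shuffle follows. It assumes the simple case only: each control byte is an index from 0 to 31, where 0 to 15 select a byte of VA and 16 to 31 select a byte of VB. Any special control values the real instruction supports are left out, and the names `vec16_u8` and `byte_shuffle` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit register viewed as 16 bytes. */
typedef struct { uint8_t b[16]; } vec16_u8;

/* For each result byte i, vc.b[i] picks a source byte:
 * 0..15 index into va, 16..31 index into vb. */
static vec16_u8 byte_shuffle(vec16_u8 va, vec16_u8 vb, vec16_u8 vc)
{
    vec16_u8 out;
    for (int i = 0; i < 16; i++) {
        unsigned sel = vc.b[i];
        assert(sel < 32);  /* this sketch models plain selection only */
        out.b[i] = (sel < 16) ? va.b[sel] : vb.b[sel - 16];
    }
    return out;
}
```

A control vector of 0, 16, 1, 17, ... would interleave the leading bytes of VA and VB, which shows why the operation is useful for reordering data without going through memory.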
SPE Model
  - Code and data must fit into the 256 KB local store
  - Explicit input/output of the SPE program
      - Program arguments and return code
      - DMA
      - Mailboxes
      - Signals
  [Figure: the PPE maps system memory for SPE DMA; DMA transactions move data between the SPE program's local store and system memory.]
SPE Model (cont'd)
  - Streaming model for large input/output data
  [Figure: system memory holds int g_ip[512*1024] and int g_op[512*1024]; the local store holds int ip[32] and int op[32]; the SPE program computes op = func(ip), with DMA moving data between system memory and the local store.]
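The streaming model can be sketched in portable C, assuming `memcpy` as a stand-in for the DMA transfers and a placeholder `func` that doubles each element. Neither is from the lecture; `stream_process` and `CHUNK` are hypothetical names, with `CHUNK` chosen to match the 32-element local buffers on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 32  /* elements per local buffer, as in ip[32] and op[32] */

/* Placeholder kernel for op = func(ip); here it doubles each element. */
static void func(const int *ip, int *op, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        op[i] = 2 * ip[i];
    }
}

/* Stream `total` ints from g_ip to g_op one chunk at a time.
 * memcpy stands in for the DMA get and put of the real program. */
static void stream_process(const int *g_ip, int *g_op, size_t total)
{
    int ip[CHUNK];
    int op[CHUNK];

    assert(total % CHUNK == 0);  /* sketch handles whole chunks only */
    for (size_t off = 0; off < total; off += CHUNK) {
        memcpy(ip, g_ip + off, sizeof ip);  /* models DMA: system memory to local store */
        func(ip, op, CHUNK);
        memcpy(g_op + off, op, sizeof op);  /* models DMA: local store to system memory */
    }
}
```

Only two small buffers are live at any time, which is how a data set far larger than the local store can pass through an SPE.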
Programming Models
  - How the application and its data are partitioned among the PPE and SPEs
  - Partitioning involves considering
      - Program structure
      - Data structures
      - Movement of data and code via DMA
  - Several models:
      - Data-parallel
      - Task-parallel
      - Job queue
Job Queue
  - Code and data are packaged together
  [Figure: a job queue in system memory holds code/data n, code/data n+1, code/data n+2, ...; the SPE kernel uses DMA to bring code n and data n into the local store.]
Data Parallel
  - SPE-initiated DMA
  - Large array of data fed through the SPEs
  - Special case of Job Queue
  [Figure (data-parallel): the PPE and SPE0 through SPE7, each SPE running Kernel(); system memory holds input blocks I0 ... In and output blocks O0 ... On.]
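One way to feed a large array through the SPEs is a static, contiguous split of the input blocks. The sketch below is an assumption for illustration rather than the lecture's scheme (the slide treats data-parallel as a special case of the job queue); `block_range` and `spe_range` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Contiguous range of blocks assigned to one SPE. */
typedef struct {
    size_t first;  /* index of the first block */
    size_t count;  /* number of blocks */
} block_range;

/* Split n_blocks among n_spes as evenly as possible.
 * The first (n_blocks % n_spes) SPEs each receive one extra block. */
static block_range spe_range(size_t n_blocks, unsigned n_spes, unsigned spe_id)
{
    assert(n_spes > 0 && spe_id < n_spes);

    size_t base = n_blocks / n_spes;
    size_t extra = n_blocks % n_spes;
    block_range r;

    r.count = base + (spe_id < extra ? 1 : 0);
    r.first = (size_t)spe_id * base + (spe_id < extra ? spe_id : extra);
    return r;
}
```

Each SPE would then issue its own DMA requests for blocks `first` through `first + count - 1`, which matches the "SPE-initiated DMA" point above.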
Task Parallel
  - LS-to-LS DMA
  - Flexible in pipelining functions
  - Load balancing is harder
  [Figure (task-parallel): the PPE and SPE0 through SPE7, each SPE running Kernel() and connected by DMA; system memory holds input blocks I0 ... In and output blocks O0 ... On.]
Cell Terminology
  - SPE Context
      - Holds information about a logical SPE
      - Used by the application
  - SPE Gang Context
      - Group of threads with the same properties
  - SPE Event
      - Events caused by (asynchronously) running SPE threads
      - Examples: SPE execution stopped, mailbox messages written/read, DMA operations completed
LibSPE Version 2.0
  [Conceptual diagram. Labels shown: SPE Context (spe_context_ptr_t), events, SPE Stopinfo, SPE Gang Context, SPE Program (spe_program_handle_t), arguments (argp), environment (envp), PPE Thread (pthread_t), thread function, policy, priority, Application Data (application_data_t).]
Single SPE Thread
  - A simple application uses a single PPE thread
  - Basic scheme for a simple application using an SPE:
      1. Create an SPE context
      2. Load an executable object into the SPE context's local store
      3. Run the SPE context
         - Transfers control to the OS, requesting that the context be scheduled on a physical SPE in the system
      4. Destroy the SPE context
  - Note: Step 3 is a synchronous call. The calling application thread blocks until the SPE stops and returns.
Single Thread (Hello World)

PPU:

    #include <stdio.h>
    #include <libspe2.h>

    int main()
    {
        spe_context_ptr_t spe;
        unsigned int createflags = 0;
        unsigned int runflags = 0;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        void * argp = NULL;
        void * envp = NULL;
        spe_program_handle_t * program;

        // open the SPE program image
        program = spe_image_open("hello");

        // create an SPE context and load the program into its local store
        spe = spe_context_create(createflags, NULL);
        spe_program_load(spe, program);

        // run the SPE context; blocks until the SPE program stops
        spe_context_run(spe, &entry, runflags, argp, envp, NULL);

        // release the program image and the context
        spe_image_close(program);
        spe_context_destroy(spe);
        return 0;
    }

SPU:

    #include <stdio.h>

    int main()
    {
        printf("hello world\n");
        return 0;
    }
Multiple SPE Threads
  - An application may want to use multiple SPEs concurrently
  - Create N PPE threads for N concurrent SPE contexts
  - Each PPE thread runs a single SPE context
  - Basic scheme for a simple application running N SPE contexts:
      1. Create N SPE contexts
      2. Load the SPE executable into each SPE context's local store
      3. Create N PPE threads
         - In each PPE thread, run one SPE context
         - Terminate the PPE thread
      4. Wait for all N PPE threads to terminate
      5. Destroy all N SPE contexts
Multi-threaded (Hello World)

    #include <stdio.h>
    #include <pthread.h>
    #include <libspe2.h>

    #define N 4

    struct thread_args {
        spe_context_ptr_t spe;
        void * argp;
        void * envp;
    };

    // pthread start routine: runs one SPE context and blocks until it stops
    void * my_spe_thread(void * voidarg)
    {
        struct thread_args * arg = (struct thread_args *) voidarg;
        unsigned int runflags = 0;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        // run SPE context
        spe_context_run(arg->spe, &entry, runflags, arg->argp, arg->envp, NULL);

        // done - now exit thread
        pthread_exit(NULL);
    }

    int main()
    {
        pthread_t pts[N];
        spe_context_ptr_t spe[N];
        struct thread_args t_args[N];
        int value[N];
        int i;
        spe_program_handle_t * program;

        // open SPE program
        program = spe_image_open("hello");

        for ( i=0; i<N; i++ ) {
            // create SPE context
            spe[i] = spe_context_create(0, NULL);

            // load SPE program
            spe_program_load(spe[i], program);

            // create pthread; pass a pointer to this thread's arguments
            t_args[i].spe = spe[i];
            t_args[i].argp = &value[i];
            t_args[i].envp = NULL;
            pthread_create(&pts[i], NULL, &my_spe_thread, &t_args[i]);
        }

        // wait for all threads to finish
        for ( i=0; i<N; i++ ) {
            pthread_join(pts[i], NULL);
        }

        // close SPE program
        spe_image_close(program);

        // destroy SPE contexts
        for ( i=0; i<N; i++ ) {
            spe_context_destroy(spe[i]);
        }
        return 0;
    }
Communication Mechanisms
  - DMA transfers
      - Move data and instructions between main storage and the LS
  - Mailboxes
      - Communication between an SPE and the PPE or other devices
      - Hold 32-bit messages
      - 2 mailboxes for sending (1 entry each)
      - 1 mailbox for receiving (4 entries)
  - Signal notification
      - 32-bit registers
DMA Get/Put Commands
  - Data is moved between an effective address and the local store
  - The effective address is typically in main memory, but can be in another SPE's LS
  - mfc_put(lsaddr, ea, size, tag, tid, rid)
  - mfc_get(lsaddr, ea, size, tag, tid, rid)
      - lsaddr: address in the SPU local store (destination for mfc_get, source for mfc_put)
      - ea: effective address, i.e., main memory address (64 bits)
      - size: transfer size in bytes
      - tag: tag identifying this transfer; 32 tags (0 to 31) are available
      - tid: transfer-class ID
      - rid: replacement-class ID
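The `size` argument is constrained by the hardware. The sketch below encodes the constraints as they are commonly documented for the MFC, stated here as assumptions not taken from the slides: a transfer is 1, 2, 4, or 8 bytes, or a multiple of 16 bytes up to 16 KB, and both addresses are aligned to the transfer size (16 bytes for the larger transfers). The helper names are hypothetical, and further hardware rules are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_BYTES 16384u  /* assumed per-command limit (16 KB) */

/* True if `size` is an allowed single-transfer size under the assumed rules. */
static bool dma_size_ok(uint32_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        return true;
    }
    return size != 0 && size % 16 == 0 && size <= DMA_MAX_BYTES;
}

/* True if size and both addresses satisfy the assumed alignment rules. */
static bool dma_request_ok(uint32_t lsaddr, uint64_t ea, uint32_t size)
{
    if (!dma_size_ok(size)) {
        return false;
    }
    uint32_t align = (size < 16) ? size : 16;
    return lsaddr % align == 0 && ea % align == 0;
}

/* Number of DMA commands needed to move `total` bytes in pieces
 * of at most DMA_MAX_BYTES each. */
static uint32_t dma_command_count(uint32_t total)
{
    assert(total % 16 == 0);  /* sketch handles 16-byte multiples only */
    return (total + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}
```

A check like this is mainly useful during debugging, since a malformed request is otherwise reported only at run time on the SPE.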
DMA Read into Local Store

    #include <spu_mfcio.h>

    // Read size bytes from main memory at mem_addr into the local store at ls_addr
    static inline void dma_mem_to_ls(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
    {
        unsigned int tag = 0;
        unsigned int mask = 1 << tag;                 // mask bit for the tag used below

        mfc_get(ls_addr, mem_addr, size, tag, 0, 0);  // read contents of mem_addr into ls_addr
        mfc_write_tag_mask(mask);                     // set tag mask
        mfc_read_tag_status_all();                    // wait until all DMAs with this tag have completed
    }
DMA Write to Main Memory

    #include <spu_mfcio.h>

    // Write size bytes from the local store at ls_addr to main memory at mem_addr
    static inline void dma_ls_to_mem(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
    {
        unsigned int tag = 0;
        unsigned int mask = 1 << tag;                 // mask bit for the tag used below

        mfc_put(ls_addr, mem_addr, size, tag, 0, 0);  // write contents of ls_addr to mem_addr
        mfc_write_tag_mask(mask);                     // set tag mask
        mfc_read_tag_status_all();                    // wait until all DMAs with this tag have completed
    }
Double Buffer Example
  - Handling DMA latency is critical to overall performance
  - Data prefetching is a key technique to hide DMA latency
  [Figure: timeline for an SPE program computing Func(n) with two input buffers (I Buf 1, I Buf 2) and two output buffers (O Buf 1, O Buf 2). DMAs (input n, output n-1, input n+1, output n, input n+2) overlap with SPE execution (Func(input n-1), Func(input n), Func(input n+1)).]
Double Buffer Example

    /* Example C code demonstrating double buffering using buffers B[0] and B[1].
       In this example, an array of data starting at the effective address ea is
       DMAed into the SPU's local store in 4 KB chunks and processed by the
       use_data subroutine. */

    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define BUFFER_SIZE 4096

    volatile unsigned char B[2][BUFFER_SIZE] __attribute__ ((aligned(128)));

    // Consumer of one filled buffer; defined elsewhere
    void use_data(volatile unsigned char *buf);

    void double_buffer_example(unsigned int ea, int buffers)
    {
        int next_idx, idx = 0;

        // Initiate first DMA transfer; the tag equals the buffer index
        spu_mfcdma32(B[idx], ea, BUFFER_SIZE, idx, MFC_GET_CMD);
        ea += BUFFER_SIZE;

        while (--buffers) {
            next_idx = idx ^ 1;  // toggle buffer index

            // Start fetching the next chunk under the other buffer's tag
            spu_mfcdma32(B[next_idx], ea, BUFFER_SIZE, next_idx, MFC_GET_CMD);
            ea += BUFFER_SIZE;

            spu_writech(MFC_WrTagMask, 1 << idx);
            (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);  // Wait for previous transfer done

            use_data(B[idx]);  // Use the previous data
            idx = next_idx;
        }

        spu_writech(MFC_WrTagMask, 1 << idx);
        (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);  // Wait for last transfer done
        use_data(B[idx]);  // Use the last data
    }
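The double-buffering loop above relies on SPU intrinsics, so it can only be built with the Cell toolchain. For experimenting with the control flow elsewhere, here is a portable sketch of the same buffer rotation. It assumes `memcpy` in place of the DMA get, which means nothing actually overlaps; only the ordering of fetch and use is modeled. `BUF_BYTES`, `use_data_sum`, and `double_buffer_sum` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_BYTES 64  /* small chunk size chosen for the sketch */

/* Placeholder consumer: returns the sum of the bytes in one buffer. */
static unsigned long use_data_sum(const unsigned char *buf)
{
    unsigned long s = 0;
    for (size_t i = 0; i < BUF_BYTES; i++) {
        s += buf[i];
    }
    return s;
}

/* Process `buffers` chunks from src, alternating between B[0] and B[1]. */
static unsigned long double_buffer_sum(const unsigned char *src, int buffers)
{
    static unsigned char B[2][BUF_BYTES];
    unsigned long total = 0;
    int idx = 0;

    assert(buffers >= 1);

    memcpy(B[idx], src, BUF_BYTES);  /* models the first DMA get */
    src += BUF_BYTES;

    while (--buffers) {
        int next_idx = idx ^ 1;                /* toggle buffer index */
        memcpy(B[next_idx], src, BUF_BYTES);   /* models prefetch of the next chunk */
        src += BUF_BYTES;
        total += use_data_sum(B[idx]);         /* consume the current chunk */
        idx = next_idx;
    }

    total += use_data_sum(B[idx]);  /* consume the last chunk */
    return total;
}
```

On the SPE, the prefetch returns immediately and the wait on the current buffer's tag happens just before the data is used, which is where the latency hiding comes from.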
Mailboxes
  - Communicate messages up to 32 bits in length
      - E.g., buffer-completion flags or program status
      - E.g., when an SPE places results in main storage via DMA, it can wait until the DMA transfer completes and then write to its outbound mailbox to notify the PPE
  - Short data transfers
      - Storage addresses, function parameters
  - Can be used to communicate between SPEs, the PPE, or other devices
      - Privileged software needs to allow one SPE to access the mailbox register of another SPE
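Mailbox behavior can be approximated in portable C as a small FIFO of 32-bit messages. The sketch assumes the 4-entry depth given earlier for the receiving mailbox and reports a full or empty queue by return value instead of stalling. The `mailbox` type and the `mbox_write` and `mbox_read` functions are hypothetical and are not part of any Cell API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MBOX_ENTRIES 4  /* depth of the modeled mailbox */

typedef struct {
    uint32_t slot[MBOX_ENTRIES];
    unsigned head;   /* index of the oldest message */
    unsigned count;  /* number of queued messages */
} mailbox;

/* Queue one 32-bit message; returns false if the mailbox is full. */
static bool mbox_write(mailbox *m, uint32_t msg)
{
    assert(m != NULL);
    if (m->count == MBOX_ENTRIES) {
        return false;
    }
    m->slot[(m->head + m->count) % MBOX_ENTRIES] = msg;
    m->count++;
    return true;
}

/* Dequeue the oldest message into *msg; returns false if the mailbox is empty. */
static bool mbox_read(mailbox *m, uint32_t *msg)
{
    assert(m != NULL && msg != NULL);
    if (m->count == 0) {
        return false;
    }
    *msg = m->slot[m->head];
    m->head = (m->head + 1) % MBOX_ENTRIES;
    m->count--;
    return true;
}
```

A typical use would be for the producer to write a buffer index or a completion flag after its DMA has finished, and for the consumer to read it before touching the data.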