Cell/B.E. Jiří Dokulil


1 Cell/B.E. Jiří Dokulil

2 Introduction
  Cell Broadband Engine
    developed by Sony, Toshiba and IBM
  64bit PowerPC
  PowerPC Processor Element (PPE)
    runs OS
    SIMD
  Synergistic Processor Element (SPE)
    8x
    computations, no OS
  big endian

3 Architecture

4 Memory access
  PPE
    load & store
    cache
  SPE
    DMA
      up to 16 concurrent per SPE
    no direct access to memory
      no need for out-of-order processing, no speculation
    local storage
      no cache

5 PPE
  PowerPC Processor Element
  PPU (PowerPC Processor Unit)
  PPSS (PowerPC Processor Storage Subsystem)
  64-bit, dual-thread PowerPC Architecture RISC core
  2x32KB L1 (instructions and data)
  512KB L2 (unified)
  PowerPC instruction set
  vector/SIMD extensions – different from SPE
    32x 128bit vector registers

6 SPE
  Synergistic Processor Element
  SPU (Synergistic Processor Unit)
  MFC (Memory Flow Controller)
  RISC, SIMD
  Synergistic Processor Unit Instruction Set Architecture
    support for DMA and interprocessor messaging
  256KB LS
  128x128bit register file
  DMA access to main memory
    segment and page tables of PPE
  channels in MFC
    unidirectional message-passing interfaces
  memory-mapped I/O (MMIO) registers and queues

7 EIB
  Element Interconnect Bus
  four 16-byte-wide data rings
  transfer 128byte at a time (one PPE cache line)
  internal bandwidth 96bytes per clock cycle
  latency depends on the number of hops
    bus is a ring
  half frequency of SPU

8 DMA
  MFCs support naturally aligned DMA transfer sizes of 1, 2, 4, or 8 bytes, and multiples of 16 bytes
  maximum transfer size of 16 KB per transfer
  DMA list commands can initiate up to 2048 transfers
  peak transfer performance if both the effective addresses and the LS addresses are 128-byte aligned and the size of the transfer is an even multiple of 128 bytes
  SMM (Synergistic Memory Management) unit
    processes address translation
    access-permission information
    data supplied by the PPE operating system
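The size and alignment rules on this slide can be written down as a few predicates. This is a minimal sketch in portable C, not SDK code: the function names are hypothetical, zero-length transfers are rejected as a simplifying choice, "even multiple of 128 bytes" is read as "whole multiple", and further hardware constraints that the slide does not mention are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_TRANSFER 16384u /* 16 KB per transfer, as stated on the slide */

/* Hypothetical helper: true for 1, 2, 4, 8 or a multiple of 16, up to 16 KB. */
static bool dma_size_is_valid(uint32_t size)
{
    if (size == 0 || size > DMA_MAX_TRANSFER)
        return false;
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return true;
    return (size % 16u) == 0;
}

/* Hypothetical helper: small transfers align to their own size,
 * larger ones to 16 bytes, for both the effective and the LS address. */
static bool dma_is_naturally_aligned(uint64_t ea, uint32_t ls, uint32_t size)
{
    if (!dma_size_is_valid(size))
        return false;
    uint32_t align = size < 16u ? size : 16u;
    return (ea % align) == 0 && (ls % align) == 0;
}

/* Hypothetical helper: the peak-performance condition from the slide. */
static bool dma_is_peak(uint64_t ea, uint32_t ls, uint32_t size)
{
    return dma_size_is_valid(size) &&
           (ea % 128u) == 0 && (ls % 128u) == 0 && (size % 128u) == 0;
}
```

With the values used in the PPE example later in the deck (LS address 0x500, effective address 0x10000000, size 0x4000), all three predicates hold.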

9 SIMD example
  // 16 iterations of a loop
  int rolled_sum(unsigned char bytes[16])
  {
      int i;
      int sum = 0;
      for (i = 0; i < 16; ++i) {
          sum += bytes[i];
      }
      return sum;
  }

10 SIMD example cont.
  // Vectorized for vector/SIMD multimedia extension
  int vectorized_sum(unsigned char bytes[16])
  {
      vector unsigned char vbytes;
      union {
          int i[4];
          vector signed int v;
      } sum;
      vector unsigned int zero = (vector unsigned int){0};

      // Perform a misaligned vector load of the 16 bytes.
      vbytes = vec_perm(vec_ld(0, bytes), vec_ld(16, bytes), vec_lvsl(0, bytes));

      // Sum the 16 bytes of the vector
      sum.v = vec_sums((vector signed int)vec_sum4s(vbytes, zero),
                       (vector signed int)zero);

      // Extract the sum and return the result.
      return (sum.i[3]);
  }
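To see why the vectorized version returns element 3, it may help to model the two summing intrinsics in scalar C. The sketch below is an approximation that runs on any machine: the `model_*` names are hypothetical, the saturation that the real signed instructions perform is omitted (it cannot trigger for 16 byte values), and the misaligned load is not modeled.

```c
#include <stdint.h>

/* Model of vec_sum4s: each 32-bit lane receives the sum of its four
 * bytes plus the matching lane of the accumulator. */
static void model_sum4s(const uint8_t bytes[16], const int32_t acc[4],
                        int32_t out[4])
{
    for (int lane = 0; lane < 4; ++lane) {
        out[lane] = acc[lane];
        for (int b = 0; b < 4; ++b)
            out[lane] += bytes[4 * lane + b];
    }
}

/* Model of vec_sums: lane 3 receives the sum of all four lanes of a
 * plus lane 3 of b; the other lanes are cleared. Saturation omitted. */
static void model_sums(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    out[0] = out[1] = out[2] = 0;
    out[3] = a[0] + a[1] + a[2] + a[3] + b[3];
}

/* Scalar stand-in for vectorized_sum from the slide. */
static int modeled_vectorized_sum(const uint8_t bytes[16])
{
    const int32_t zero[4] = {0, 0, 0, 0};
    int32_t partial[4];
    int32_t total[4];

    model_sum4s(bytes, zero, partial); /* four partial sums */
    model_sums(partial, zero, total);  /* one sum across the lanes */
    return total[3];
}
```

The two steps replace sixteen scalar additions with one instruction that produces four partial sums and one that combines them.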

11 Communication
  DMA
    2 command queues per SPE
      one for commands by SPE
      one for commands by PPE and other SPEs
    commands have tags (32 different) – status query
    one transfer or a list
  mailboxes
    for each SPE
    communication with PPE
    2 outgoing (1 message)
    1 incoming (4 messages)
  signals
    2 inbound channels

12 DMA
  put, get
  SPE or PPE initiated
  tag
    5bit
  ordering
    out of order
    barrier – maintains order (within tag group)
    fence – after all previous (within tag group)
  simple or lists
    lists stored in LS (8bytes per item) -> SPE only
    up to 2048 transfers, 16KB each -> 32MB
      compare to 256KB LS size
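The list arithmetic above (8 bytes per item, up to 2048 items of at most 16KB each) can be sketched as follows. The `dma_list_item` struct and `build_dma_list` function are hypothetical illustrations; the SDK defines its own list element type with a different field layout, and a real list would be handed to a list DMA command rather than inspected on the host.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_TRANSFER   16384u /* 16KB per list item */
#define DMA_LIST_MAX_ITEMS 2048u  /* items per list command */

/* Hypothetical 8-byte list item: transfer size and low 32 bits of the
 * effective address. */
struct dma_list_item {
    uint32_t size;
    uint32_t eal;
};

/* Split one contiguous region into list items of at most 16KB each.
 * Returns the number of items written, or 0 if the region does not fit
 * into the given capacity or into a single list. */
static size_t build_dma_list(struct dma_list_item *list, size_t capacity,
                             uint32_t eal, uint32_t total)
{
    size_t n = 0;

    while (total > 0) {
        uint32_t chunk = total < DMA_MAX_TRANSFER ? total : DMA_MAX_TRANSFER;

        if (n == capacity || n == DMA_LIST_MAX_ITEMS)
            return 0;
        list[n].size = chunk;
        list[n].eal = eal;
        ++n;
        eal += chunk;
        total -= chunk;
    }
    return n;
}
```

A full list of 2048 items occupies 16KB of the 256KB local store while describing up to 32MB of main memory, which is why such data has to be streamed through the LS in pieces.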

13 DMA – PPE raw access
  MFC registers mapped to virtual address space
    void *ps = get_ps(); // get the problem state – must be mapped by privileged software
    unsigned int ls = 0x500;
    unsigned long long ea = 0x10000000;
    unsigned int size = 0x4000;
    unsigned int tag = 5;
    unsigned int classid = 0;
    unsigned int cmd = MFC_GET_CMD;
    unsigned int cmd_status;
    do {
        *((volatile unsigned int *)(ps + MFC_LSA)) = ls;
        *((volatile unsigned long long *)(ps + MFC_EAH)) = ea;
        *((volatile unsigned int *)(ps + MFC_Size)) = (size << 16) | tag;
        *((volatile unsigned int *)(ps + MFC_ClassID)) = (classid << 16) | cmd;
        /* Read MFC_CMDStatus to enqueue command and check enqueue success. */
        cmd_status = *((volatile unsigned int *)(ps + MFC_CMDStatus)) & 0x3;
    } while (cmd_status); /* Attempt to enqueue until success */
  only enqueues the command

14 DMA – PPE raw access cont.
  test for completion (poll tag group status)
    void *ps = get_ps();
    unsigned int tag_mask = 1 << 5;
    unsigned int tag_status;
    *((volatile unsigned int *)(ps + Prxy_QueryMask)) = tag_mask;
    __asm__("eieio"); /* force write to Prxy_QueryMask to complete */
    do {
        tag_status = *((volatile unsigned int *)(ps + Prxy_TagStatus));
    } while (!tag_status);
  more tag groups
    unsigned int tag_mask = (1u<<5)|(1u<<14)|(1u<<31);
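Since a tag is a 5-bit number and the mask has one bit per tag group, building and testing masks is plain bit arithmetic. The helpers below are a hypothetical sketch, not part of any SDK header, and assume that a set bit in the status word means the corresponding tag group has no outstanding commands.

```c
#include <stdint.h>

#define MFC_TAG_COUNT 32u /* 5-bit tag -> 32 tag groups */

/* Bit for one tag group; tags outside 0..31 contribute nothing. */
static uint32_t tag_bit(unsigned tag)
{
    return tag < MFC_TAG_COUNT ? (UINT32_C(1) << tag) : 0u;
}

/* Combine several tag ids into one query mask. */
static uint32_t tag_mask_of(const unsigned *tags, unsigned count)
{
    uint32_t mask = 0;

    for (unsigned i = 0; i < count; ++i)
        mask |= tag_bit(tags[i]);
    return mask;
}

/* At least one of the masked tag groups has completed. */
static int tags_any_done(uint32_t status, uint32_t mask)
{
    return (status & mask) != 0;
}

/* Every masked tag group has completed. */
static int tags_all_done(uint32_t status, uint32_t mask)
{
    return (status & mask) == mask;
}
```

The polling loop on the slide stops as soon as any masked group reports completion; waiting for all of them would compare the status against the full mask instead.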

15 DMA – SPE
  no direct access to the virtual address space
    only by DMA
  direct access to own command channels
    wrch assembly instruction
  extern void dma_transfer(volatile void *lsa,  // local storage address
                           unsigned int eah,    // high 32-bit effective address
                           unsigned int eal,    // low 32-bit effective address
                           unsigned int size,   // transfer size in bytes
                           unsigned int tag_id, // tag identifier (0-31)
                           unsigned int cmd);   // DMA command
  in assembler:
    wrch $MFC_LSA, $3
    wrch $MFC_EAH, $4
    wrch $MFC_EAL, $5
    wrch $MFC_Size, $6
    wrch $MFC_TagID, $7
    wrch $MFC_Cmd, $8
  in C intrinsic:
    spu_mfcdma64(lsa, eah, eal, size, tag_id, cmd);

16 DMA – SPE cont.
  poll for completion
      # Set tag group mask
      wrch $MFC_WrTagMask, $0
      # Set up for immediate tag status update.
      il $1, 0
    repeat:
      wrch $MFC_WrTagUpdate, $1
      rdch $1, $MFC_RdTagStat
      brz $1, repeat
  OR
    #include <spu_mfcio.h>
    unsigned int tag_id = 0;
    unsigned int tag_mask = 1 << tag_id;
    spu_writech(MFC_WrTagMask, tag_mask);
    do {
    } while (!spu_mfcstat(MFC_TAG_UPDATE_IMMEDIATE)); /* poll for update */

17 DMA – SPE cont.
  wait for completion (stall SPE)
      # Set tag group mask
      wrch $MFC_WrTagMask, $0
      # 0x1 for any tag, 0x2 for all tags.
      il $1, 0x1
      # Wait for conditional tag status update (stall the SPU).
      wrch $MFC_WrTagUpdate, $1
      rdch $1, $MFC_RdTagStat
  OR
    #include <spu_mfcio.h>
    unsigned int tag_id = 0;
    unsigned int tag_mask = 1 << tag_id;
    spu_writech(MFC_WrTagMask, tag_mask);
    /* Wait for all ids in tag group to complete (stall the SPU) */
    spu_mfcstat(MFC_TAG_UPDATE_ALL);

18 DMA – SPE cont.
  completion of DMA
    source buffer can be reused
    data may not have yet been written to the main storage
      mailbox-ed notification can reach PPE before the data
      SPE can do mfcsync
      PPE can do lwsync -> more efficient
      SPE can notify via DMA -> mfceieio must be used between DMAs for ordering

19 Mailboxes
  32bit messages
  blocking for SPE (stalls SPE)
    reading of empty inbound
    writing of full outbound
    SPE can poll the number of messages
  non-blocking for PPE (and other devices)
    reading returns zeros
    writing overwrites last message
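The asymmetry between the two sides is easier to see in a small software model of the 4-entry inbound mailbox. This is only a sketch under stated assumptions: `in_mbox` and both functions are hypothetical, "overwrites last message" is modeled as replacing the most recently written entry, and the SPE stall is represented by a return value because a model cannot block the caller.

```c
#include <stdint.h>

#define IN_MBOX_DEPTH 4u /* inbound mailbox holds 4 messages */

/* Hypothetical model of one SPE inbound mailbox. */
struct in_mbox {
    uint32_t slot[IN_MBOX_DEPTH];
    unsigned count;
};

/* PPE-side write: never blocks. Returns 1 if a free slot was used,
 * 0 if the mailbox was full and the last message was overwritten. */
static int ppe_write_in_mbox(struct in_mbox *m, uint32_t value)
{
    if (m->count == IN_MBOX_DEPTH) {
        m->slot[IN_MBOX_DEPTH - 1] = value;
        return 0;
    }
    m->slot[m->count++] = value;
    return 1;
}

/* SPE-side read: the hardware would stall on an empty mailbox; the
 * model returns 0 in that case and 1 when a message was delivered. */
static int spe_read_in_mbox(struct in_mbox *m, uint32_t *value)
{
    if (m->count == 0)
        return 0;
    *value = m->slot[0];
    for (unsigned i = 1; i < m->count; ++i)
        m->slot[i - 1] = m->slot[i];
    --m->count;
    return 1;
}
```

Writing five messages without checking for free slots loses the fourth one in this model, which is the overrun problem that the PPE-side examples guard against by reading the status register first.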

20 Mailboxes – SPE
  send (stalling)
      wrch $SPU_WrOutMbox, $1
    or
      spu_writech(SPU_WrOutMbox, mb_value);
  send (active waiting)
    repeat:
      rchcnt $2, $SPU_WrOutMbox
      brz $2, repeat
      wrch $SPU_WrOutMbox, $1
    or
      do {
          /* Do other useful work while waiting. */
      } while (!spu_readchcnt(SPU_WrOutMbox));
      spu_writech(SPU_WrOutMbox, mb_value);

21 Mailboxes – SPE cont.
  read (stalling)
      rdch $1, $SPU_RdInMbox
    or
      mb_value = spu_readch(SPU_RdInMbox);
  read (active waiting)
    repeat:
      rchcnt $1, $SPU_RdInMbox
      brz $1, repeat
      rdch $2, $SPU_RdInMbox
    or
      do {
          /* Do other useful work while waiting. */
      } while (!spu_readchcnt(SPU_RdInMbox));
      mb_value = spu_readch(SPU_RdInMbox);

22 Mailboxes – PPE
  read SPE's outbound mailbox
    void *ps = get_ps();
    unsigned int mb_status;
    unsigned int new;
    unsigned int mb_value;
    do {
        mb_status = *((volatile unsigned int *)(ps + SPU_Mbox_Stat));
        new = mb_status & 0x000000FF;
    } while ( new == 0 );
    mb_value = *((volatile unsigned int *)(ps + SPU_Out_Mbox));

23 Mailboxes – PPE cont.
  writing to SPE's inbound mailbox
  problem of overrunning full mailbox
    // send four messages without overrunning the mailbox
    void *ps = get_ps();
    unsigned int j,k = 0;
    unsigned int mb_status;
    unsigned int slots;
    unsigned int mb_value[4] = {0x1, 0x2, 0x3, 0x4};
    do {
        /* Poll the Mailbox Status Register until the SPU_In_Mbox_Count
           field indicates there is at least one slot available in the
           SPU Read Inbound Mailbox. */
        do {
            mb_status = *((volatile unsigned int *)(ps + SPU_Mbox_Stat));
            slots = (mb_status & 0x0000FF00) >> 8;
        } while ( slots == 0 );
        for (j=0; j<slots && k < 4; j++) {
            *((volatile unsigned int *)(ps + SPU_In_Mbox)) = mb_value[k++];
        }
    } while ( k < 4 );
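Both PPE examples read the same status word and pick out a different byte. The two accessors below name those fields explicitly; the function names are hypothetical, and only the two masks that appear in the examples are covered.

```c
#include <stdint.h>

/* Number of messages waiting in the SPE's outbound mailbox
 * (low byte of the status word, mask 0x000000FF). */
static unsigned mbox_out_count(uint32_t mb_status)
{
    return mb_status & 0xFFu;
}

/* Number of free slots in the SPE's inbound mailbox
 * (second byte of the status word, mask 0x0000FF00). */
static unsigned mbox_in_free_slots(uint32_t mb_status)
{
    return (mb_status >> 8) & 0xFFu;
}
```

A status value of 0x00000401 would therefore mean one message ready to read and four free inbound slots.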

24 CELL
  SDK 3.1
    http://www.ibm.com/developerworks/power/cell/
  Cell BE Programming Handbook Including PowerXCell 8i
    http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1741C509C5F64B3300257460006FD68D
  SPE Runtime Management Library
    http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1DFEF31B3211112587257242007883F3
  PPU & SPU C/C++ Language Extension Specification
    http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E

25 libspe & libspe2
  low level APIs to access Cell from C/C++
  new threading model in libspe2
    use threading library of your choice and use libspe2 from there – no “SPE threads”
    create e.g. pthread thread and launch SPE code from that – call returns after SPE finishes

26 Compilation
  PPE object
    g++ [-m64] -c -Ox
  SPE object
    spu-gcc -Ox
    no -m64 -> LS addresses are always 32bit
    ppu-embedspu [-m64]
  link
    g++ [-m64] -lspe -lspe2

27 Referencing SPE code from PPE code
  extern spe_program_handle_t ;
  spe_program_load(spe_context,& );

28 Launching SPE code (libspe2)
  struct thread_data {
      spe_context_ptr_t context;
      program_data* pd;
  };

  void *ppu_pthread_function(void *arg)
  {
      thread_data td = *(thread_data *) arg;
      spe_context_ptr_t context = td.context;
      unsigned int entry = SPE_DEFAULT_ENTRY;
      /* blocks until the SPE program finishes */
      spe_context_run(context,&entry,0,td.pd,NULL,NULL);
      pthread_exit(NULL);
  }

  spe_context_ptr_t context;
  pthread_t pthread;
  thread_data td;
  context = spe_context_create(0,NULL);
  spe_program_load(context,&spe_prg);
  td.context = context; /* td.pd must also point to the program data */
  pthread_create(&pthread,NULL,&ppu_pthread_function,&td);
  pthread_join(pthread,NULL);
  spe_context_destroy(context);

29 SPE code
  #include <spu_mfcio.h>

  int main(unsigned long long spe_id,
           unsigned long long program_data_ea,
           unsigned long long env)
  {
      program_data pd __attribute__((aligned(16)));
      int tag_id = 1;
      /* fetch the program data from main memory into local storage */
      mfc_get(&pd, program_data_ea, sizeof(pd), tag_id, 0, 0);
      mfc_write_tag_mask(1<<tag_id);
      mfc_read_tag_status_any();
      …
  }

30 Program data
  structure shared by SPE and PPE code
  unsigned long long for 64bit pointers
    void* is 32bit on SPE and 32/64bit on PPE
  be careful with the alignment
    DMA cannot handle misaligned transfers
    size padded to 16byte
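A structure that follows these rules might look like the sketch below. The type name `program_data` matches the one used in the earlier slides, but the deck never shows its definition, so every field here is a hypothetical example. The alignment attribute is the GCC syntax already used in the SPE code, and the compile-time check uses C11.

```c
#include <stdint.h>

/* Hypothetical layout shared by PPE and SPE code. Pointers travel as
 * 64-bit integers so both sides agree on the size of every field. */
typedef struct {
    uint64_t input_ea;  /* effective address of the input buffer */
    uint64_t output_ea; /* effective address of the output buffer */
    uint32_t count;     /* number of elements to process */
    uint32_t pad[3];    /* explicit padding up to a multiple of 16 bytes */
} __attribute__((aligned(16))) program_data;

/* Fail the build if the size rule from the slide is ever broken. */
_Static_assert(sizeof(program_data) % 16 == 0,
               "program_data must be padded to a multiple of 16 bytes");

/* Store a PPE-side pointer in a field whose width does not depend on
 * whether the PPE code is built as 32-bit or 64-bit. */
static uint64_t ea_of(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}
```

Padding explicitly instead of relying on the compiler keeps `sizeof(program_data)` identical under both compilers, so the `mfc_get` of `sizeof(pd)` bytes on the SPE side copies exactly what the PPE wrote.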

31 DMA – SPE side
  (void) mfc_put(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag, uint32_t tid, uint32_t rid)
    initiate transfer from LS
    tag is number (e.g. 5)
  mfc_putb, mfc_putf

32 DMA – SPE side cont.
  (void) mfc_get(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag, uint32_t tid, uint32_t rid)
  mfc_getb, mfc_getf

33 DMA status – SPE side
  (void) mfc_write_tag_mask (uint32_t mask)
    tag mask (e.g. 1<<5)
  (uint32_t) mfc_read_tag_status_any(void)
    blocks until any of the specified tag groups has no outstanding operations
  (uint32_t) mfc_read_tag_status_all(void)
    blocks until all of the specified tag groups have no outstanding operations

34 Mailboxes – SPE side
  (uint32_t) spu_read_in_mbox(void)
  (uint32_t) spu_stat_in_mbox(void)
  (void) spu_write_out_mbox(uint32_t data)
  (uint32_t) spu_stat_out_mbox(void)

35 Mailboxes – PPE side
  int spe_out_mbox_read (spe_context_ptr_t spe, unsigned int *mbox_data, int count)
  int spe_out_mbox_status (spe_context_ptr_t spe)
  int spe_in_mbox_write (spe_context_ptr_t spe, unsigned int *mbox_data, int count, unsigned int behavior)
    SPE_MBOX_ALL_BLOCKING
      blocks until all are sent
    SPE_MBOX_ANY_BLOCKING
      blocks until at least one message is sent
    SPE_MBOX_ANY_NONBLOCKING
      sends as many as possible without blocking
  int spe_in_mbox_status (spe_context_ptr_t spe)

36 PPE direct access to SPE
  void* spe_ls_area_get (spe_context_ptr_t spe)
    less efficient than DMA
  int spe_ls_size_get (spe_context_ptr_t spe)
  void* spe_ps_area_get (spe_context_ptr_t spe, enum ps_area area)
    enum ps_area
      SPE_MFC_COMMAND_AREA -> MFC registers
      SPE_CONTROL_AREA -> mailboxes
    the get_ps function used in examples from the first part

