LYU0703 Parallel Distributed Programming on PS3
Huang Hiu Fung, Wong Chung Hoi
Supervised by Prof. Michael R. Lyu
Department of Computer Science and Engineering, CUHK
Final Year Project Presentation (1st term)
Agenda
- Background Information
- Architecture of PlayStation® 3
- Principles of Parallel Programming
- Optimization of the ADVISER program:
  1. Sequential Approach
  2. Parallel Approach
- Conclusion
- Future Works
- Q&A
Background Information
Limitations of single-core processors:
1. Memory access latency
2. Wire delays
3. Power consumption
Power Consumption
$P = C V^2 F$
- P = power
- C = capacitance
- V = voltage
- F = processor frequency (cycles per second)
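
The symbols above fit the commonly used dynamic-power model, P = C * V^2 * F. The following is a minimal sketch of why two slower cores can draw less power than one fast core. It assumes, purely for illustration, that halving the frequency lets the supply voltage drop to 0.8 of its original value; all numbers are normalized and hypothetical, not measurements.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power P = C * V^2 * F, in arbitrary units."""
    return capacitance * voltage ** 2 * frequency

# Hypothetical normalized values: one core at full voltage and frequency.
single_core = dynamic_power(1.0, 1.0, 1.0)

# Two cores at half the frequency each; the 0.8 voltage is an assumption.
dual_core = 2 * dynamic_power(1.0, 0.8, 0.5)

print(single_core, dual_core)
```

Under these assumed values the two-core configuration offers the same aggregate cycle rate at lower power, which is the argument made on the next slides.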
Development of Multi-Core Processor
Fig. 1.4 Growth of No. of Cores in Processors
Development of Multi-Core Processor
- Reduce power consumption
  - use multiple cores at low frequency instead of one core at high frequency
- Efficient processing of multiple tasks
  - divide the computation work
  - execute it among the cores concurrently
Project Objectives
- Show the need for parallel programming to optimize computation-intensive applications
- Study the features of parallel programming and compare the sequential and parallel approaches
- Optimize an application and demonstrate the improvement gained from parallel programming
Architecture of PlayStation® 3 (PS3)
- A multi-core machine produced by Sony, built around the Cell Broadband Engine
- Strong computation power
- Open platform for other applications and development
Cell Broadband Engine (Cell BE)
- PPE: Power Processor Element
- SPE: Synergistic Processor Element
- EIB: Element Interconnect Bus
Power Processor Element (PPE)
- Based on the 64-bit PowerPC architecture
- General-purpose operation
- Designed for control-intensive work
- Runs the OS, which controls I/O to main memory and other devices
- Controls all 8 SPEs
Fig. 2.5 Design of PPE
Synergistic Processor Element (SPE)
- Designed to provide computation performance
- SPU (Synergistic Processor Unit): performs the allocated task
- LS (Local Store): the only memory the SPU accesses directly
- MFC (Memory Flow Controller): controls data transfer
- 8 SPEs in total in the Cell
  - Only 6 accessible
  - 1 reserved for system software
  - 1 disabled
Fig. 2.6 Design of an SPE
Element Interconnect Bus (EIB)
- Internal communication bus inside the Cell
- Connects the different elements: PPE, SPEs, memory controller
Fig. 2.7 Data Flow and Program Control
Principles of Parallel Programming

Parallel algorithm                          | Serial algorithm
--------------------------------------------|-----------------------------
multiple processing units                   | single processing unit
communication overhead                      | no communication overhead
higher code complexity                      | straightforward code
must balance load between processing units  | everything is done by the CPU
Concept of Load Balance
- Distribute data evenly
- Total runtime depends on the busiest processing element
- Idle processing elements waste computation time
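
The load-balance idea can be sketched with a small model in which every work item costs the same time. The function names and the item counts below are hypothetical and chosen only to illustrate the point that the busiest processing element sets the total runtime.

```python
def total_runtime(work_per_element, seconds_per_item=1.0):
    """Parallel runtime is set by the busiest processing element."""
    return max(work_per_element) * seconds_per_item

def idle_time(work_per_element, seconds_per_item=1.0):
    """Time each element spends waiting for the busiest one to finish."""
    finish = total_runtime(work_per_element, seconds_per_item)
    return [finish - w * seconds_per_item for w in work_per_element]

balanced = [100] * 6                      # 600 items spread evenly
unbalanced = [300, 60, 60, 60, 60, 60]    # same 600 items, one overloaded

print(total_runtime(balanced), total_runtime(unbalanced))
print(idle_time(unbalanced))
```

Both distributions do the same amount of work, but in this model the unbalanced one takes three times as long while five elements sit idle for most of the run.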
Methods of Parallelism
- Data parallelism
- Task parallelism
Parallel Architecture
Flynn's taxonomy

              | Single Instruction | Multiple Instruction
--------------|--------------------|---------------------
Single Data   | SISD               | MISD
Multiple Data | SIMD               | MIMD
SISD
- Traditional computer
- von Neumann model
SIMD
- Same instruction on all data
- Data parallelism
- SIMD intrinsic functions
MISD
- No well-known system
- Mentioned for completeness
MIMD
- Different instructions on different data
- Task parallelism
- Further broken down into
  – Shared Memory System
  – Distributed Memory System
Shared Memory System
- Access to central memory for data
- PS3: achieved by the MFC issuing DMA commands
Distributed Memory System
- Each PE has its own memory
- PS3: each SPE has a 256 KB Local Store
- The PS3 is a hybrid shared-distributed memory system
ADVISER
Comparing 2 video clips:
1. Generating meaningful data (in the form of numbers) for the frames of each video
2. Comparing and looking for the most similar frames
3. Locating the similar segment, which consists of a series of very similar frames
Input
- 2 folders: “Repository” and “Target”
- hl3 file = vector of 1024 double-precision values

Input                | No. of hl3 files
---------------------|-----------------
Target directory     | 5473
Repository directory | 7547
Processing
- hl3 file = vector of 1024 double-precision values
- Similarity is a difference value computed between File P and File Q
- The smaller the value, the better the match
Output
- M “Target” files, N “Repository” files: O(M * N) comparisons
- Computation time = 633 sec
- Flash demo

Target file  | Best-matching repository file | Difference value
-------------|-------------------------------|-----------------
target hl3 1 | repository A                  | ??
target hl3 2 | repository B                  | ??
target hl3 3 | repository C                  | ??
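
The O(M * N) search can be sketched as a nested loop that keeps the closest repository vector for each target. This is a hedged illustration, not the ADVISER code: the sum-of-squared-differences metric is assumed from the inner-loop pseudocode later in the deck, the function names are hypothetical, and the tiny vectors stand in for 1024-value hl3 files.

```python
def difference(p, q):
    """Sum of squared element-wise differences (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def best_matches(targets, repository):
    """For each target, return (index of closest repository vector, difference)."""
    results = []
    for t in targets:                        # M iterations
        best_index, best_diff = None, float("inf")
        for j, r in enumerate(repository):   # N iterations per target
            d = difference(t, r)
            if d < best_diff:                # smaller difference is a better match
                best_index, best_diff = j, d
        results.append((best_index, best_diff))
    return results

# Made-up 2-value vectors standing in for hl3 files.
targets = [[0.0, 1.0], [5.0, 5.0]]
repository = [[4.0, 6.0], [0.0, 2.0], [9.0, 9.0]]
print(best_matches(targets, repository))
```

Every target is compared against every repository entry, which is why the running time grows with the product of the two directory sizes.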
Parallel Version
- Data parallelism
- Split the data evenly among 6 SPEs
- Computation time for 6 SPEs = 330 sec
- Flash demo
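
An even split can be sketched as contiguous index ranges whose sizes differ by at most one. The slides do not say which directory is partitioned, so splitting the 5473 target files is an assumption here, and `split_evenly` is a hypothetical helper name.

```python
def split_evenly(n_items, n_workers):
    """Return (start, end) index ranges whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)   # first `extra` workers get one more
        ranges.append((start, start + size))
        start += size
    return ranges

# Assumption: the target files are the ones divided among the 6 SPEs.
for spe, (start, end) in enumerate(split_evenly(5473, 6)):
    print(f"SPE {spe}: files {start}..{end - 1} ({end - start} files)")
```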
Parallel Version
- Expected speed-up: 6X
- Actual speed-up: 2X (330 sec on 6 SPEs vs. 633 sec on the PC)
- The PC CPU, the PPU, and the SPEs all run at different speeds
- Computation time with PC CPU = 633 sec
- Computation time with 1 SPE = 1928 sec
- Computation time with PPU = 3119 sec
- Speed: CPU > SPE > PPU
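
The gap between the expected and the actual speed-up depends on the baseline. A short calculation using only the timings reported on these slides (the `speedup` helper name is hypothetical):

```python
def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds

pc, one_spe, six_spes = 633, 1928, 330   # seconds, from the slides

print(round(speedup(pc, six_spes), 2))       # 6 SPEs relative to the PC
print(round(speedup(one_spe, six_spes), 2))  # 6 SPEs relative to 1 SPE
```

Measured against a single SPE, six SPEs come close to the expected 6X; the 2X figure arises because a single SPE running this unoptimized code is about three times slower than the PC.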
Time Attack
1. SIMD intrinsic functions
2. Changing data type
3. Double buffering
4. Parallel read
5. Distributing jobs to the idle PPE
6. SIMD on loop counter
7. Loop unrolling
SIMD Intrinsic Functions
- Addition, subtraction, multiplication, etc.
- Operate on 128-bit registers
- Data type: double (64 bits)
- Speed-up: 2X
Changing Data Type to int
- Precision is not important
- Major speed-up comes from the SIMD intrinsics
- Data type: int (32 bits)
- Total speed-up: 4X
- Computation time = 71 sec
Changing Data Type to float
- The SPE is designed for high-precision computation
- No intrinsics for the int data type at all
- Data type: float (32 bits)
- Saves data conversion time
- Speed-up of 30%
- Computation time = 49 sec
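
The 2X and 4X figures on the SIMD and data-type slides match the number of elements that fit in one 128-bit register. A minimal sketch of that arithmetic, where `lanes` is a hypothetical helper name:

```python
REGISTER_BITS = 128  # SIMD register width stated on the slides

def lanes(element_bits):
    """Number of elements of the given width packed into one register."""
    return REGISTER_BITS // element_bits

for name, bits in [("double", 64), ("int", 32), ("float", 32)]:
    print(name, lanes(bits))
```

A single SIMD instruction therefore handles two doubles or four 32-bit values at once, which bounds the ideal gain from vectorization alone.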
Double Buffering
- Saves communication time by overlapping MFC data transfer with SPU computation
- 2 buffers
  – Prefetching
  – Processing
- The program is not communication-heavy
- Minor speed-up
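
Double buffering can be sketched in general terms as fetching chunk k+1 while chunk k is being processed. The Python below is only an analogy for the MFC/SPU overlap: a background thread plays the role of the DMA transfer, and `fetch`, `process`, and `process_with_double_buffering` are hypothetical names, not Cell SDK calls.

```python
from concurrent.futures import ThreadPoolExecutor

def process_with_double_buffering(chunk_ids, fetch, process):
    """Overlap fetching chunk k+1 with processing chunk k."""
    results = []
    if not chunk_ids:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, chunk_ids[0])    # fill the first buffer
        for next_id in chunk_ids[1:]:
            current = pending.result()                # wait for the transfer in flight
            pending = pool.submit(fetch, next_id)     # prefetch into the other buffer
            results.append(process(current))          # compute while the transfer runs
        results.append(process(pending.result()))     # last chunk has nothing to prefetch
    return results

# Hypothetical stand-ins for a data transfer and the computation.
fetch = lambda i: [i, i + 1]
process = sum
print(process_with_double_buffering([0, 10, 20], fetch, process))
```

The benefit is bounded by the transfer time that can be hidden, which is consistent with the minor speed-up reported for a program that is not communication-heavy.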
Parallel Reading for All Files (Failed Attempt)
- Read “Target” and “Repository” concurrently
- Share the file-reading job among the SPEs
- Did not improve as predicted; it was even slower
- Reason: the hard disk cannot handle concurrent requests
Distributing Jobs to the Idle PPE
- Current PPE job: read files, distribute files, collect results
- Use the stall time to do some computation
- The PPE has relatively low computation power
- No significant improvement
- Increases program complexity
- This approach was abandoned
Applying SIMD for Loop Counter
Most of the computation time is spent in this loop:
1. Initialize i = 0, diff = (0, 0, 0, 0).
2. For i < (number of floats in a file) / (number of floats packed in a register):
   A. temp = SIMD subtraction on vector i in the “Target” and “Repository” files.
   B. diff = SIMD addition (SIMD multiplication (temp, temp), diff).
3. i = i + 1.
4. Loop back to 2.
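
The loop above can be emulated in plain Python by treating a 4-element list as one 128-bit register of floats. This is a sketch for illustration only: `simd_sub`, `simd_mul`, and `simd_add` are hypothetical stand-ins for the SPU intrinsics, and the final horizontal sum of the four lanes is an assumption, since the slide does not show how `diff` is reduced to one value.

```python
LANES = 4  # 32-bit floats per 128-bit register

def simd_sub(a, b):
    return [x - y for x, y in zip(a, b)]

def simd_mul(a, b):
    return [x * y for x, y in zip(a, b)]

def simd_add(a, b):
    return [x + y for x, y in zip(a, b)]

def squared_difference(target, repository):
    """Accumulate squared differences one 4-lane vector at a time."""
    diff = [0.0] * LANES
    for i in range(len(target) // LANES):
        t = target[i * LANES:(i + 1) * LANES]       # vector i of the target file
        r = repository[i * LANES:(i + 1) * LANES]   # vector i of the repository file
        temp = simd_sub(t, r)
        diff = simd_add(simd_mul(temp, temp), diff)
    return sum(diff)  # assumed final step: add the four lanes together

target = [float(x) for x in range(1, 9)]
repository = [0.0] * 8
print(squared_difference(target, repository))
```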
Applying SIMD for Loop Counter
- Try to optimize step 3 (incrementing the loop counter)
- Apply SIMD to the loop counter
- Loop-counter additions and comparisons are reduced by a factor of 8
Applying SIMD for Loop Counter
1. Initialize i = (0, 1, 2, 3, 4, 5, 6, 7), diff = (0, 0, 0, 0).
2. For i[0] < (number of floats in a file) / (number of floats packed in a register):
   temp = SIMD subtraction on vector i[0] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[1] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[2] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[3] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[4] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[5] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[6] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[7] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
3. i = SIMD addition (i, (8, 8, 8, 8, 8, 8, 8, 8)).
4. Loop back to 2.
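
The effect on loop overhead can be sketched by counting loop-condition checks in a rolled loop and in one unrolled by 8. The code is scalar Python rather than SPU code, the function names are hypothetical, and the unrolled version assumes the input length is a multiple of 8 (true for 1024-value files).

```python
def sum_squares_rolled(values):
    """One element per iteration; returns (total, number of loop checks)."""
    total, i, checks = 0.0, 0, 0
    while i < len(values):
        checks += 1
        total += values[i] * values[i]
        i += 1
    return total, checks

def sum_squares_unrolled8(values):
    """Eight elements per iteration; assumes len(values) is a multiple of 8."""
    total, i, checks = 0.0, 0, 0
    while i < len(values):
        checks += 1
        total += values[i] * values[i]
        total += values[i + 1] * values[i + 1]
        total += values[i + 2] * values[i + 2]
        total += values[i + 3] * values[i + 3]
        total += values[i + 4] * values[i + 4]
        total += values[i + 5] * values[i + 5]
        total += values[i + 6] * values[i + 6]
        total += values[i + 7] * values[i + 7]
        i += 8   # one counter update covers eight elements
    return total, checks

values = [1.0] * 1024
print(sum_squares_rolled(values), sum_squares_unrolled8(values))
```

Both versions compute the same total, but the unrolled loop performs one eighth of the counter updates and comparisons.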
Results of the Parallel PS3 Version with SIMD, Float Input, and SIMD Loop Counter
[Table: No. of SPUs used | Read input time (sec) | Total elapsed time (sec) | Net elapsed time (sec)]
Results of the Parallel PS3 Version with SIMD, Float Input, and SIMD Loop Counter
- Little improvement (about 4%)
- Shows that further loop unrolling could give faster performance
- The best performance becomes 47 sec
Loop Unrolling
- The SIMD loop counter showed that optimizing the loop can improve performance
- Unroll the loop completely
- More noticeable speed-up
Results of the Parallel PS3 Version with SIMD, Float Input, and Loop Unrolling
[Table: No. of SPUs used | Read input time (sec) | Total elapsed time (sec) | Net elapsed time (sec)]
Results of the Parallel PS3 Version with SIMD, Float Input, and Loop Unrolling
- 45% faster
- The ultimate best performance becomes 27 sec
Conclusion of Optimization
- PC version: 633 sec
- PS3 with 1 SPU (i.e. sequential version on PS3): 1928 sec
- Final optimized version on PS3: 27 sec
- 23 times faster than the PC version
- 71 times faster than the sequential version on PS3
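
The stage-by-stage timings reported in this deck can be collected into one table to check the final ratios. The stage labels are paraphrased from the slide titles, `speedup_table` is a hypothetical helper, and the PC baseline of 633 sec is the value given on the Output slide.

```python
PC_SECONDS = 633

# (stage, computation time in seconds), as reported on the slides.
stages = [
    ("1 SPE, sequential", 1928),
    ("6 SPEs, data parallel", 330),
    ("SIMD with int", 71),
    ("SIMD with float", 49),
    ("SIMD loop counter", 47),
    ("full loop unrolling", 27),
]

def speedup_table(stages, pc_seconds):
    """Return (stage, seconds, speed-up vs. first stage, speed-up vs. PC)."""
    first = stages[0][1]
    return [(name, secs, first / secs, pc_seconds / secs) for name, secs in stages]

for name, secs, vs_spe, vs_pc in speedup_table(stages, PC_SECONDS):
    print(f"{name:24s} {secs:5d} s  {vs_spe:5.1f}x vs 1 SPE  {vs_pc:5.1f}x vs PC")
```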
Future Works
- Port the whole ADVISER application to PlayStation® 3
- Optimize the whole application
Q&A
The End