
Toolkits version 1.0, Special Course on Computer Architectures 2010


1 Toolkits version 1.0. Special Course on Computer Architectures 2010

2 Contents
Introduction to the toolkit used for the contest.
What is the "Longest Common Subsequence (LCS)"?
How to use toolkit ver1.0.
Towards the fastest program.

3 Challenge
Compute the Longest Common Subsequence (LCS) of two given sequences of letters, A and B. Compute as many sequences as possible within a given time limit.

4 What is the LCS (1/4)
Longest Common Subsequence (LCS).
A subsequence is a sequence consisting of letters taken from the original sequence. Example: X =,,,, etc. The letters need not be contiguous in the original sequence, but they must keep their relative order.
A common subsequence is a sequence that is a subsequence of both of two sequences. Example: X =, Y = The longest common subsequence is, the length is 3.
(See) http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

5 How to get the LCS (2/4)
How is it computed? Let X and Y be two sequences. The LCS of the first i letters of X and the first j letters of Y, written LCS(i, j), can be computed from smaller LCS values. That is, LCS(i, j) is computed from the following: LCS(i-1, j), LCS(i, j-1), and LCS(i-1, j-1).

6 How to get the LCS (3/4)
When the last letters are the same (X_i = Y_j): LCS(i, j) is LCS(i-1, j-1) + 1.
When the last letters are not the same (X_i ≠ Y_j): LCS(i, j) is the larger of LCS(i-1, j) and LCS(i, j-1).

7 How to get the LCS (4/4)
Dynamic Programming (DP). X =, Y =

        A  B  A
     0  0  0  0
  A  0  1  1  1
  B  0  1  2  2
  C  0  1  2  2
  A  0  1  2  3   <- LCS!!

With the table above, the algorithm shown in the previous slide is:
  max(Up, Left, Left-Up + ((Xi == Yj) ? 1 : 0))
Starting from the upper-left cell, all entries in the table can be filled sequentially.
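
The table-filling rule above can be sketched in C as follows. This is only an illustrative sketch and not the toolkit's lcs.c: the function name lcs_length and its signature are assumptions made for this example. It keeps just two rows of the table, since each cell needs only Up, Left, and Left-Up.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the toolkit): length of the LCS of
 * x[0..n) and y[0..m), using two rows of the DP table. */
int lcs_length(const char *x, size_t n, const char *y, size_t m)
{
    /* prev is row i-1, cur is row i. Row 0 and column 0 are all zeros. */
    int *prev = calloc(m + 1, sizeof *prev);
    int *cur  = calloc(m + 1, sizeof *cur);
    assert(prev != NULL && cur != NULL);

    for (size_t i = 1; i <= n; i++) {
        cur[0] = 0;
        for (size_t j = 1; j <= m; j++) {
            int up   = prev[j];
            int left = cur[j - 1];
            int diag = prev[j - 1] + ((x[i - 1] == y[j - 1]) ? 1 : 0);
            int best = (up > left) ? up : left;
            cur[j] = (best > diag) ? best : diag;
        }
        /* The row just filled becomes the "previous" row. */
        int *tmp = prev;
        prev = cur;
        cur = tmp;
    }

    int result = prev[m];   /* lower-right cell of the table */
    free(prev);
    free(cur);
    return result;
}
```

For the letters labelling the table on this slide, lcs_length("ABCA", 4, "ABA", 3) gives 3, which matches the lower-right cell.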

8 Approach
Implement ppe.c for the PPE and spe.c for the SPE to compute the LCS.
– The following programs must not be changed: the PPE programs main_ppe.c and define.h.

9 The steps in toolkit ver1.0
Example: Compute the code distance with multiple SPEs. All files except main_ppe.c and define.h can be modified.
– Each SPE works on a block of 128 letters.

10 toolkit ver1.0
ppe.c: PPE source code
spe.c: SPE source code
main_ppe.c: Modification forbidden
spe.h
define.h: Modification forbidden
Makefile
getrndstr.c: Generates a random sequence of letters.
lcs.c: The sequential LCS (for verification of the result)
ans.txt: The answers to the sample problems.
rep/: Contains the files for the sample problems.

11 How to use toolkit ver1.0 (1/3)
Specify two files as arguments, and the program computes the LCS of the sequences in the files. In its initial state, the toolkit uses multiple SPEs.
Limitation: the length of each sequence must be a multiple of 128.
Example files for various data sizes are provided. Use getrndstr.c to generate random sequences of arbitrary size.
$ gcc -O3 -o getrndstr getrndstr.c
$ ./getrndstr 128 13 > file9999
This generates file9999, containing a sequence of 128 letters, with random seed 13.

12 How to use toolkit ver1.0 (2/3)
After unpacking toolkit1.0.tgz, run make to compile.
To start an example program: make run{number} (number from 1 to 5).
The output shows the problem number, the lengths of A and B, the execution time, and the length of the LCS.

13 How to use toolkit ver1.0 (3/3)
Verify the results using lcs.c. (Note that the results of the examples executed by make run* are in ans.txt. Use lcs.c in the other cases.)

14 Summary of limitations
The size is a multiple of 128 (char type).
The two given sequences are called Sequence A and Sequence B.
The code is based on libspe. Processing may also be done on the PPE.
At most 7 SPEs can be used in parallel.
The memory on the PPE can be used freely.

15 Hints
Divide the sequences into sub-blocks, so that the whole computation can be divided as well. Processing the sub-blocks in parallel on the SPEs can improve performance. To which part can parallel processing be applied?

16 Parallel Processing (1/3)
Data dependency: to compute the next element, three elements (Left, Up, and Left-Up) must already be fixed.

17 Parallel Processing (2/3)
If the blue part is fixed, the pink part can be computed in parallel. The same method can be applied to blocks instead of elements.
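
One common way to exploit this dependency is to walk the table along anti-diagonals, on the assumption that the pink part in the figure corresponds to such a diagonal. The sketch below is sequential and only illustrates the ordering; the function name lcs_wavefront is hypothetical. All cells with the same i + j depend only on cells from earlier diagonals, so the inner loop iterations are independent of each other.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch: fill the (n+1) x (m+1) LCS table one
 * anti-diagonal at a time. Cells on the same diagonal (same i + j)
 * never read each other, so they could be computed in parallel. */
int lcs_wavefront(const char *x, int n, const char *y, int m)
{
    int w = m + 1;                       /* row width of the flat table */
    int *t = calloc((size_t)(n + 1) * (size_t)w, sizeof *t);
    assert(t != NULL);

    for (int d = 2; d <= n + m; d++) {   /* d = i + j */
        int i_lo = (d - m > 1) ? d - m : 1;
        int i_hi = (d - 1 < n) ? d - 1 : n;
        /* Independent iterations: only diagonals d-1 and d-2 are read. */
        for (int i = i_lo; i <= i_hi; i++) {
            int j    = d - i;
            int up   = t[(i - 1) * w + j];
            int left = t[i * w + (j - 1)];
            int diag = t[(i - 1) * w + (j - 1)] + ((x[i - 1] == y[j - 1]) ? 1 : 0);
            int best = (up > left) ? up : left;
            t[i * w + j] = (best > diag) ? best : diag;
        }
    }

    int result = t[n * w + m];
    free(t);
    return result;
}
```

The result is the same as with row-by-row filling; only the order of evaluation changes. The same ordering applies when each cell is replaced by a 128-letter block.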

18 Parallel Processing (3/3)
To compute the pink block, the following are needed:
1. the lower-right-most element of the upper-left block,
2. the lowest row of the upper block, and
3. the right-most column of the left block.

19 Input/Output of block computation
Input:
The lower-right-most element of the upper-left block.
The lowest row of the upper block.
The right-most column of the left block.
Output:
The lower-right-most element of the computed score table.
The lowest row.
The right-most column.
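
The block interface listed above could look like the following sketch. The names lcs_block and lcs_by_blocks, the in-place reuse of the boundary arrays, and the variable block size bs are all assumptions made for illustration (the toolkit uses blocks of 128 letters, and its actual spe.c may be organized differently). The driver visits blocks sequentially in row-major order just to show that the three inputs are sufficient.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical block kernel. On entry: corner is the lower-right element
 * of the upper-left block, top[] is the lowest row of the upper block,
 * left[] is the right-most column of the left block. On exit: top[] holds
 * this block's lowest row, left[] its right-most column, and the return
 * value is its lower-right element. */
int lcs_block(const char *a, const char *b, int bs,
              int corner, int *top, int *left)
{
    int diag_start = corner;
    for (int i = 0; i < bs; i++) {
        int diag = diag_start;   /* element Left-Up of the first cell */
        int lft  = left[i];      /* element Left of the first cell    */
        diag_start = lft;        /* becomes Left-Up for the next row  */
        for (int j = 0; j < bs; j++) {
            int up   = top[j];
            int cand = diag + ((a[i] == b[j]) ? 1 : 0);
            int best = (up > lft) ? up : lft;
            if (cand > best) best = cand;
            diag = up;           /* Up is Left-Up of the next cell    */
            lft  = best;
            top[j] = best;
        }
        left[i] = lft;           /* right-most element of this row    */
    }
    return top[bs - 1];
}

/* Hypothetical sequential driver: len must be a multiple of bs. */
int lcs_by_blocks(const char *a, const char *b, int len, int bs)
{
    assert(bs > 0 && len > 0 && len % bs == 0);
    int nb = len / bs;
    int *row_edge = calloc((size_t)len, sizeof *row_edge);
    int *col_edge = calloc((size_t)bs, sizeof *col_edge);
    assert(row_edge != NULL && col_edge != NULL);

    for (int bi = 0; bi < nb; bi++) {
        memset(col_edge, 0, (size_t)bs * sizeof *col_edge);
        int corner = 0;                      /* table border is zero */
        for (int bj = 0; bj < nb; bj++) {
            int *top = row_edge + bj * bs;
            int next_corner = top[bs - 1];   /* save before overwrite */
            lcs_block(a + bi * bs, b + bj * bs, bs, corner, top, col_edge);
            corner = next_corner;
        }
    }
    int result = row_edge[len - 1];
    free(row_edge);
    free(col_edge);
    return result;
}
```

With this interface a block never needs the interior of its neighbours, so only one row, one column, and one element have to be transferred per block.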

20 Control of SPE subroutine
Make a queue on the PPE to manage the jobs. The PPE adds jobs at the tail of the job queue and takes them from the head; it starts a job on an SPE, and the SPE informs the PPE when the job ends.
Process on PPE:
Based on the block number that has just been computed, add the block numbers that can now start computation. The candidates are the right and lower blocks.
Read a block number from the queue and assign it to a free SPE.
Continue until the lower-right-most block is computed.
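
The queue control described above can be sketched as follows. This is a single-threaded simulation with hypothetical names (JobQueue, schedule_blocks); where a comment says a free SPE would run the block, the real ppe.c would start the SPE and wait for its completion message. A block is added to the queue once both its upper and left neighbours are finished.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical ring-buffer job queue holding block numbers. */
typedef struct {
    int *buf;
    int cap, head, tail, count;
} JobQueue;

static void queue_push(JobQueue *q, int job)
{
    assert(q->count < q->cap);
    q->buf[q->tail] = job;
    q->tail = (q->tail + 1) % q->cap;
    q->count++;
}

static int queue_pop(JobQueue *q)
{
    assert(q->count > 0);
    int job = q->buf[q->head];
    q->head = (q->head + 1) % q->cap;
    q->count--;
    return job;
}

/* Simulate scheduling an nb x nb grid of blocks. The order in which
 * blocks are started is written to order[] (nb * nb entries).
 * Returns the number of blocks processed. */
int schedule_blocks(int nb, int *order)
{
    assert(nb > 0);
    int total = nb * nb;
    int *pending = malloc((size_t)total * sizeof *pending);
    JobQueue q = { malloc((size_t)total * sizeof(int)), total, 0, 0, 0 };
    assert(pending != NULL && q.buf != NULL);

    /* pending[k] = number of unfinished neighbours (upper, left). */
    for (int r = 0; r < nb; r++)
        for (int c = 0; c < nb; c++)
            pending[r * nb + c] = (r > 0) + (c > 0);

    int done = 0;
    queue_push(&q, 0);                   /* upper-left block is ready */
    while (q.count > 0) {
        int job = queue_pop(&q);
        order[done++] = job;             /* a free SPE would run this */
        int r = job / nb, c = job % nb;
        /* Candidates that may become ready: right and lower blocks. */
        if (c + 1 < nb && --pending[job + 1] == 0)
            queue_push(&q, job + 1);
        if (r + 1 < nb && --pending[job + nb] == 0)
            queue_push(&q, job + nb);
    }

    free(pending);
    free(q.buf);
    return done;
}
```

Because every block is pushed exactly once, a queue with room for nb * nb entries cannot overflow in this sketch.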

21 Subroutines for DMA (FYI)
Functions for data transfer:
dmaget, dmaput: DMA read/write functions supported by the toolkit.
dmaget((void *)spe_addr, ppe_addr, X);
Reads X bytes of data starting at ppe_addr in the PPE main memory and stores them starting at spe_addr in the SPE LocalStore. Both addresses are 128-byte aligned.
dmaput transfers data in the opposite direction.

22 Towards the fastest program
Improve spe.c, which fills the table.
Improve ppe.c, which controls the blocks for computation.
For parallel processing: use the SIMD instructions of the SPE, where one operation can process multiple elements. In any case, compute as many elements as possible with each instruction.
Loop unrolling, __builtin_expect, and double buffering are useful techniques to try.
Good Luck!
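
Two of the techniques named above, loop unrolling and __builtin_expect, can be sketched on the scalar inner loop as follows. This is an assumption-laden illustration, not tuned SPE code: the names lcs_row and lcs_length_unrolled are hypothetical, the unroll factor of 2 is arbitrary, and the hint assumes that matching letters are rarer than mismatches. __builtin_expect is a GCC/Clang builtin; whether it helps must be measured.

```c
#include <assert.h>
#include <stdlib.h>

/* GCC/Clang builtin: hint that the condition is usually false. */
#define UNLIKELY(c) __builtin_expect(!!(c), 0)

static int max2(int a, int b) { return (a > b) ? a : b; }

/* Fill row cur[0..m] from row prev[0..m] for one letter xi of X.
 * The main loop is unrolled by two; a tail loop handles odd m. */
static void lcs_row(char xi, const char *y,
                    const int *prev, int *cur, size_t m)
{
    size_t j = 1;
    cur[0] = 0;
    for (; j + 1 <= m; j += 2) {
        int d0 = prev[j - 1];            /* Left-Up of cell j     */
        int d1 = prev[j];                /* Left-Up of cell j + 1 */
        if (UNLIKELY(xi == y[j - 1])) d0++;
        if (UNLIKELY(xi == y[j]))     d1++;
        cur[j]     = max2(max2(prev[j],     cur[j - 1]), d0);
        cur[j + 1] = max2(max2(prev[j + 1], cur[j]),     d1);
    }
    for (; j <= m; j++) {
        int d = prev[j - 1];
        if (UNLIKELY(xi == y[j - 1])) d++;
        cur[j] = max2(max2(prev[j], cur[j - 1]), d);
    }
}

/* Hypothetical driver: LCS length using the unrolled row kernel. */
int lcs_length_unrolled(const char *x, size_t n, const char *y, size_t m)
{
    int *prev = calloc(m + 1, sizeof *prev);
    int *cur  = calloc(m + 1, sizeof *cur);
    assert(prev != NULL && cur != NULL);

    for (size_t i = 0; i < n; i++) {
        lcs_row(x[i], y, prev, cur, m);
        int *tmp = prev;
        prev = cur;
        cur = tmp;
    }

    int result = prev[m];
    free(prev);
    free(cur);
    return result;
}
```

Note that cur[j + 1] still depends on cur[j], so unrolling along a row only reduces loop overhead; processing several independent cells per instruction with SIMD requires an ordering such as the anti-diagonal one.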

