Toolkits version 1.0 Special Course on Computer Architectures 2010

Toolkits version 1.0 Special Course on Computer Architectures

Contents
– Introduction of the toolkits used for the contest
– What is the "Longest Common Subsequence" (LCS)?
– How to use toolkit ver1.0
– Towards the fastest program

Challenge
Compute the Longest Common Subsequence (LCS) of two given sequences of letters, A and B.
Compute as many sequences as possible within a given time limit.

What is the LCS? (1/4)
Longest Common Subsequence (LCS)
A subsequence is a sequence consisting of letters taken from a sequence (example: X = …; its subsequences include …, …, etc.). The letters need not be contiguous, but they must keep their order in the original sequence.
A common subsequence is a subsequence of both of two sequences.
Example: for X = … and Y = …, the longest common subsequence is …, and its length is 3.

How to get the LCS (2/4)
How is it computed? Let X and Y be the two sequences. The LCS of the first i letters of X and the first j letters of Y, written LCS(i, j), can be computed from smaller subproblems. That is, LCS(i, j) is computed from the following:
– LCS(i-1, j)
– LCS(i, j-1)
– LCS(i-1, j-1)

How to get the LCS (3/4)
When the last letters are the same (Xi = Yj):
LCS(i, j) = LCS(i-1, j-1) + 1
When the last letters are different (Xi ≠ Yj):
LCS(i, j) is the larger of LCS(i-1, j) and LCS(i, j-1)
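
Putting the two cases together with the empty-prefix boundary (which the slides leave implicit) gives the standard LCS recurrence:

LCS(i, j) = 0                                  if i = 0 or j = 0
LCS(i, j) = LCS(i-1, j-1) + 1                  if Xi = Yj
LCS(i, j) = max(LCS(i-1, j), LCS(i, j-1))      if Xi ≠ Yj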

How to get the LCS (4/4)
Dynamic Programming (DP)
[Figure: DP score table for two example sequences X and Y; the lower-right cell, marked "LCS!!", holds the LCS length, 7 in this example.]
Given such a table, the algorithm shown in the previous slide becomes, for each cell:
max( Up, Left, Left-Up + ((Xi == Yj) ? 1 : 0) )
Starting from the leftmost cell, all entries in the table can be filled sequentially.
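
As a concrete illustration of this sequential fill, here is a minimal sketch in plain C. It is not the toolkit's lcs.c; the function name and the table layout are made up for this example.

#include <stdlib.h>

/* Fill an (n+1) x (m+1) score table for sequences x (length n) and y (length m)
 * and return the LCS length, i.e. the value in the lower-right cell.
 * Row 0 and column 0 are the boundary and stay 0 (calloc zero-fills them). */
int lcs_length(const char *x, int n, const char *y, int m)
{
    int *t = calloc((size_t)(n + 1) * (m + 1), sizeof(int));
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int up   = t[(i - 1) * (m + 1) + j];
            int left = t[i * (m + 1) + (j - 1)];
            int diag = t[(i - 1) * (m + 1) + (j - 1)] + (x[i - 1] == y[j - 1] ? 1 : 0);
            int best = up > left ? up : left;
            t[i * (m + 1) + j] = best > diag ? best : diag;
        }
    }
    int len = t[n * (m + 1) + m];
    free(t);
    return len;
}

For instance, with the textbook sequences "ABCBDAB" and "BDCABA" (not the slide's own example), lcs_length returns 4.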

Approach
Implement ppe.c for the PPE and spe.c for the SPE to compute the LCS.
– The following programs must not be changed: the PPE programs main_ppe.c and define.h.

The steps in toolkit ver1.0
Example: compute the code distance with multiple SPEs. Files other than main_ppe.c and define.h can be modified.
– Each SPE computes based on a block of 128 letters.

toolkit ver1.0
– ppe.c: PPE source code
– spe.c: SPE source code
– main_ppe.c: modification forbidden
– spe.h
– define.h: modification forbidden
– Makefile
– getrndstr.c: generates a random sequence of letters
– lcs.c: the sequential LCS (for verification of the result)
– ans.txt: the answer to the sample problem
– rep/: files for the sample problem

How to use toolkit ver1.0 (1/3)
Specify two files as arguments; the LCS of the sequences in the files is computed. Multiple SPEs are used, set up in the initialization stage.
Limitation: the length of each sequence is a multiple of 128.
Example files of various data sizes are provided. Use getrndstr.c to generate random sequences of arbitrary size:
$ gcc -O3 -o getrndstr getrndstr.c
$ ./getrndstr > file9999
This generates file9999 containing a sequence of 128 letters with random seed 13.

How to use toolkit ver1.0 (2/3)
After extracting toolkit1.0.tgz, compile with make.
To run an example program: make run{number} (number from 1 to 5)
[Screenshot: the output shows the problem number, the lengths of A and B, the execution time, and the length of the LCS.]

How to use toolkit ver1.0 (3/3)
Verify the results using lcs.c. (Note that the answers for the examples run via make run* are already in ans.txt; use lcs.c for the other cases.)

Summary of limitations
– The size is a multiple of 128 (char type).
– The two given sequences are called Sequence A and Sequence B.
– The code is based on libspe. The program can also run on the PPE.
– At most 7 SPEs can be used in parallel.
– The memory on the PPE can be used freely.

Hints
Divide the sequences into sub-blocks; then the total work can be divided as well.
Processing the sub-blocks in parallel on the SPEs can improve performance.
To which part can parallel processing be applied?

Parallel Processing (1/3)
Data dependency: to compute the next element, three elements must already be fixed: Left, Up, and Left-Up.
[Figure: element-level dependency in the score table.]

Parallel Processing (2/3)
If the blue part is fixed, the pink part can be computed in parallel.
The same method can be applied to blocks instead of elements.
[Figure: elements that can be computed in parallel once their dependencies are fixed.]

Parallel Processing (3/3)
In order to compute the pink block, the following are needed:
1. the lower-right element of the upper-left block,
2. the bottom row of the upper block, and
3. the rightmost column of the left block.
[Figure: block-level dependency.]
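
One way to exploit this, shown here only as an illustration, is a wavefront order: all blocks on the same anti-diagonal are independent of one another and can be handed to different SPEs. A rough sketch of how the PPE side could walk the blocks in that order (the two helper functions are hypothetical placeholders, not toolkit functions):

/* Blocks are indexed (bi, bj) with 0 <= bi < nb_a and 0 <= bj < nb_b.
 * Block (bi, bj) needs (bi-1, bj), (bi, bj-1) and (bi-1, bj-1), so all
 * blocks with the same value of bi + bj are mutually independent. */

extern void dispatch_block_to_spe(int bi, int bj);  /* hypothetical */
extern void wait_for_all_spes(void);                /* hypothetical */

void schedule_wavefront(int nb_a, int nb_b)
{
    for (int d = 0; d <= nb_a + nb_b - 2; d++) {    /* anti-diagonal index */
        for (int bi = 0; bi < nb_a; bi++) {
            int bj = d - bi;
            if (bj < 0 || bj >= nb_b)
                continue;
            dispatch_block_to_spe(bi, bj);          /* these may run in parallel */
        }
        wait_for_all_spes();                        /* barrier between diagonals */
    }
}

The job queue described two slides below is a more flexible alternative that avoids the barrier between diagonals.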

Input/Output of block computation
Input:
– the lower-right element of the upper-left block
– the bottom row of the upper block
– the rightmost column of the left block
Output:
– the lower-right element of the computed score table
– its bottom row
– its rightmost column
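
These inputs and outputs could be packed into small structures that are DMA-transferred between main memory and the Local Store. The sketch below is only illustrative: the names are made up, the letter sub-blocks of A and B are added because the SPE also needs them, the 128-element block size follows the earlier slides, and the 16-byte alignment follows the usual Cell DMA convention.

#define BLK 128   /* block edge length, matching the 128-letter blocks */

/* Everything one SPE needs to fill one block of the score table. */
typedef struct {
    int  corner;          /* lower-right element of the upper-left block */
    int  top_row[BLK];    /* bottom row of the upper block               */
    int  left_col[BLK];   /* rightmost column of the left block          */
    char a[BLK];          /* letters of sequence A covered by this block */
    char b[BLK];          /* letters of sequence B covered by this block */
} block_in_t __attribute__((aligned(16)));

/* What the SPE hands back for its own block. */
typedef struct {
    int  corner;          /* lower-right element of the computed block   */
    int  bottom_row[BLK]; /* its bottom row                              */
    int  right_col[BLK];  /* its rightmost column                        */
} block_out_t __attribute__((aligned(16)));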

Control of the SPE subroutines
Make a queue on the PPE to manage the jobs.
[Figure: the PPE holds the job queue (head and tail); it adds jobs and starts them on the SPEs, and each SPE informs the PPE when its job ends.]
Process on the PPE:
– Based on the block number that has just been computed, add the block numbers that can now start computing. The candidates are the blocks to the right and below it.
– Read a block number from the queue and assign it to a free SPE.
– Continue until the last block is computed.
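
A minimal single-threaded sketch of that bookkeeping in plain C (the names and layout are invented for this example; the libspe calls that actually start jobs and receive completion notices are omitted). Because a block only needs its upper and left neighbours to be finished (its upper-left neighbour is then necessarily finished as well), two dependencies per interior block are enough:

#define MAX_BLOCKS 4096

static int queue[MAX_BLOCKS];      /* block numbers that are ready to run   */
static int head = 0, tail = 0;     /* read at head, append at tail          */
static int deps_left[MAX_BLOCKS];  /* unfinished dependencies of each block */

static void push_ready(int blk) { queue[tail++] = blk; }
static int  pop_ready(void)     { return head < tail ? queue[head++] : -1; }

/* nb_a x nb_b blocks; block number = bi * nb_b + bj. */
void init_jobs(int nb_a, int nb_b)
{
    for (int bi = 0; bi < nb_a; bi++)
        for (int bj = 0; bj < nb_b; bj++)
            deps_left[bi * nb_b + bj] = (bi > 0) + (bj > 0);
    push_ready(0);                 /* the upper-left block has no dependencies */
}

/* Called when an SPE reports that block blk has finished: the right and
 * lower neighbours whose dependencies are now all satisfied become ready. */
void on_block_done(int blk, int nb_a, int nb_b)
{
    int bi = blk / nb_b, bj = blk % nb_b;
    if (bj + 1 < nb_b && --deps_left[bi * nb_b + (bj + 1)] == 0)
        push_ready(bi * nb_b + (bj + 1));        /* right neighbour */
    if (bi + 1 < nb_a && --deps_left[(bi + 1) * nb_b + bj] == 0)
        push_ready((bi + 1) * nb_b + bj);        /* lower neighbour */
}

/* Main loop idea: whenever an SPE becomes free, hand it pop_ready();
 * stop once the last block, (nb_a - 1, nb_b - 1), has been computed. */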

Subroutines for DMA (FYI)
Functions for data transfer: dmaget and dmaput, DMA read/write functions supported by the toolkit.
dmaget((void *)spe_addr, ppe_addr, X);
reads X bytes starting at ppe_addr and stores them starting at spe_addr in the Local Store. dmaput transfers data in the opposite direction.
[Figure: a transfer of X bytes between ppe_addr in PPE main memory and spe_addr in the SPE Local Store (aligned addresses).]
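
A hedged example of how these helpers might be used on the SPE side, reusing the block_in_t/block_out_t structures sketched earlier. The argument types, the compute_block helper, and the way main-memory addresses are passed in are assumptions made for illustration; only dmaget and dmaput themselves come from the toolkit description above, and they are assumed to be declared by the toolkit headers (presumably spe.h).

/* Local Store copies of one block's input and output. */
static block_in_t  in;
static block_out_t out;

extern void compute_block(const block_in_t *in, block_out_t *out);  /* hypothetical */

/* in_ea / out_ea: main-memory (PPE-side) addresses of this block's data;
 * the exact type the toolkit expects may differ. */
void process_block(unsigned long long in_ea, unsigned long long out_ea)
{
    dmaget((void *)&in, in_ea, sizeof(in));    /* pull the inputs into the Local Store */
    compute_block(&in, &out);                  /* fill the 128 x 128 sub-table         */
    dmaput((void *)&out, out_ea, sizeof(out)); /* push the boundary results back       */
}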

Towards the fastest program
Improve spe.c to fill the table. Improve ppe.c to control which blocks are computed.
For parallel processing: use SIMD instructions on the SPE; one operation can process multiple elements. In short, compute as many elements per instruction as possible.
Loop unrolling, __builtin_expect, and double buffering are useful techniques to try.
Good luck!
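
As a small, generic illustration of two of those hints (loop unrolling and __builtin_expect, without SIMD), the inner loop over one row of a block could look like the sketch below. This is not toolkit code; the function name, the layout (index 0 holds the boundary column from the left block), and the guess that a character match is the rarer case are all assumptions.

/* Fill one row of a block of the score table, width w (e.g. 128, assumed even).
 * row[]  : the row being written; row[0] is the boundary value from the left block.
 * prev[] : the row above, in the same layout.
 * b[]    : the w letters of sequence B along this row; a_i is this row's letter of A.
 * Unrolled by two; __builtin_expect marks the match as the (guessed) unlikely case. */
void fill_row(int *row, const int *prev, const char *b, char a_i, int w)
{
    for (int j = 1; j + 1 <= w; j += 2) {
        int d0 = prev[j - 1] + (__builtin_expect(b[j - 1] == a_i, 0) ? 1 : 0);
        int m0 = prev[j] > row[j - 1] ? prev[j] : row[j - 1];
        row[j] = m0 > d0 ? m0 : d0;

        int d1 = prev[j] + (__builtin_expect(b[j] == a_i, 0) ? 1 : 0);
        int m1 = prev[j + 1] > row[j] ? prev[j + 1] : row[j];
        row[j + 1] = m1 > d1 ? m1 : d1;
    }
}

Double buffering would additionally overlap the dmaget/dmaput transfers for the next block with the computation of the current one.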