A Concurrent Matrix Transpose Algorithm, The Implementation. Presented by Pourya Jafari.

Presentation transcript:

A Concurrent Matrix Transpose Algorithm, The Implementation. Presented by Pourya Jafari

Review: Algorithm Steps
- Pre-process inside each thread: shift rows
- Intra-process/thread communication: shift columns
- Post-process inside each thread: shift rows again

Review: Shift values?
- Set shifts based on the row index: range 0 to N-1
- Now arrange the rows so that the column shifts get us to i
- The column shift changes the row index: i' = i - L
- After it, the row shift (which adds the row index to the column index) should make the column equal the original row index i:
  i' + j = i  =>  i - L + j = i  =>  L = j
- So we shift each column j cells up
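
As a quick sanity check (a made-up instance, not from the slides): with N = 4, take the element at (i, j) = (3, 1). Shifting its column up by j = 1 moves it to row i' = 2, and the row shift then takes its column index to i' + j = 2 + 1 = 3, which is indeed the original row index i.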

Review: Last step?
- 1 → 2: column shift, j cells up
- 2 → 3: row shift based on row indices
- 3 → 4: ?
- Change of indices so far: (i, j) → (i - j, j) → (i - j, i - j + j) = (i - j, i) = (m, n)
- One operation changes the row index to j: n - m = i - (i - j) = j
(Figure: matrix states (1), (2-a), (2-b), (3), (4))
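
Chasing one element through the steps (again a made-up instance): for N = 4, the element at (i, j) = (3, 1) moves to (i - j, j) = (2, 1) after the column shift, then to (i - j, i) = (2, 3) = (m, n) after the row shift. Its transposed position is (j, i) = (1, 3): the column is already correct, and the row index it still needs is n - m = 3 - 2 = 1 = j, so the last operation can compute its shift amount from the local indices m and n.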

Review: Radix
- Using a radix representation, we can group the row shifts
- We use radix 2 for simplicity
- The digits are then the bit representation: in pass k, shift all rows whose index has its k-th bit set
- One shift pass per bit position (k = 0, k = 1, ...); the partial shifts add up to the full shift for each row
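
To make the grouping concrete, here is a minimal sequential sketch (an illustration under assumptions, not the concurrent implementation: it treats each row's shift as a cyclic left rotation by its own index, and rotateLeft is a hypothetical helper):

    // Shift row i by i in total, decomposed into radix-2 group shifts:
    // in pass k, every row whose index has bit k set shifts by 2^k.
    static void shiftRowsByRadix(double[][] a) {
        int n = a.length;
        for (int k = 0; (1 << k) < n; k++) {       // one pass per bit position
            for (int i = 0; i < n; i++) {
                if ((i & (1 << k)) != 0)           // rows with the k-th bit on ...
                    rotateLeft(a[i], 1 << k);      // ... take part in a shift by 2^k
            }
        }
    }

    static void rotateLeft(double[] row, int s) {
        int n = row.length;
        double[] tmp = new double[n];
        for (int j = 0; j < n; j++)
            tmp[j] = row[(j + s) % n];             // cyclic rotation by s
        System.arraycopy(tmp, 0, row, 0, n);
    }

In the concurrent version these log N passes are communication rounds, which is what makes the grouping worthwhile.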

The concurrency picture
- Each thread can do the pre/post processing independently
- Processes must synchronize:
  - after each phase
  - after each step of the intra-process phase
  - during intra-process communications

Communication package (1)
- We need a means of communication that:
  - facilitates synchronized communication
  - provides unbuffered communication, to save memory
- JCSP: based on the algebra of Communicating Sequential Processes (CSP)
  - has a strong theoretical background
  - is object oriented

Communication package (2)
JCSP provides:
- One2OneChannel: a single sender sends and a single receiver receives
- One2AnyChannel: a single sender and many receivers, but only one receiver takes any given message
- Any2OneChannel: multiple senders and one receiver
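
As a minimal sketch of how these channels behave (assuming JCSP 1.1's org.jcsp.lang package; the class and the value sent are illustrative, not from the presentation):

    import org.jcsp.lang.*;

    public class ChannelDemo {
        public static void main(String[] args) {
            final One2OneChannel chan = Channel.one2one();   // one sender, one receiver

            CSProcess sender = new CSProcess() {
                public void run() {
                    chan.out().write(Integer.valueOf(42));   // blocks until the reader is ready
                }
            };
            CSProcess receiver = new CSProcess() {
                public void run() {
                    Object v = chan.in().read();             // synchronized, unbuffered rendezvous
                    System.out.println("received " + v);
                }
            };

            new Parallel(new CSProcess[] { sender, receiver }).run();
        }
    }

The rendezvous is exactly the synchronized, unbuffered behaviour the previous slide asks for: neither side proceeds until both have met.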

Classes (1)
CProcess: column process
- Has a PID, knows N, and has an array to hold its items
- A One2OneChannel to each other process, for the intra-process shift operation
- A One2AnyChannel from MProcess, on which it receives start/resume calls
- An Any2OneChannel to MProcess, on which it signals that it has finished the current step
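
A hypothetical skeleton matching that description (all names and the constructor shape are assumptions, not the presentation's actual code):

    import org.jcsp.lang.*;

    class CProcess implements CSProcess {
        private final int pid;               // this column process's id
        private final int n;                 // matrix dimension N
        private final double[] items;        // the items of the column this process owns
        private final ChannelOutput toNext;  // One2One link to the next CProcess in the shift chain
        private final ChannelInput fromPrev; // One2One link from the previous CProcess
        private final ChannelInput resume;   // start/resume signals from MProcess
        private final ChannelOutput done;    // shared Any2One end for "step finished" signals

        CProcess(int pid, int n, double[] items,
                 ChannelOutput toNext, ChannelInput fromPrev,
                 ChannelInput resume, ChannelOutput done) {
            this.pid = pid; this.n = n; this.items = items;
            this.toNext = toNext; this.fromPrev = fromPrev;
            this.resume = resume; this.done = done;
        }

        public void run() {
            // Phase structure sketched on the following slides: pre-process,
            // intra-process shift steps, post-process; after each step it
            // writes to `done` and blocks on `resume` before continuing.
        }
    }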

Classes (2)
MProcess: master process
- Holds the One2AnyChannel and Any2OneChannel ends shared with the CProcesses
- Synchronizes the phases and the intra-process communication by waiting for all CProcesses to finish the current phase and then resuming them for the next one
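
A sketch of that synchronization loop (illustrative names; note one deliberate deviation: the slides describe a single shared One2Any resume channel, whereas this sketch uses one resume channel per CProcess, which rules out a fast worker consuming a resume token meant for a slower one):

    import org.jcsp.lang.*;

    class MProcess implements CSProcess {
        private final ChannelInput finished;    // Any2One end: CProcesses report "step done" here
        private final ChannelOutput[] resume;   // one release channel per CProcess (see note above)
        private final int nProcs, nPhases;

        MProcess(ChannelInput finished, ChannelOutput[] resume, int nProcs, int nPhases) {
            this.finished = finished; this.resume = resume;
            this.nProcs = nProcs; this.nPhases = nPhases;
        }

        public void run() {
            for (int phase = 0; phase < nPhases; phase++) {
                for (int i = 0; i < nProcs; i++)
                    finished.read();                // wait until every CProcess reports this phase done
                for (int i = 0; i < nProcs; i++)
                    resume[i].write(Boolean.TRUE);  // release everyone into the next phase
            }
        }
    }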

Classes (3)
Launcher: threads driver
- Creates the channels
- Creates one MProcess and the CProcesses
- Runs them in parallel
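
A sketch of the wiring, consistent with the hypothetical CProcess and MProcess above (N, the ring layout, and the per-worker resume channels are assumptions):

    import org.jcsp.lang.*;

    public class Launcher {
        public static void main(String[] args) {
            final int N = 8;                                  // matrix dimension (example value)

            One2OneChannel[] links = new One2OneChannel[N];   // ring of CProcess-to-CProcess links
            One2OneChannel[] resume = new One2OneChannel[N];  // per-worker resume channels
            for (int i = 0; i < N; i++) {
                links[i] = Channel.one2one();
                resume[i] = Channel.one2one();
            }
            Any2OneChannel done = Channel.any2one();          // CProcesses -> MProcess

            CSProcess[] procs = new CSProcess[N + 1];
            ChannelOutput[] resumeOut = new ChannelOutput[N];
            for (int i = 0; i < N; i++) {
                resumeOut[i] = resume[i].out();
                procs[i] = new CProcess(i, N, new double[N],
                        links[(i + 1) % N].out(),             // send to the next process in the ring
                        links[i].in(),                        // receive from the previous process
                        resume[i].in(), done.out());
            }
            procs[N] = new MProcess(done.in(), resumeOut, N, 3 /* phases, illustrative */);

            new Parallel(procs).run();                        // master and workers side by side
        }
    }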

Intra-process communication in CProcess
A process might send/receive multiple items per step. It:
- determines the indices that need to be shifted
- packs the items at those indices into a message
- sends the message to the next CProcess and receives one from the previous process in the shift chain
- unpacks the received message
- assigns the items inside it to the same indices determined in the first step
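
As a sketch, such a step could be a method on the hypothetical CProcess skeleton above (field names as assumed there; note that it sends before receiving, which is exactly what the next slides revisit):

    // One intra-process shift step: pack, send, receive, unpack.
    void shiftStep(int[] indices) {
        double[] msg = new double[indices.length];
        for (int k = 0; k < indices.length; k++)
            msg[k] = items[indices[k]];                // pack the items that must move

        toNext.write(msg);                             // send to the next process ...
        double[] in = (double[]) fromPrev.read();      // ... then receive from the previous one

        for (int k = 0; k < indices.length; k++)
            items[indices[k]] = in[k];                 // unpack into the same indices
    }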

UML Diagram

The Intra-process Shift
- Synchronized send and then receive
- A cycle might form:
  - all CProcesses go into the send state and wait for the next CProcess to receive
  - none of the CProcesses receives -> deadlock

The Shift Cycle (1)
- One CProcess in the cycle should receive first, to break the cycle
- But then it would lose the value it still has to send
- So it receives and buffers the incoming value, sends its own, and only then assigns the buffered value to the relevant array cell
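
A sketch of this cycle-breaking variant of the earlier shiftStep (the breaker flag marks the designated process):

    // Variant of shiftStep in which a designated breaker receives first.
    void shiftStep(int[] indices, boolean breaker) {
        double[] msg = new double[indices.length];
        for (int k = 0; k < indices.length; k++)
            msg[k] = items[indices[k]];                // outgoing value is safely packed in msg

        double[] in;
        if (breaker) {
            in = (double[]) fromPrev.read();           // receive and buffer first: breaks the cycle
            toNext.write(msg);                         // then send the packed value
        } else {
            toNext.write(msg);                         // everyone else still sends first
            in = (double[]) fromPrev.read();
        }
        for (int k = 0; k < indices.length; k++)
            items[indices[k]] = in[k];                 // finally assign the buffered values
    }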

The Shift Cycle (2)
- Cycles happen when the interleaving value h divides N
- So we do the buffered (receive-first) read in all processes with index less than h

The Shift Cycle (3)
- Even after this, the program runs into deadlock again
- Cycles in fact form whenever gcd(h, N) is greater than 1
- The shift splits into gcd(h, N) cycles, so we must do the buffered read in all processes with index less than gcd(h, N): one breaker per cycle
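
A sketch of that final rule (illustrative names): the shift permutation i -> (i + h) mod N has one cycle per residue class modulo gcd(h, N), so:

    // Decide whether this process must do the buffered (receive-first) read.
    static int gcd(int a, int b) {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    boolean isCycleBreaker(int pid, int h, int n) {
        return pid < gcd(h, n);                        // exactly one breaker per shift cycle
    }

Note that the earlier "h divides N" case is the special case gcd(h, N) = h, so this rule subsumes the previous slide.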

Results