Juan Mendivelso.  Serial Algorithms: Suitable for running on an uniprocessor computer in which only one instruction executes at a time.  Parallel Algorithms:

Slides:



Advertisements
Similar presentations
Multiple Processor Systems
Advertisements

© 2009 Charles E. Leiserson and Pablo Halpern1 Introduction to Cilk++ Programming PADTAD July 20, 2009 Cilk, Cilk++, Cilkview, and Cilkscreen, are trademarks.
Distributed Systems CS
SE-292 High Performance Computing
1 Chapter 1 Why Parallel Computing? An Introduction to Parallel Programming Peter Pacheco.
U NIVERSITY OF M ASSACHUSETTS, A MHERST – Department of Computer Science The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and.
CILK: An Efficient Multithreaded Runtime System. People n Project at MIT & now at UT Austin –Bobby Blumofe (now UT Austin, Akamai) –Chris Joerg –Brad.
Performance Analysis of Multiprocessor Architectures
Lecture 7 : Parallel Algorithms (focus on sorting algorithms) Courtesy : SUNY-Stony Brook Prof. Chowdhury’s course note slides are used in this lecture.
Distributed Process Scheduling Summery Distributed Process Scheduling Summery BY:-Yonatan Negash.
Extending the Unified Parallel Processing Speedup Model Computer architectures take advantage of low-level parallelism: multiple pipelines The next generations.
CSCI 8150 Advanced Computer Architecture Hwang, Chapter 1 Parallel Computer Models 1.2 Multiprocessors and Multicomputers.
11Sahalu JunaiduICS 573: High Performance Computing5.1 Analytical Modeling of Parallel Programs Sources of Overhead in Parallel Programs Performance Metrics.
Distributed Processing, Client/Server, and Clusters
Revisiting a slide from the syllabus: CS 525 will cover Parallel and distributed computing architectures – Shared memory processors – Distributed memory.
Reference: Message Passing Fundamentals.
1 Tuesday, November 07, 2006 “If anything can go wrong, it will.” -Murphy’s Law.
Compiler Challenges, Introduction to Data Dependences Allen and Kennedy, Chapter 1, 2.
Image Processing Using Cilk 1 Parallel Processing – Final Project Image Processing Using Cilk Tomer Y & Tuval A (pp25)
Database Management Systems I Alex Coman, Winter 2006
Lecture 37: Chapter 7: Multiprocessors Today’s topic –Introduction to multiprocessors –Parallelism in software –Memory organization –Cache coherence 1.
What is Concurrent Programming? Maram Bani Younes.
Dr. Ken Hoganson, © August 2014 Programming in R STAT8030 Programming in R COURSE NOTES 1: Hoganson Programming Languages.
Computer System Architectures Computer System Software
Multithreaded Algorithms Andreas Klappenecker. Motivation We have discussed serial algorithms that are suitable for running on a uniprocessor computer.
1 Distributed Operating Systems and Process Scheduling Brett O’Neill CSE 8343 – Group A6.
Parallel Programming Models Jihad El-Sana These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National.
Performance Evaluation of Parallel Processing. Why Performance?
1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.
Network Aware Resource Allocation in Distributed Clouds.
STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.
 2004 Deitel & Associates, Inc. All rights reserved. 1 Chapter 4 – Thread Concepts Outline 4.1 Introduction 4.2Definition of Thread 4.3Motivation for.
LATA: A Latency and Throughput- Aware Packet Processing System Author: Jilong Kuang and Laxmi Bhuyan Publisher: DAC 2010 Presenter: Chun-Sheng Hsueh Date:
April 26, CSE8380 Parallel and Distributed Processing Presentation Hong Yue Department of Computer Science & Engineering Southern Methodist University.
1 CS 140 : Feb 19, 2015 Cilk Scheduling & Applications Analyzing quicksort Optional: Master method for solving divide-and-conquer recurrences Tips on parallelism.
SOFTWARE DESIGN. INTRODUCTION There are 3 distinct types of activities in design 1.External design 2.Architectural design 3.Detailed design Architectural.
Lecture 13 Advanced Transaction Models. 2 Protocols considered so far are suitable for types of transactions that arise in traditional business applications,
Memory Consistency Models. Outline Review of multi-threaded program execution on uniprocessor Need for memory consistency models Sequential consistency.
Classic Model of Parallel Processing
©Silberschatz, Korth and Sudarshan14.1Database System Concepts - 6 th Edition Chapter 14: Transactions Transaction Concept Transaction State Concurrent.
13-1 Chapter 13 Concurrency Topics Introduction Introduction to Subprogram-Level Concurrency Semaphores Monitors Message Passing Java Threads C# Threads.
Static Process Scheduling
Pipelined and Parallel Computing Partition for 1 Hongtao Du AICIP Research Nov 3, 2005.
Data Structures and Algorithms in Parallel Computing
CDP Tutorial 3 Basics of Parallel Algorithm Design uses some of the slides for chapters 3 and 5 accompanying “Introduction to Parallel Computing”, Addison.
3/12/2013Computer Engg, IIT(BHU)1 OpenMP-1. OpenMP is a portable, multiprocessing API for shared memory computers OpenMP is not a “language” Instead,
ECE 526 – Network Processing Systems Design Programming Model Chapter 21: D. E. Comer.
Uses some of the slides for chapters 3 and 5 accompanying “Introduction to Parallel Computing”, Addison Wesley, 2003.
SMP Basics KeyStone Training Multicore Applications Literature Number: SPRPxxx 1.
Concurrency and Performance Based on slides by Henri Casanova.
Process by Dr. Amin Danial Asham. References Operating System Concepts ABRAHAM SILBERSCHATZ, PETER BAER GALVIN, and GREG GAGNE.
1/50 University of Turkish Aeronautical Association Computer Engineering Department Ceng 541 Introduction to Parallel Computing Dr. Tansel Dökeroğlu
Computer Science and Engineering Parallel and Distributed Processing CSE 8380 April 28, 2005 Session 29.
1 Potential for Parallel Computation Chapter 2 – Part 2 Jordan & Alaghband.
1 Parallel Processing Fundamental Concepts. 2 Selection of an Application for Parallelization Can use parallel computation for 2 things: –Speed up an.
Chapter 4 – Thread Concepts
DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S
CILK: An Efficient Multithreaded Runtime System
Chapter 4 – Thread Concepts
Computer Engg, IIT(BHU)
The University of Adelaide, School of Computer Science
Abstract Major Cloud computing companies have started to integrate frameworks for parallel data processing in their product portfolio, making it easy for.
Multithreaded Programming in Cilk LECTURE 1
CSE8380 Parallel and Distributed Processing Presentation
Introduction to CILK Some slides are from:
Multithreaded Programming
PERFORMANCE MEASURES. COMPUTATIONAL MODELS Equal Duration Model:  It is assumed that a given task can be divided into n equal subtasks, each of which.
Cilk and Writing Code for Hardware
Operating System Overview
Introduction to CILK Some slides are from:
Presentation transcript:

Juan Mendivelso

 Serial Algorithms: Suitable for running on a uniprocessor computer in which only one instruction executes at a time.  Parallel Algorithms: Run on a multiprocessor computer that permits multiple instructions to execute concurrently.

 Computers with multiple processing units.  They can be:  Chip Multiprocessors: Inexpensive laptops/desktops. They contain a single multicore integrated circuit that houses multiple processor “cores,” each of which is a full-fledged processor with access to a common memory.

 Computers with multiple processing units.  They can be:  Clusters: Built from individual computers with a dedicated network interconnecting them. Intermediate price/performance.

 Computers with multiple processing units.  They can be:  Supercomputers: Combination of custom architectures and custom networks to deliver the highest performance (instructions per second). High price.

 Although the random-access machine model gained early acceptance for serial computing, no comparable model has been established for parallel computing.  A major reason is that vendors have not agreed on a single architectural model for parallel computers.

 For example, some parallel computers feature shared memory, where every processor can access any location of memory.  Others employ distributed memory, where each processor has a private memory.  However, the trend appears to be toward shared-memory multiprocessors.

 One way to program a shared-memory parallel computer is static threading:  a software abstraction of “virtual processors,” or threads, sharing a common memory.  Each thread can execute code independently.  For most applications, threads persist for the duration of a computation.
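
As a concrete illustration (not from the slides), here is a minimal sketch of static threading in standard C++: a fixed set of threads is created up front, each persists for the whole computation, and each is statically assigned one chunk of the data. The names (data, partial, chunk) are invented for the sketch.

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    // One persistent thread per "virtual processor".
    const unsigned p = std::max(1u, std::thread::hardware_concurrency());

    std::vector<double> data(1'000'000, 1.0);
    std::vector<double> partial(p, 0.0);   // one partial sum per thread
    std::vector<std::thread> workers;

    const std::size_t chunk = data.size() / p;
    for (unsigned t = 0; t < p; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = (t + 1 == p) ? data.size() : begin + chunk;
        // The work is partitioned statically: thread t always owns [begin, end).
        workers.emplace_back([&data, &partial, begin, end, t] {
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();      // wait for every thread to finish

    const double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    (void)total;
}
```

If the chunks take unequal time, some threads sit idle while the rest finish; that is the load-balancing difficulty the next slides describe.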

 Programming a shared-memory parallel computer directly using static threads is difficult and error prone.  Dynamically partitioning the work among the threads so that each thread receives approximately the same load turns out to be complicated.

 The programmer must use complex communication protocols to implement a scheduler to load-balance the work.  This has led to the creation of concurrency platforms. They provide a layer of software that coordinates, schedules and manages the parallel-computing resources.

 Class of concurrency platform.  It allows programmers to specify parallelism in applications without worrying about communication protocols, load balancing, etc.  The concurrency platform contains a scheduler that load-balances the computation automatically.

 It supports:  Nested parallelism: a subroutine can be spawned, allowing the caller to proceed while the spawned subroutine computes its result.  Parallel loops: regular for loops, except that the iterations can execute concurrently.
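
A short sketch of both features, assuming an OpenCilk/Cilk Plus style compiler that provides the cilk_spawn, cilk_sync, and cilk_for keywords; the functions (summarize, example) are invented for illustration.

```cpp
#include <cilk/cilk.h>

// Illustrative helper: a serial summary of the first half of the array.
static double summarize(const double* a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; ++i) s += a[i];
    return s;
}

void example(double* a, long n) {
    // Nested parallelism: the caller may keep going while the child runs.
    double s = cilk_spawn summarize(a, n / 2);

    // Parallel loop: the iterations may (but need not) run concurrently.
    cilk_for (long i = n / 2; i < n; ++i)
        a[i] *= 2.0;

    cilk_sync;    // wait for the spawned child before using its result
    a[0] += s;    // now safe: the child has finished
}
```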

 The user only specifies the logical parallelism.  Simple extension of the serial model with three keywords: parallel, spawn, and sync.  Clean way to quantify parallelism.  Many multithreaded algorithms involving nested parallelism follow naturally from the divide-and-conquer paradigm.

 Fibonacci Example  The serial algorithm Fib(n) repeats work on overlapping subproblems, which leads to exponential complexity.  However, the two recursive calls are independent!  Parallel algorithm: P-Fib(n)
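
Rendered in Cilk-style C++ (same OpenCilk-style compiler assumption as above; pfib is just an illustrative name), the parallel version spawns one of the two independent recursive calls:

```cpp
#include <cilk/cilk.h>

// P-Fib: the two recursive calls are independent, so one of them is spawned.
long pfib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn pfib(n - 1);  // the child may run in parallel with us
    long y = pfib(n - 2);             // the parent continues concurrently
    cilk_sync;                        // wait for the child before using x
    return x + y;
}
```

Deleting cilk_spawn and cilk_sync leaves the ordinary serial Fib; that is exactly the serialization defined on the next slide.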

 Concurrency keywords: spawn, sync and parallel  The serialization of a multithreaded algorithm is the serial algorithm that results from deleting the concurrency keywords.

 Spawning occurs when the keyword spawn precedes a procedure call.  It differs from an ordinary procedure call in that the procedure instance that executes the spawn (the parent) may continue to execute in parallel with the spawned subroutine (its child) instead of waiting for the child to complete.

 It doesn’t say that a procedure must execute concurrently with its spawned children; only that it may!  The concurrency keywords express the logical parallelism of the computation.  At runtime, it is up to the scheduler to determine which subcomputations actually run concurrently by assigning them to processors.

 A procedure cannot safely use the values returned by its spawned children until after it executes a sync statement.  The keyword sync indicates that the procedure must wait until all its spawned children have been completed before proceeding to the statement after the sync.  Every procedure executes a sync implicitly before it returns.
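
The implicit sync can be seen in a sketch like this (OpenCilk-style keywords again; process and process_all are invented helpers):

```cpp
#include <cilk/cilk.h>
#include <numeric>
#include <vector>

// Invented helper: collapse one block into its sum, stored in element 0.
static void process(std::vector<int>& block) {
    if (!block.empty())
        block[0] = std::accumulate(block.begin(), block.end(), 0);
}

void process_all(std::vector<std::vector<int>>& blocks) {
    for (auto& b : blocks)
        cilk_spawn process(b);  // each child may run in parallel with the loop
    // No explicit cilk_sync is written: a procedure executes a sync implicitly
    // before it returns, so every child has completed when process_all returns.
}
```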

 We can view a multithreaded computation as a directed acyclic graph G = (V, E) called a computation dag.  The vertices are instructions and the edges represent dependencies between instructions, where (u, v) ∈ E means that instruction u must execute before instruction v.

 If a chain of instructions contains no parallel control (no spawn, sync, or return), we may group the chain into a single strand; a strand thus represents one or more instructions.  Instructions involving parallel control are not included in strands, but are represented in the structure of the dag.

 For example, if a strand has two successors, one of them must have been spawned, and a strand with multiple predecessors indicates the predecessors joined because of a sync.  Thus, in the general case, the set V forms the set of strands, and the set E of directed edges represents dependencies between strands induced by parallel control.

 If G has a directed path from strand u to strand v, we say that the two strands are (logically) in series. Otherwise, strands u and v are (logically) in parallel.  We can picture a multithreaded computation as a dag of strands embedded in a tree of procedure instances.  Example!

 We can classify the edges:  Continuation edges: connect a strand u to its successor u’ within the same procedure instance.  Call edges: represent normal procedure calls.  Return edges: when a strand u returns to its calling procedure, a return edge connects u to the strand x immediately following the next sync in the calling procedure.  A computation starts with an initial strand and ends with a single final strand.

 A parallel computer that consists of a set of processors and a sequentially consistent shared memory.  Sequentially consistent means that the shared memory behaves as if the multithreaded computation’s instructions were interleaved to produce a linear order that preserves the partial order of the computation dag.

 Depending on scheduling, the ordering could differ from one run of the program to another.  The ideal-parallel-computer model makes some performance assumptions:  Each processor in the machine has equal computing power  It ignores the cost of scheduling.

 Work:  Total time to execute the entire computation on one processor.  Sum of the times taken by each of the strands.  In the computational dag, it is the number of strands (assuming each strand takes a time unit).

 Span:  Longest time to execute the strands along any path in the dag.  With unit-time strands, the span equals the number of vertices on a longest, or critical, path.  Example!
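
Because each unit-time strand is one vertex, work and span can be computed directly from the dag. A small C++ sketch, assuming the strands are numbered in a topological order (as they are in the small hand-drawn examples):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A computation dag of unit-time strands 0..n-1.
// edges[u] lists the strands that depend directly on strand u.
// Assumes the numbering is a valid topological order.
using Dag = std::vector<std::vector<std::size_t>>;

std::size_t work(const Dag& edges) {
    return edges.size();                     // one time unit per strand
}

std::size_t span(const Dag& edges) {
    const std::size_t n = edges.size();
    std::vector<std::size_t> longest(n, 1);  // longest path ending at each strand
    for (std::size_t u = 0; u < n; ++u)      // topological order assumed
        for (std::size_t v : edges[u])
            longest[v] = std::max(longest[v], longest[u] + 1);
    return n ? *std::max_element(longest.begin(), longest.end()) : 0;
}
```

For the P-Fib(4) dag in Cormen et al., for instance, this yields work 17 and span 8.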

 The actual running time of a multithreaded computation depends also on how many processors are available and how the scheduler allocates strands to processors.  Running time on P processors: T_P  Work: T_1  Span: T_∞ (unlimited number of processors)

 The work and span provide lower bounds on the running time of a multithreaded computation T_P on P processors:  Work law: T_P ≥ T_1/P  Span law: T_P ≥ T_∞
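
The two laws combine into a single lower bound; a worked instance with purely illustrative numbers:

```latex
T_P \;\ge\; \max\!\left(\frac{T_1}{P},\; T_\infty\right),
\qquad \text{e.g. } T_1 = 1000,\; T_\infty = 10,\; P = 8
\;\Rightarrow\; T_8 \ge \max(125,\, 10) = 125 .
```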

 Speedup:  Speedup of a computation on P processors is the ratio T_1/T_P  How many times faster the computation is on P processors than on one processor.  It’s at most P.  Linear speedup: T_1/T_P = Θ(P)  Perfect linear speedup: T_1/T_P = P

 Parallelism:  T_1/T_∞  The average amount of work that can be performed in parallel for each step along the critical path.  As an upper bound, the parallelism gives the maximum possible speedup that can be achieved on any number of processors.  The parallelism provides a limit on the possibility of attaining perfect linear speedup.
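
Continuing the same illustrative numbers:

```latex
\text{parallelism} \;=\; \frac{T_1}{T_\infty} \;=\; \frac{1000}{10} \;=\; 100,
\qquad \text{so the speedup } \frac{T_1}{T_P} \le 100 \text{ on any number of processors.}
```

In particular, perfect linear speedup is impossible once P exceeds 100.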

 Good performance depends on more than minimizing the span and work.  The strands must also be scheduled efficiently onto the processors of the parallel machine.  Our multithreaded programming model provides no way to specify which strands to execute on which processors. Instead, we rely on the concurrency platform’s scheduler.

 A multithreaded scheduler must schedule the computation with no advance knowledge of when strands will be spawned or when they will complete—it must operate on-line.  Moreover, a good scheduler operates in a distributed fashion, where the threads implementing the scheduler cooperate to load-balance the computation.

 To keep the analysis simple, we shall consider an on-line centralized scheduler, which knows the global state of the computation at any given time.  In particular, we shall consider greedy schedulers, which assign as many strands to processors as possible in each time step.

 If at least P strands are ready to execute during a time step, we say that the step is a complete step, and a greedy scheduler assigns any P of the ready strands to processors.  Otherwise, fewer than P strands are ready to execute, in which case we say that the step is an incomplete step, and the scheduler assigns each ready strand to its own processor.
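
A greedy schedule on a unit-time strand dag can also be simulated directly. A sketch in C++ (same adjacency-list dag representation as the earlier work/span sketch; greedy_steps is an invented name):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simulate a greedy scheduler on a dag of unit-time strands (P >= 1).
// edges[u] lists the strands that depend directly on strand u.
// Returns the number of time steps the greedy schedule takes on P processors.
std::size_t greedy_steps(const std::vector<std::vector<std::size_t>>& edges,
                         std::size_t P) {
    const std::size_t n = edges.size();
    std::vector<std::size_t> indeg(n, 0);
    for (const auto& succ : edges)
        for (std::size_t v : succ) ++indeg[v];

    std::vector<std::size_t> ready;
    for (std::size_t v = 0; v < n; ++v)
        if (indeg[v] == 0) ready.push_back(v);

    std::size_t steps = 0;
    while (!ready.empty()) {
        ++steps;
        // Complete step: at least P strands are ready, run any P of them.
        // Incomplete step: fewer than P are ready, run them all.
        const std::size_t run = std::min(P, ready.size());
        std::vector<std::size_t> executed(ready.end() - run, ready.end());
        ready.resize(ready.size() - run);
        for (std::size_t u : executed)
            for (std::size_t v : edges[u])
                if (--indeg[v] == 0) ready.push_back(v);
    }
    return steps;
}
```

Comparing greedy_steps(edges, P) with work(edges)/P + span(edges) on example dags illustrates the bound stated on the next slide.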

 A greedy scheduler executes a multithreaded computation in time: T_P ≤ T_1/P + T_∞  Greedy scheduling is provably good because it achieves the sum of the lower bounds as an upper bound.  Moreover, it is within a factor of 2 of optimal.
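
With the same illustrative numbers as before:

```latex
T_P \;\le\; \frac{T_1}{P} + T_\infty
\qquad\Rightarrow\qquad
T_8 \;\le\; \frac{1000}{8} + 10 \;=\; 135 .
```

Against the work-law lower bound of 125, this greedy schedule is within 8 percent of optimal, well inside the factor-of-2 guarantee.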