Processor-oblivious parallel algorithms and scheduling. Illustration on parallel prefix. Jean-Louis Roch, Daouda Traore. INRIA-CNRS Moais team - LIG Grenoble, France.
Contents: I. What is a processor-oblivious parallel algorithm? II. Work-stealing scheduling of parallel algorithms. III. Processor-oblivious parallel prefix computation.
Workshop "Scheduling Algorithms for New Emerging Applications" - CIRM Luminy - May 29th-June 2nd, 2006.

The problem
Dynamic architecture: non-fixed number of resources, variable speeds. E.g. a grid, but not only: an SMP server in multi-user mode.
Problem: compute f(a). Candidates: sequential algorithm; parallel, P=2; parallel, P=100; parallel, P=max.
Target machines: multi-user SMP server, grid, heterogeneous network.
Which algorithm to choose?

Processor-oblivious algorithms
Dynamic architecture: non-fixed number of resources, variable speeds. E.g. a grid, but not only: an SMP server in multi-user mode. This motivates "processor-oblivious" parallel algorithms that:
+ are independent from the underlying architecture: no reference to p, nor to π_i(t) = the speed of processor i at time t, nor …;
+ on a given architecture, have performance guarantees: they behave as well as an optimal (off-line, non-oblivious) algorithm.
Problem: often, the larger the degree of parallelism, the larger the number of operations to perform!

Prefix computation
Prefix problem: input a_0, a_1, …, a_n; output π_0, π_1, …, π_n with π_i = a_0 * a_1 * … * a_i.
Sequential algorithm:
π[0] = a[0]; for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i];
It performs W_1 = W_∞ = n operations.
Fine-grain optimal parallel algorithm [Ladner-Fischer]: compute the pairwise products a_0*a_1, a_2*a_3, …, a_{n-1}*a_n; a prefix computation of size n/2 on them yields π_1, π_3, …, π_n; a final round of products yields π_2, π_4, …, π_{n-1}. Critical time W_∞ = 2·log n, but it performs W_1 = 2·n operations: twice as expensive as the sequential algorithm.
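A runnable sketch of the two algorithms on this slide (illustrative code, not the authors'; `op` stands for the associative operation `*`):

```python
from operator import mul

def prefix_seq(a, op=mul):
    """Sequential prefix: exactly n operations (W1 = Winf = n)."""
    out = [a[0]]
    for x in a[1:]:
        out.append(op(out[-1], x))
    return out

def prefix_lf(a, op=mul):
    """Ladner-Fischer style recursion: depth O(log n), but about 2n operations."""
    n = len(a)
    if n == 1:
        return a[:]
    # Round 1: pairwise products (all independent, hence parallel).
    pairs = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]
    # Recursive prefix of size ~n/2 gives the prefixes at odd positions.
    s = prefix_lf(pairs, op)
    out = [None] * n
    out[0] = a[0]
    for i in range(n // 2):
        out[2 * i + 1] = s[i]                        # odd positions: for free
        if 2 * i + 2 < n:
            out[2 * i + 2] = op(s[i], a[2 * i + 2])  # even positions: one extra op
    return out
```

Counting operations, `prefix_lf` performs about 2n products against n for the sequential loop, which is the factor-of-two work overhead the slide points out.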

Prefix computation: an example where parallelism always costs
Any parallel prefix algorithm with work W_1 and critical time W_∞ runs on p processors in time T_p ≥ max(W_∞, W_1/p); the strict lower bound is about 2n/(p+1), achieved by a block algorithm plus pipeline [Nicolau & al. 1996].
Question: how to design a generic parallel algorithm, independent from the architecture, that achieves optimal performance on any given architecture?
-> Design a malleable algorithm whose scheduling suits the number of operations performed to the architecture.
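The quoted lower bound can be sketched from the classical size-depth tradeoff for prefix circuits (work plus depth is at least about 2n, a result due to Snir); this derivation is a reconstruction, not taken from the slides:

```latex
% For any prefix circuit: W_1 + W_\infty \ge 2n - 2   (Snir's tradeoff).
% On p processors:
%   T_p \ge \max\!\left(W_\infty, \tfrac{W_1}{p}\right)
%       \ge \max\!\left(W_\infty, \tfrac{2n - 2 - W_\infty}{p}\right),
% minimized when both terms are equal, i.e. W_\infty = (2n-2)/(p+1), giving
\[
  T_p \;\ge\; \frac{2n - 2}{p + 1} \;\approx\; \frac{2n}{p + 1}.
\]
```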

Architecture model
- Heterogeneous processors with changing speeds [Bender-Rabin 02]: π_i(t) = the instantaneous speed of processor i at time t, in #operations per second.
- Average speed per processor for a computation of duration T: Π_ave = (1/(p·T)) Σ_i ∫_0^T π_i(t) dt.
- Lower bound for the time of prefix computation: T ≥ 2n / ((p+1)·Π_ave).

Work-stealing (1/2)
"Work" W_1 = total number of operations performed.
"Depth" W_∞ = number of operations on a critical path (the parallel time on an unbounded number of resources).
Work-stealing is a "greedy" schedule, but distributed and randomized: each processor manages locally the tasks it creates; when idle, a processor steals the oldest ready task from a remote, non-idle victim processor chosen at random.

Work-stealing (2/2)
Interests:
-> suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02];
-> if W_∞ is small enough, a near-optimal processor-oblivious schedule is obtained with good probability on p processors with average speed Π_ave.
NB: #successful steals = #task migrations < p·W_∞ [Blumofe 98, Narlikar 01, Bender 02].
Implementation: the work-first principle [Cilk, series-parallel; Kaapi, dataflow]:
-> move the scheduling overhead onto the steal operations (the infrequent case);
-> in the general case, "local parallelism" is implemented by a sequential function call.
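A minimal single-threaded sketch of the stealing discipline described above (illustrative names, no real concurrency): each worker owns a deque, the owner works on the young end, and an idle worker steals the oldest ready task from a random victim:

```python
from collections import deque
import random

class Worker:
    """One simulated processor in a work-stealing scheduler."""
    def __init__(self):
        self.tasks = deque()              # local deque of ready tasks

    def push(self, task):
        self.tasks.append(task)           # owner pushes at the "young" end

    def pop(self):
        return self.tasks.pop() if self.tasks else None   # owner pops LIFO

    def steal_from(self, victim):
        # A thief takes the *oldest* ready task (FIFO end), as on the slide.
        return victim.tasks.popleft() if victim.tasks else None

def run(workers, seed=0):
    """Round-robin simulation: idle workers steal from random victims."""
    rng, done = random.Random(seed), []
    while any(w.tasks for w in workers):
        for w in workers:
            task = w.pop()
            if task is None:                      # idle: attempt one steal
                task = w.steal_from(rng.choice(workers))
            if task is not None:
                done.append(task())               # "execute" the task
    return done
```

Every task is executed exactly once regardless of who steals what, which is the point of the discipline; a real runtime (Cilk, Kaapi) does this with lock-free deques and one thread per processor.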

How to get both optimal work W_1 and W_∞ small?
General approach: mix a sequential algorithm with optimal work W_1 and a fine-grain parallel algorithm with minimal critical time W_∞.
Folk technique: parallel, then sequential. Use the parallel algorithm down to a certain "grain", then use the sequential one. Drawback: W_∞ increases ;o) … and so does the number of steals.
Work-preserving speed-up technique [Bini-Pan 94]: sequential, then parallel. Cascading [Jaja 92]: a careful interplay of both algorithms builds one with both W_∞ small and W_1 = O(W_seq): use the work-optimal sequential algorithm to reduce the size, then the time-optimal parallel algorithm to decrease the time. Drawback: sequential at coarse grain and parallel at fine grain ;o(

Alternative: concurrently sequential and parallel
Based on the work-first principle: always execute a sequential algorithm, to reduce parallelism overhead; use the parallel algorithm only when a processor becomes idle (i.e. steals), by extracting parallelism from the sequential computation.
Hypothesis: two algorithms are available:
- one sequential: SeqCompute;
- one parallel: LastPartComputation: at any time, parallelism can be extracted from the remaining computations of the sequential algorithm.
This yields self-adaptive granularity based on work-stealing: a steal on a SeqCompute extracts its LastPartComputation, which in turn runs as a SeqCompute.

Adaptive Prefix on 3 processors (animation over five slides)
- The main sequential process computes π_1, π_2, … from π_0 and a_1 … a_12.
- On a steal request, work-stealer 1 extracts the last part a_5 … a_12 and computes the partial products β_i = a_5 * … * a_i.
- On a further steal request, work-stealer 2 extracts a_9 … a_12 from work-stealer 1 and computes γ_i = a_9 * … * a_i.
- When the main process reaches the stolen part, it preempts work-stealer 1: it computes π_8 = π_4 * β_8 and continues from there, while the stealer finishes π_5, π_6, π_7; the same happens later with work-stealer 2, yielding π_9 … π_12.
- The execution follows an implicit critical path on the sequential process.

Analysis of the algorithm
Execution time: T_p ≤ 2n / ((p+1)·Π_ave) + O(log n / Π_ave), to be compared with the lower bound T_p ≥ 2n / ((p+1)·Π_ave).
Sketch of the proof: dynamic coupling of two algorithms that complete simultaneously:
- sequential: performs the (optimal) number of operations S on one processor;
- parallel: minimal time, but performs X operations on the other processors; dynamic splitting is always possible down to the finest grain, BUT stays locally sequential;
- the critical path is small (e.g. log X); every non-constant-time task can potentially be split (variable speeds);
- the algorithmic scheme ensures T_s = T_p + O(log X), which bounds the total number X of operations performed and the overhead of parallelism = (S + X) − #ops_optimal.

Adaptive prefix: experiments 1
Single-user context: the processor-oblivious prefix achieves near-optimal performance:
- close to the lower bound, both on 1 processor and on p processors;
- less sensitive to system overhead: even better than the theoretically "optimal" off-line parallel algorithm on p processors.
[Plot: prefix sum of doubles on an SMP with 8 procs (IA64 1.5 GHz / Linux); time (s) versus #processors, comparing the pure sequential algorithm, the optimal off-line algorithm on p procs, and the oblivious one.]

Adaptive prefix: experiments 2
Multi-user context, with additional external charge: (9−p) additional external dummy processes are executed concurrently. The processor-oblivious prefix computation is always the fastest, with a 15% benefit over a parallel algorithm for p processors with an off-line schedule.
[Plot: prefix sum of doubles on an SMP with 8 procs (IA64 1.5 GHz / Linux); time (s) versus #processors, under external charge (9−p external processes), comparing the off-line parallel algorithm for p processors with the oblivious one.]

Conclusion
The interplay of an on-line parallel algorithm with a work-stealing schedule is useful for the design of processor-oblivious algorithms.
Application to prefix computation:
- theoretically, it reaches the lower bound on heterogeneous processors with changing speeds;
- practically, it achieves near-optimal performance on multi-user SMPs.
This yields a generic adaptive scheme to implement parallel algorithms with provable performance.
Work in progress: parallel 3D reconstruction [oct-tree scheme with a deadline constraint].

Thank you!
Interactive Distributed Simulation [B. Raffin & E. Boyer]: 5 cameras, 6 PCs; 3D reconstruction + simulation + rendering.
-> Adaptive scheme to maximize the 3D-reconstruction precision within a fixed time step.

The Prefix race: sequential / parallel, fixed / adaptive
Competitors: adaptive on 8 procs; parallel on 8, 7, 6, 5, 4, 3, and 2 procs; sequential.
On each of the 10 executions, the adaptive version completes first.

Adaptive prefix: some experiments
Single-user context: adaptive is equivalent to:
- the sequential algorithm on 1 proc;
- the optimal 2-proc parallel algorithm on 2 processors;
- …
- the optimal 8-proc parallel algorithm on 8 processors.
Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm.
[Plots: prefix of … elements on an SMP with 8 procs (IA64 / Linux); time (s) versus #processors, with and without external charge, comparing parallel and adaptive.]

With * = double sum (r[i] = r[i-1] + x[i]).
Single user; processors with variable speeds.
Remark, for n = … doubles:
- "pure" sequential: 0.20 s;
- minimal "grain" of 100 doubles: 0.26 s, on 1 proc as well as on 2 procs (close to the lower bound).
Finest "grain" limited to 1 page = 16,384 bytes = 2048 doubles.