Maximizing Speedup through Self-Tuning of Processor Allocation


Maximizing Speedup through Self-Tuning of Processor Allocation
Thu D. Nguyen, Raj Vaswani, and John Zahorjan (University of Washington)
Presenter: Minsu Kim, 17 Dec 2015

Introduction
- Ideal vs. reality: real speedup falls short of the ideal because of parallelization overhead, system overhead, idleness, and communication.
- Goal: dynamically maximize speedup (self-tuning!), since a static allocation may not be optimal for the entire job.

Overview
- Introduction
- Experimental Environment
- Self-Tuning Algorithms: basic, change-driven, time-driven
- Multi-Phase Self-Tuning Algorithms: independent (IMPST), inter-dependent (DMPST)
- Conclusion

Experimental Environment
- KSR-2: a COMA shared-memory multiprocessor running OSF/1 (a UNIX variant).
- A hardware monitoring unit on each node of the KSR-2 performs the runtime measurements; its event monitor counts cache misses, processor stall time, etc.
- Ten parallel applications: five hand-coded parallel applications and five compiler-parallelized sequential programs.
- Only iterative parallel applications are considered.

Runtime Measurement
- Chosen runtime metric: efficiency.
- Three critical hardware counters capture system overhead and processor stall time: elapsed wall-clock time, elapsed user-mode execution time, and accumulated processor stall time. These three registers are read at the beginning and end of each iteration.
- Measuring idleness: all thread synchronization code is instrumented to track elapsed idle time, assuming all application synchronization is invoked through calls to certain libraries.
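A minimal sketch of the per-iteration sampling described above. The three read_* callbacks are hypothetical stand-ins for the KSR-2 hardware monitor's registers (their real interface is not given in the slides); only the read-before/read-after bookkeeping follows the slide.

```python
def measure_iteration(read_wall_clock, read_user_time, read_stall_time, body):
    """Sample the three critical counters around one iteration of the job.

    read_wall_clock/read_user_time/read_stall_time are hypothetical
    stand-ins for the KSR-2 hardware monitoring registers; body() runs
    one iteration of the parallel application.
    """
    wall0, user0, stall0 = read_wall_clock(), read_user_time(), read_stall_time()
    body()                                    # one iteration of the parallel job
    wall1, user1, stall1 = read_wall_clock(), read_user_time(), read_stall_time()
    return {
        "wall":  wall1 - wall0,               # elapsed wall-clock time
        "user":  user1 - user0,               # elapsed user-mode execution time
        "stall": stall1 - stall0,             # accumulated processor stall time
    }
```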

Runtime Measurement
- Efficiency = 1 - loss; speedup = (# of processors) x efficiency.
- Loss components: parallelization overhead, system overhead, idleness, and processor stall (typically small).
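In symbols (the four-way decomposition of loss is the slide's; expressing each term as a fraction of the total processor time $P \cdot T_{\text{wall}}$ is an assumption about the accounting, not something the slide states):

\[
\text{loss} = L_{\text{par}} + L_{\text{sys}} + L_{\text{idle}} + L_{\text{stall}}, \qquad
\text{efficiency} = 1 - \text{loss}, \qquad
\text{speedup} = P \times \text{efficiency},
\]

where each $L$ is the corresponding lost time divided by $P \cdot T_{\text{wall}}$, and $L_{\text{stall}}$ is typically small.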

Self-Tuning Algorithms
- The basic algorithm (MGB) searches for the maximum of the speedup curve S(p).
- Iteration 1: measure S(P) on the full search domain [1, P]; since S(p) <= p, any allocation p < S(P) cannot beat S(P), so the domain shrinks to [S(P), P].
- Keep iterating while narrowing the interval.
- Extension to non-unimodal S functions, via a greedy heuristic: simply continue the search in the largest subinterval adjacent to the largest S found so far. Not guaranteed correct, yet quite practical.
[Figure: speedup curve S versus processor count p; step 1 reduces the search domain from [1, P] to [S(P), P].]
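A sketch of the interval-narrowing idea, assuming a unimodal S and a hypothetical measure_speedup(p) callback that runs one iteration on p processors and returns the observed speedup. The probe schedule below is a plain ternary search for illustration; the paper's actual MGB procedure and its probe order are not reproduced here.

```python
import math

def self_tune(measure_speedup, P):
    """Find p in [1, P] maximizing speedup S(p), assuming S is unimodal.

    measure_speedup(p) is a hypothetical callback: run one iteration of
    the job on p processors and return the observed speedup.
    """
    lo, hi = 1, P
    # Step 1: since S(p) <= p, no allocation below S(P) can beat the
    # speedup measured at P, so [1, P] shrinks to [ceil(S(P)), P].
    lo = max(lo, math.ceil(measure_speedup(P)))
    # Narrowing: probe two interior points and discard the subinterval
    # that cannot contain the maximum of a unimodal curve.
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if measure_speedup(m1) < measure_speedup(m2):
            lo = m1 + 1                  # maximum lies to the right of m1
        else:
            hi = m2 - 1                  # maximum lies to the left of m2
    # Probe the few remaining candidates directly.
    return max(range(lo, hi + 1), key=measure_speedup)
```

Note that in the real system each probe is an actual iteration of the running job, so the measurements are amortized over work the application had to do anyway.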

Advanced Self-Tuning Algorithms
- Change-driven self-tuning: continuously monitors job efficiency and re-initiates the search procedure whenever it notices a significant change in efficiency.
- Time-driven self-tuning: includes change-driven self-tuning, but also reruns the search procedure periodically regardless of changes in job efficiency, to handle the possibility that efficiency changes in the middle of a change-driven search.
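A sketch of how the two triggers wrap the job's iteration loop. The run_iteration, measure_efficiency, and retune hooks are hypothetical, and the 10% change threshold and 50-iteration period are illustrative assumptions, not values from the paper.

```python
def run_with_self_tuning(run_iteration, measure_efficiency, retune,
                         iterations, period=50, threshold=0.10):
    """Change-driven plus time-driven self-tuning around a job's iterations."""
    run_iteration(0)
    baseline = measure_efficiency()          # efficiency from the HW counters
    for i in range(1, iterations):
        run_iteration(i)
        eff = measure_efficiency()
        if abs(eff - baseline) > threshold * baseline:
            retune()                         # change-driven: efficiency shifted
            baseline = measure_efficiency()
        elif i % period == 0:
            retune()                         # time-driven: periodic re-search
            baseline = measure_efficiency()
```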

Performance
- Self-tuning imposes very little overhead.
- Basic self-tuning can significantly improve performance over no tuning.

Performance
- Time-driven self-tuning is not useful for the programs studied here.
- The performance benefit of self-tuning can be limited by the cost of probes.

Multi-Phase Self-Tuning
- One iteration = multiple parallel phases, where a phase is a specific piece of code.
- The speedup of each phase may be maximized at a different number of processors!
- Extension of the problem: an iteration contains N phases; find the processor allocation vector (p1, p2, ..., pN) that maximizes S. The basic self-tuning algorithm is then the special case of finding (p, p, ..., p).

Multi-Phase Self-Tuning
- Independent multi-phase self-tuning (IMPST): merely applies basic self-tuning to each phase independently. A naive, simple extension; the problem is that the performance of each phase ALSO depends on the allocations chosen for the other phases.
- Inter-dependent multi-phase self-tuning (DMPST): a randomized search technique, sketched below. Choose an initial candidate allocation vector; select new candidate vectors (components perturbed by random multipliers); evaluate and accept new candidates until steady state; then terminate the search.
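A minimal sketch of the DMPST idea. The measure_speedup(alloc) callback, the multiplier range, the fixed step budget (instead of a steady-state test), and the accept-only-improvements rule are all illustrative assumptions; the paper's exact randomized-search parameters are not reproduced here.

```python
import random

def dmpst(measure_speedup, N, P, steps=100):
    """Randomized search over per-phase allocation vectors (p1, ..., pN).

    measure_speedup(alloc) is a hypothetical callback: run one full
    iteration with alloc[i] processors in phase i, return the speedup.
    """
    current = [P] * N                        # initial candidate: all phases get P
    best, s_best = current[:], measure_speedup(current)
    for _ in range(steps):
        # New candidate: scale each phase's allocation by a random
        # multiplier, clamped to the legal range [1, P].
        candidate = [min(P, max(1, round(p * random.uniform(0.5, 1.5))))
                     for p in current]
        s = measure_speedup(candidate)
        if s > s_best:                       # accept only improving candidates
            best, s_best = candidate[:], s
            current = candidate
    return best, s_best
```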

Performance
- Multi-phase techniques can achieve performance not realizable by any fixed allocation.
- Inter-dependent self-tuning yields better performance than any of the other schemes.

Conclusion
- Maximizes application speedup through runtime self-selection of an appropriate number of processors on which to run, based on the ability to measure program inefficiencies (with hardware support).
- Peak speedups are data- or time-dependent, yet simple search procedures can automatically select appropriate numbers of processors.
- Relieves the user of the burden of determining the precise number of processors for each input data set, with the potential to outperform any static allocation.