Efficient Multiprogramming for Multicores with SCAF. Published in MICRO-46, December 2013, by Timothy Creech, Aparna Kotha and Rajeev Barua.


Presented by: Abhishek Mishra (13IS01F) and Ankit Patidar (13IS03F)

What is in the presentation: Problem Statement; Motivation behind Solving the Problem; Introduction; Existing Solutions to the Problem; Problems Faced by Existing Solutions; Solution Presented in the Paper; Performance Analysis; Conclusion.

Problem Statement: Today, hardware is becoming increasingly parallel, and parallel applications are increasingly valued. Applications consisting of multiple malleable processes need sophisticated, flexible resource management. A malleable process is one that can vary the number of threads it uses as it runs. Before SCAF, no good strategy existed for intelligently allocating hardware contexts to such processes, which degraded program efficiency.

Motivation behind solving the problem: Running multiple parallelized programs on a many-core system quickly oversubscribes the machine when no resource-utilization strategy is in place. A system is oversubscribed when the number of computationally intensive threads exceeds the number of hardware contexts. Conversely, when there are more cores than applications, no criterion exists for space-sharing the cores. Existing workarounds fall short: time-sharing the hardware through context switching is a poor solution, since context switches add overhead and long waiting times that reduce system throughput; and synchronization techniques such as spinlocks require dedicated hardware for reasonable performance. What is needed is a run-time allocation decision based on observed efficiency, made without any paradigm change, program modification, profiling, or recompilation, approximating space sharing (i.e., load balancing).

Introduction: The parallel efficiency of executed code is E = S/P, where executing the code in parallel on P hardware contexts yields a speedup S over serial execution. The MAIN TASK is to maximize E for all processes in the system; maximizing E maximizes the achieved speedup S. Because each additional hardware context can contribute substantial speedup, the system approximates space sharing of contexts among threads (load balancing). However, few parallel programs are truly malleable: the number of software threads is hard to change at arbitrary points during execution, so load balancing is not easy. SCAF (SCheduling and Allocation with Feedback) makes this allocation decision automatically for malleable processes.
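The definition E = S/P can be checked with a few lines of arithmetic. A minimal sketch; the timing numbers are made-up illustration values, not from the paper:

```python
def parallel_efficiency(serial_time, parallel_time, contexts):
    """E = S / P: speedup S = serial_time / parallel_time,
    divided by the number of hardware contexts P used."""
    speedup = serial_time / parallel_time
    return speedup / contexts

# Hypothetical run: 60 s serially, 10 s on 8 contexts.
# Speedup S = 6, so efficiency E = 6 / 8 = 0.75.
E = parallel_efficiency(60.0, 10.0, 8)
```

An efficiency of 1.0 would mean perfect linear scaling; values well below 1.0 signal that some of the allocated contexts are being wasted.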

Introduction contd. SCAF must satisfy the following requirements: total system efficiency is optimized; no modification or recompilation of applications is needed; it is effective in both batch and real-time processing scenarios; and system load from all processes, both truly malleable and not, is taken into account (even though SCAF controls only the malleable ones). SCAF seeks to solve the performance and administrative problems of running multiple multithreaded applications on multicore machines.

Existing solutions to the problem: 1. Controlling oversubscription by modifying the system's thread package: Tucker et al. modified the thread package used on the Encore Multimax and created a centralized daemon that limits the number of running threads on the system, suspending threads when necessary to avoid oversubscription. Disadvantages: no run-time performance measurements are taken into account, which is a poor fit for malleable processes; and it relies on a specific parallel paradigm in which the programmer must create a queue of tasks for the threads to execute, so manual intervention is needed. By modifying only the system's thread package, many programs were supported, but the approach was not fully automatic. Reference: A. Tucker and A. Gupta, "Process Control and Scheduling Issues for Multiprogrammed Shared-Memory Multiprocessors," in Proceedings of the Twelfth ACM Symposium on Operating Systems Principles (SOSP '89).

Existing solutions to the problem contd. 2. Load balancing by creating explicit worker threads: Arora et al. designed a strictly user-level, work-stealing thread scheduler that creates a certain number of worker threads, which are allowed to steal work from one another in order to balance load. In work stealing, the programmer specifies all parallelism declaratively, and the worker threads then execute it. Disadvantages: independent worker threads for each process can add up to more worker threads than hardware contexts, recreating the oversubscription problem; the approach relies on the work-stealing programming model and so does not take advantage of malleability; and the implementation is difficult. Reference: N. S. Arora, R. D. Blumofe, and C. G. Plaxton, "Thread Scheduling for Multiprogrammed Multiprocessors," in Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '98).

Existing solutions to the problem contd. 3. Run-time allocation decisions: McFarland created a prototype system called RSM, which includes a programming API. The RSM daemon makes allocation decisions according to run-time observations of the work done by each process; processes that perform more useful work receive larger allocations. Disadvantages: RSM considers only the absolute IPC of each process, whereas SCAF considers efficiency observed at run time; and programs must be recompiled. Reference: D. J. McFarland, Exploiting Malleable Parallelism on Multicore Systems. Blacksburg, Va.: University Libraries, Virginia Polytechnic Institute and State University, 2011.

SCAF vs. other related implementations: Fig1: Feature comparison of related implementations.

Solution presented in the paper: The SCheduling and Allocation with Feedback (SCAF) system presented in the paper has the following characteristics: it is a drop-in runtime solution; it supports existing malleable applications; its decisions are based on the observed efficiency of processes; and it requires no paradigm change, recompilation, or profiling of processes. The paper presents a SCAF daemon that estimates process efficiency purely at runtime using hardware counters; these estimates are then used in allocation decisions.

How does SCAF do this? To account for the efficiency of every process in the system, SCAF measures at runtime rather than relying on offline profiling. Recall that parallel efficiency is E = S/P, where S is the speedup and P is the number of hardware contexts. PROBLEM: To calculate S, we must know the serial execution time, but in a parallel program serial measurements are not directly available. How can the serial execution time be found at low cost? SOLUTION: SCAF estimates it by dynamically cloning the parallel process into a serial experimental process. The serial experiment runs concurrently with the parallel code for as long as the parallel code executes: the parallel process runs on N-1 cores and the serial experiment on 1 core, where N is the number of threads the parallel process has been allocated. NOTE: This does not give the exact value, but it gives a good estimate of the serial execution time.
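One way to picture the estimate is the following sketch. This is our simplification, not the paper's exact accounting; in particular, we assume the fraction of work the serial clone completed is directly observable:

```python
def estimated_speedup(parallel_time, serial_fraction_done):
    """While the parallel section runs for parallel_time seconds,
    the cloned serial experiment completes serial_fraction_done of
    the same work on one core.  Extrapolating gives an estimated
    serial runtime, and hence an estimated speedup."""
    est_serial_time = parallel_time / serial_fraction_done
    return est_serial_time / parallel_time

# Hypothetical: the parallel section took 4 s while the serial clone
# got through 1/8 of the work, suggesting a speedup of about 8x.
S_est = estimated_speedup(4.0, 0.125)
```

The key point is that the estimate comes for free as the program runs, with no separate profiling run of the serial version.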

Sharing policies used by SCAF: The SCAF daemon implements the following policies: minimizing the makespan, i.e., minimizing the total time needed to complete all jobs; equipartitioning, i.e., fair (initial) sharing of hardware resources, allocating P_j = N/k threads to each process j, where N is the number of hardware contexts available and k the number of running processes; maximizing the sum of speedups achieved by the running processes; and maximizing the sum of speedups based on runtime feedback, in which SCAF clients maintain and report a single efficiency estimate per process so that load balancing and intelligent allocation of resources can be done dynamically.

Efficiency estimate used by the SCAF daemon: The efficiency estimate allows the daemon to reason about how efficiently each process uses additional cores relative to the other processes. By cloning each process into a serial experiment, the daemon estimates the serial execution time of the program. It then uses the efficiency estimate to build a simple speedup model, where E_j is the parallel efficiency reported by process j and P_j is the previous allocation for j. NOTE: C_j is a per-process constant in this model that summarizes the recent resource usage of process j, i.e., the feedback about process j. Each round of client feedback contributes useful information for space-sharing the processes and distributing load among the hardware contexts.

Working of SCAF daemon: Fig2: Parallel section with lightweight serial experiment.

Example: In the figure, the number of hardware contexts is 16 and there are two processes, foo and baz. foo reports an efficiency of 2/3 on 12 threads; baz reports 3/8 on 4 threads. Applying the speedup model gives C foo = 6 and C baz = 1, and the new allocations are computed as P foo = 14 and P baz = 2. If the resulting feedback indicates a good match with the predicted model, the same model and solution are maintained and the allocation remains unchanged. Fig3: Runtime feedback loop in SCAF
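The arithmetic behind these numbers can be reproduced by splitting the N contexts in proportion to the model constants C_j. This is our reconstruction from the slide's values; the daemon's actual optimization and rounding may differ:

```python
def allocate_by_model(total_contexts, C):
    """Split N hardware contexts among processes in proportion to
    their speedup-model constants C_j, rounded to whole contexts."""
    total_C = sum(C.values())
    return {p: round(total_contexts * c / total_C)
            for p, c in C.items()}

# Slide example: N = 16, C_foo = 6, C_baz = 1.
# 16 * 6/7 ~= 13.7 -> 14 contexts, and 16 * 1/7 ~= 2.3 -> 2 contexts.
alloc = allocate_by_model(16, {"foo": 6, "baz": 1})
```

Note that naive rounding does not always preserve the total; a real allocator would have to repair any off-by-one against the N available contexts.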

Performance analysis of SCAF: SCAF was evaluated using the NAS NPB parallel benchmarks run concurrently in pairs. On an 8-core Xeon processor, 70% of benchmark pairs saw improvements in the sum of speedups averaging 15% compared to equipartitioning. On a 64-context Sparc T2 processor, 57% of benchmark pairs saw a similar 15% improvement over equipartitioning.

Performance analysis of SCAF contd. Fig4(a): Results on a dual Intel Xeon E5410 with 8 hardware contexts. Fig4(b): Results on a Sparc T2 processor with 64 hardware contexts.