Adaptive Multi-Threading for Dynamic Workloads in Embedded Multiprocessors 林鼎原 Department of Electrical Engineering National Cheng Kung University Tainan,

Adaptive Multi-Threading for Dynamic Workloads in Embedded Multiprocessors 林鼎原 Department of Electrical Engineering National Cheng Kung University Tainan, Taiwan, R.O.C 2015/11/18 1

2 Abstract(1/1)  We present a framework for run-time parallelism adaptation of multithreaded applications in multi-core systems.  Multi-core systems often execute diverse workloads of multithreaded programs with different system resource utilizations and varying levels of parallelism.  The proposed framework monitors the dynamically changing shared system resources, such as the available processor cores, and adapts the number of threads used by the applications throughout a parallel loop execution so as to match the parallelism level to the changed state of the system resources.  The end result is the elimination of sizable overheads caused by an improper level of parallelism.

3 Introduction(1/3)  Uniprocessor systems have encountered enormous design difficulties as processors reached GHz frequencies.  Design complexity, circuit synchronization, the speed gap between processors and memory, power consumption, and thermal issues have hindered the rate of further advancement.  To continue the growth of computer system performance, the industry has turned to multi-core platforms.  While multi-core platforms offer new dimensions to exploit application parallelism, the two major challenges are:  1. software is required to expose a sufficient number of concurrent threads  2. the OS and multi-core hardware have to coordinate and manage their run-time execution.

4 Introduction(2/3)  In this paper we focus on the latter problem: even when a program is well parallelized, its performance can still be severely undermined by frequent changes in system resource availability at run-time.  Such resource fluctuations are often unpredictable and lead to partially serialized execution of parallel programs, thus necessitating a dynamic method to address them.  When the number of available cores is reduced due to the activities of other applications (or the OS), some threads will need to be allocated on the same cores. Figure 2: Limited number of processor cores

5 Introduction(3/3)  Resource contention across applications, and their disparate resource utilization, change the actual parallelism that the system platform can truly offer.  In this paper, we propose a framework that implements a run-time parallelism adaptation method in the presence of continuously changing availability of system resources.  Among all the shared resources, we use the most important one, the available processor cores, to demonstrate the workings of this framework.  The technique dynamically adjusts the target application to a proper parallelism level so that its efficiency approaches the optimal for the specific system resource availability at any moment.

6 Dynamic Parallelism Adaptation(1/3)  Figure 3 shows a segment of an application execution, during which the available number of CPUs changed from four to two, and then back to four. Figure 3: Adaptive multithreading

7 Dynamic Parallelism Adaptation(2/3)  The combination of a fixed number of threads and a changing number of cores adds overhead to the total execution time, as shown in the figure.  The application can only adapt its parallelism at certain time intervals,  during which the system state is probed and a decision is made whether or not to change the parallelism level.  It is important to note that the proposed parallelism adaptation is performed not only at the beginning of a parallel loop but also throughout the execution of such performance-critical application phases that support a flexible parallelism level.  Clearly, the granularity of adaptation determines how effectively the proposed method can track the dynamically changing environment.
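The probing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual mechanism (which is platform-specific); `available_cores` is a hypothetical helper name:

```python
import os

def available_cores():
    # Hypothetical helper: probe how many processor cores the OS
    # currently makes available to this process.
    try:
        # On Linux, the affinity mask shrinks when cores are
        # taken away from the application by the OS.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Portable fallback: total core count of the machine.
        return os.cpu_count() or 1
```

A call like this would sit at each adaptation point, where its result drives the decision whether to change the parallelism level.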

8 Dynamic Parallelism Adaptation(3/3)  Several mechanisms are needed to implement the proposed framework:  First, mechanisms are required to monitor and capture the resource availability.  This information must be made readily available at the adaptation points in order to perform the adaptation.  Second, a simple and generic loop transformation is performed which, based on the desired granularity of adaptation and the total number of loop iterations, partitions the loop iteration space into identical adaptation units.  This transformation is somewhat similar in spirit to loop unrolling.  Third, code that determines and adapts the best parallelism level is needed.  This decision-making process is executed at the adaptation points.
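The loop transformation in the second step can be illustrated with a small sketch. `partition_into_aus` is an illustrative name (not from the paper), and the unit size stands in for the desired adaptation granularity:

```python
def partition_into_aus(n_iterations, au_size):
    # Split the iteration space [0, n_iterations) into equally
    # sized adaptation units; the last unit absorbs any remainder.
    aus = []
    start = 0
    while start < n_iterations:
        end = min(start + au_size, n_iterations)
        aus.append(range(start, end))
        start = end
    return aus
```

Each returned range is one adaptation unit, so an adaptation decision can be made before every unit rather than only once per loop.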

9 Adaptation Units and Points(1/4)  To be able to dynamically adjust application parallelism, the application needs to provide a number of cut-off “points” within a long parallel loop/function  where the resource availability is checked and a decision is made as to the best level of current parallelism.  The instantiation of such adaptation points can be achieved in a number of ways.  The loop is segmented into a fixed number of smaller loop iterations,  each of which operates on a subset of the original iteration space, as illustrated in Figure 4. Figure 4: Adaptation points instantiation

10 Adaptation Units and Points(2/4)  The smaller loops into which the original loop is segmented represent the adaptation unit (AU).  AUs are the actual forms of loop execution at different time segments.  The AU is assumed to be parallelized using different levels of parallelism,  i.e. several copies of the AU code exist, each operating with a pre-determined number of threads. For instance, a two-threaded and a four-threaded version of the AU may be present and chosen at different moments.  The control algorithm decides which AU implementation to choose based on the resource utilization at the moment.
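A pool of AU versions like the one described above can be sketched as follows; this is an assumption-laden illustration in which a Python thread pool stands in for the pre-determined thread counts, and `make_au_version` / `au_pool` are illustrative names:

```python
from concurrent.futures import ThreadPoolExecutor

def make_au_version(num_threads):
    # Build one AU implementation that runs its chunk of the
    # iteration space with a fixed, pre-determined thread count.
    def au(work_fn, iterations):
        # Strided split of the AU's iterations across the threads.
        chunks = [iterations[i::num_threads] for i in range(num_threads)]
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            parts = pool.map(lambda c: [work_fn(i) for i in c], chunks)
        return [r for part in parts for r in part]
    return au

# Several copies of the same loop body, each with a fixed parallelism level.
au_pool = {1: make_au_version(1), 2: make_au_version(2), 4: make_au_version(4)}
```

The control algorithm would then simply index into `au_pool` with the chosen parallelism level.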

11 Adaptation Units and Points(3/4)  The different AU versions can be obtained either through manual parallelization or by using a parallelizing compiler.  This can be easily achieved with most compiler-parallelized programs.  Special code is inserted prior to the AU - this code serves the role of the adaptation point.  It rapidly checks system resource utilization and makes a decision, based on the specific application and machine configuration, as to whether or not a change of parallelism is required.  This decision making involves various trade-offs and needs to be performed in an extremely efficient manner.

12 Adaptation Units and Points(4/4)  The structure of the adaptation logic is illustrated in Figure 5.  When the logic determines that there is a need to change the parallelism level, it selects from the pool of parallel AU implementations the one that best fits the current scenario and resumes execution with that implementation until the next adaptation point. Figure 5: Parallelism Adaptation - Selecting the best AU
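The selection logic sketched in Figure 5 can be put together as a small control loop; this is a hedged sketch under the assumption that the AU pool is keyed by thread count, and `run_adaptive_loop` / `probe_cores` are illustrative names:

```python
def run_adaptive_loop(work_fn, n_iterations, au_size, au_pool, probe_cores):
    # At each adaptation point, probe core availability and run the
    # next adaptation unit with the best-fitting AU implementation.
    results = []
    start = 0
    while start < n_iterations:              # adaptation point
        cores = probe_cores()
        # Choose the largest thread count that does not exceed the
        # currently available cores (fall back to the smallest AU).
        level = max((t for t in au_pool if t <= cores), default=min(au_pool))
        end = min(start + au_size, n_iterations)
        results.extend(au_pool[level](work_fn, list(range(start, end))))
        start = end
    return results
```

Because the probe runs once per AU rather than once per iteration, the decision-making cost is amortized over the whole unit, which matches the efficiency requirement stated on the previous slide.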

13 Experimental Results(1/2)  We start our experimental study by analyzing the overhead of traditional non-adaptive multithreading in dynamic environments where core availability changes.  Figure 6 shows the performance of 16-threaded execution of all benchmarks on 8, 4, and 2 available processors.  The performance is measured in CPU cycles (the maximum across all participating processors). Figure 6: Performance overhead for a 16-threaded implementation on 8-, 4-, and 2-core platforms

14 Experimental Results(2/2)  The bars represent the percentage increase in cycles compared to the one-thread-per-core mapping for the same problem size.  The left bar represents the percentage increase in cycles for 16 threads with 8 available cores - the comparison is against 8 threads on 8 cores;  the middle bar is for 16 threads with 4 available cores, compared to 4 threads on 4 cores;  the right bar represents 16 threads running on 2 available cores, compared to 2 threads on 2 cores.  A clear trend can be seen: the performance overhead increases with the number of threads mapped to a single core.
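The quantity shown by each bar reduces to a simple formula; the function name and the example cycle counts below are illustrative, not taken from the paper's measurements:

```python
def overhead_percent(cycles_oversubscribed, cycles_baseline):
    # Percentage increase in cycles of an oversubscribed run
    # (e.g. 16 threads on 8 cores) over the one-thread-per-core
    # baseline (e.g. 8 threads on 8 cores).
    return 100.0 * (cycles_oversubscribed - cycles_baseline) / cycles_baseline
```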

15 Conclusion  In this paper, we have addressed the problem of execution efficiency of multithreaded programs in multi-core platforms.  We show that system resource availability, such as processor cores, can significantly impact the performance efficiency of multithreaded programs.  Our experiments demonstrate significantly improved execution efficiency as a result of tracking and adapting to dynamically changing resource availability.