Dynamic Thread Mapping for High-Performance, Power-Efficient Heterogeneous Many-core Systems Guangshuo Liu Jinpyo Park Diana Marculescu Presented By Ravi.

Presentation transcript:

Dynamic Thread Mapping for High-Performance, Power-Efficient Heterogeneous Many-core Systems Guangshuo Liu Jinpyo Park Diana Marculescu Presented By Ravi Teja Arrabolu (vxa132930)

Introduction The problem of dynamic thread mapping in heterogeneous many-core systems is addressed via an efficient algorithm that maximizes performance under power constraints. Heterogeneous many-core systems are composed of multiple core types with different power-performance characteristics. This paper proposes an iterative approach, with runtime bounded by $O(n^2/m)$, for mapping multi-threaded applications onto n cores comprising m core types.

Heterogeneous Core Types A core type is defined by a tuple of microarchitecture features and the associated nominal voltage/frequency. For example, cores that differ in architectural parameters, such as issue width, cache size, or number of functional units, are considered different core types. In addition, even if two cores have identical microarchitectures but different nominal frequencies, they are considered distinct core types.

General Notations The throughput matrix is denoted by $T \in \mathbb{R}^{n \times m}$. Each element $T_{ij}$ represents the throughput of thread i running on a core of type j. Throughput is defined as the total number of instructions committed per unit of time. The power matrix is denoted by $P \in \mathbb{R}^{n \times m}$. Each element $P_{ij}$ represents the total power consumption of thread i running on a core of type j. The assignment matrix $X \in \{0, 1\}^{n \times m}$ represents the assignment of threads to core types. Thread i is mapped to core type j if and only if $X_{ij} = 1$.

Constraints Objective function: maximize performance, defined as the total throughput, maximize $\sum_{i,j} T_{ij} X_{ij}$. First, the power budget needs to be satisfied, $\sum_{i,j} P_{ij} X_{ij} \le$ total TDP of cores. Second, a thread can be mapped to only a single core, $\forall i: \sum_{j} X_{ij} \le 1$. The core count of each type is given, $\forall j: \sum_{i} X_{ij} \le \text{core count}_j$. Finally, the mapping is a 0-1 assignment problem, and therefore $\forall i, j: X_{ij} \in \{0, 1\}$.
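To make the formulation concrete, here is a minimal Python sketch that evaluates the objective and checks the four constraints for a candidate assignment. The function names, the example matrices, and the budget value are illustrative assumptions, not values from the paper, and the sketch assumes the inequalities are "at most" bounds.

```python
def total_throughput(T, X):
    """Objective: sum of T[i][j] * X[i][j] over all threads i and core types j."""
    return sum(T[i][j] * X[i][j] for i in range(len(T)) for j in range(len(T[0])))


def is_feasible(P, X, power_budget, core_count):
    """Check the constraints for assignment X (assumed to be 'at most' bounds)."""
    n, m = len(X), len(X[0])
    # 0-1 assignment: every entry must be 0 or 1.
    if any(X[i][j] not in (0, 1) for i in range(n) for j in range(m)):
        return False
    # Each thread is mapped to at most one core type.
    if any(sum(X[i]) > 1 for i in range(n)):
        return False
    # No core type hosts more threads than it has cores.
    if any(sum(X[i][j] for i in range(n)) > core_count[j] for j in range(m)):
        return False
    # Total power must stay within the budget.
    total_power = sum(P[i][j] * X[i][j] for i in range(n) for j in range(m))
    return total_power <= power_budget


# Hypothetical example: 3 threads, 2 core types (1 core of type 0, 2 of type 1).
T = [[2.0, 1.0], [1.5, 1.25], [1.0, 0.75]]
P = [[4, 2], [3, 2], [3, 1]]
X = [[1, 0], [0, 1], [0, 1]]
print(total_throughput(T, X), is_feasible(P, X, power_budget=8, core_count=[1, 2]))
```

An exact solver would search over all feasible X for the largest objective value; a checker like this is only useful for validating what a solver or heuristic returns.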

Experiment Setup The Sniper multi-core simulator was used to conduct the simulation tests. Sniper supports heterogeneous configurations and dynamic thread migration, and it employs McPAT as the power estimation engine. All L1 and L2 caches are private to the cores and there is no L3 cache. When the number of cores of each type is identical, the network hierarchy is configured as a mesh of tiles. If the core count differs between types, some tiles will have fewer core types.

MAXIMIZATION-THEN-SWAPPING (MTS)

IMPLEMENTATION Maximization: The heuristic maps threads to the available cores in descending order of throughput, starting with the threads that have the highest throughput. Swapping: The mapped threads are checked against the power budget at their allotted cores; if the budget is not satisfied, they are remapped to cores of an appropriate lower or higher type. Note: the swaps are registered once no more changes occur.
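The two phases can be sketched in Python as follows. This is an illustrative reading of the slide, not the paper's exact procedure: the function name `mts_sketch`, the rule for choosing a change (least throughput lost per watt saved), and the example numbers are all assumptions.

```python
def mts_sketch(T, P, core_count, power_budget):
    """Return mapping[i] = core type of thread i (None if left unmapped)."""
    n, m = len(T), len(core_count)
    free = list(core_count)          # free cores remaining per type
    mapping = [None] * n

    # Maximization: visit (thread, type) pairs in descending throughput order.
    pairs = sorted(((T[i][j], i, j) for i in range(n) for j in range(m)), reverse=True)
    for _, i, j in pairs:
        if mapping[i] is None and free[j] > 0:
            mapping[i] = j
            free[j] -= 1

    def power():
        return sum(P[i][j] for i, j in enumerate(mapping) if j is not None)

    # Swapping (assumed rule): while over budget, apply the power-saving change
    # that loses the least throughput per watt saved.
    while power() > power_budget:
        best = None                  # (cost per watt, [(thread, new type), ...])
        for a in range(n):
            ja = mapping[a]
            if ja is None:
                continue
            for k in range(m):       # move thread a to a free core of type k
                saved = P[a][ja] - P[a][k]
                if k != ja and free[k] > 0 and saved > 0:
                    cost = (T[a][ja] - T[a][k]) / saved
                    if best is None or cost < best[0]:
                        best = (cost, [(a, k)])
            for b in range(a + 1, n):  # exchange the core types of a and b
                jb = mapping[b]
                if jb is None or jb == ja:
                    continue
                saved = P[a][ja] + P[b][jb] - P[a][jb] - P[b][ja]
                if saved > 0:
                    cost = (T[a][ja] + T[b][jb] - T[a][jb] - T[b][ja]) / saved
                    if best is None or cost < best[0]:
                        best = (cost, [(a, jb), (b, ja)])
        if best is None:
            break                    # no power-saving change left in this sketch
        for i, k in best[1]:
            free[mapping[i]] += 1
            free[k] -= 1
            mapping[i] = k
    return mapping


# Hypothetical example: 2 threads, 2 core types with one core each.
T = [[4.0, 2.0], [3.0, 2.5]]
P = [[10, 4], [8, 3]]
print(mts_sketch(T, P, core_count=[1, 1], power_budget=12))
```

Every accepted change strictly lowers total power, so the loop terminates. If no change can bring the mapping under the budget, this sketch simply stops and returns an over-budget mapping, which a real implementation would need to handle.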

Predicted Result vs. Runtime Result (16 cores) To validate the prediction model, workload mixes of multithreaded benchmarks randomly selected from the PARSEC and SPLASH-2 benchmark suites were used. Results show a prediction error of 7% for power and 14.3% for throughput.

Heterogeneous Core Type Configurations

Throughput / runtime comparison with ILP solver (one thread per core) - 16 Cores

Throughput / runtime comparison with ILP solver (one thread per core) - 64 cores

Scalability The number of cores is scaled from 0 to 1000. For core counts between 100 and 200, the runtime is approximately 1 ms.

History-Based MTS To minimize migration cost, the MTS heuristic is further extended to take the original mapping (history) into account. A history ratio, which ranges from 0 to 1, is introduced: the number of threads considered in the maximization phase is limited to $n \times (1 - \text{history ratio})$. As the history ratio increases, the migration cost can be reduced, since fewer threads will be migrated between the original mapping and the final one.
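As a quick worked example of the history ratio, the snippet below computes how many threads enter the maximization phase. The function name and the rounding to the nearest integer are assumptions; the slide does not say how fractional counts are handled or which threads are selected.

```python
def threads_in_maximization(n, history_ratio):
    """Number of threads considered in the maximization phase: n * (1 - history_ratio)."""
    if not 0.0 <= history_ratio <= 1.0:
        raise ValueError("history_ratio must lie in [0, 1]")
    # Rounding to the nearest integer is an assumption made for this sketch.
    return round(n * (1.0 - history_ratio))


# With 16 threads: a ratio of 0 reconsiders all threads, a ratio of 1 keeps the old mapping.
for ratio in (0.0, 0.5, 0.75, 1.0):
    print(ratio, threads_in_maximization(16, ratio))
```

With 16 threads and a history ratio of 0.75, only 4 threads are reconsidered, so at most those 4 can be placed differently by the maximization phase.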

Throughput-Migration Trade-off

Limitations Load balancing, no simultaneous multithreading: The paper confines itself to a single thread per core, which limits task migration between cores. Sharing a core among threads is not considered, so some tasks may miss deadlines. Fairness: The paper does not consider each process's priority or workload; it assigns processes to cores according to core type, which can cause CPU utilization loss and also affect the QoS of real-time applications.