A Study on Intelligent Run-time Resource Management Techniques for Large Tiled Multi-Core Architectures Dong-In Kang, Jinwoo Suh, Janice O. McMahon, and Stephen P. Crago

Presentation transcript:

A Study on Intelligent Run-time Resource Management Techniques for Large Tiled Multi-Core Architectures
Dong-In Kang, Jinwoo Suh, Janice O. McMahon, and Stephen P. Crago
University of Southern California, Information Sciences Institute
Oct. 25, 2008
This effort was sponsored by Defense Advanced Research Projects Agency (DARPA) through the Dept. of Interior National Business Center (NBC), under grant number NBCH
The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsement, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA), Dept. of Interior, NBC, or the U.S. Government.

2 Outline
- Motivation and System Framework
- Simulation Model
- Applications and Results
  - Mapping Performance
- Future Work and Conclusion

3 Motivation
- Rising processor density, multi-core architecture
  - Tilera TILE64 processor
  - Intel 80-core Teraflops prototype processor
  - In the future: 32x32 or larger
- Dynamic applications
  - With large-scale architectures, multiple applications will share tile array; need to arbitrate resources between applications
  - Requirements are unpredictable and changing in response to environment
  - Need for dynamic mapping, allocation/deallocation, resource management
- Need for more intelligent run-time systems
  - Better resource allocation for efficiency
  - New system capabilities to alleviate programming burden
  - Dynamic reactivity for changing scenarios
  - Complex optimization space

4 Research Goal: Intelligent Run-Time Resource Allocation
- Application of knowledge-based and learning techniques to the problem of dynamic resource allocation in a tiled architecture
  - Planning and optimization of resource usage given application constraints
  - Knowledge derived from dynamic performance profiling, modeling, and prediction
  - Run-time learning of resource allocation trade-offs
- Proof-of-concept demonstration: simulation of application workload to compare performance of manual and automatic resource allocations
  - Define representative application
  - Develop scalable simulation
  - Define resource mapping problem and apply intelligent technique
- Current work: definition of problem, construction of simulator, and preliminary study to quantify performance effects

5 System Framework
[Figure: system framework diagram. Labels recovered from the slide:]
- Application Environment
- Application-level Task Composition: generate tasks that meet application goals in current environment
- Task graph, performance requirements
- Machine-level Resource Allocation: task assignment to maximize machine performance
- Processing Environment
- Knowledge-base: performance monitoring/knowledge buildup, profiling
- Simulation environment
- Research focus: prototype experiments in machine-level resource allocation

7 ISI Simulator for Multi-Core Architecture
- Event-driven simulator
  - In SystemC
  - Based on ARMn multiprocessor simulator from Princeton Univ.
- Generic processor module
  - Proportional share resource scheduler
  - Memory latency is implicit in computation time
  - Parameterized specification: CPU speed, CPU-network interface speed, etc.
- Network
  - One network, 2-D mesh
  - 5x5 crossbar switch
  - Packet switching model
  - Store and forward routing
  - Parameterized specification: speed, buffer size, packet overhead, etc.
- Profile info
  - Per link: data amount, blocking time
  - Per packet: latency, blocking time
  - Per task: event log, blocking time
  - Per CPU: utilization
[Figure: a tile consisting of a PE and a switch]
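The network model on this slide (2-D mesh, packet switching, store and forward routing) can be illustrated with a small contention-free latency estimate. This is only a sketch under stated assumptions, not the ISI simulator's code: it assumes dimension-ordered X-Y routing (which a later slide says is assumed), and the function names and default parameter values (`link_bytes_per_cycle`, `packet_overhead_bytes`, `switch_delay_cycles`) are hypothetical stand-ins for the "speed" and "packet overhead" parameters the slide mentions.

```python
from typing import List, Tuple

Tile = Tuple[int, int]  # (x, y) coordinate of a tile in the 2-D mesh


def xy_route(src: Tile, dst: Tile) -> List[Tile]:
    """Tiles visited by dimension-ordered X-Y routing, src and dst included."""
    x, y = src
    path = [(x, y)]
    step = 1 if dst[0] >= x else -1
    while x != dst[0]:          # travel along X first
        x += step
        path.append((x, y))
    step = 1 if dst[1] >= y else -1
    while y != dst[1]:          # then along Y
        y += step
        path.append((x, y))
    return path


def store_and_forward_latency(
    src: Tile,
    dst: Tile,
    payload_bytes: int,
    link_bytes_per_cycle: int = 4,    # hypothetical link speed
    packet_overhead_bytes: int = 8,   # hypothetical header size
    switch_delay_cycles: int = 1,     # hypothetical per-switch delay
) -> float:
    """Contention-free latency in cycles for one packet.

    With store and forward routing, each switch receives the whole
    packet before forwarding it, so the full serialization time is
    paid on every hop.
    """
    hops = len(xy_route(src, dst)) - 1
    packet_bytes = payload_bytes + packet_overhead_bytes
    per_hop = packet_bytes / link_bytes_per_cycle + switch_delay_cycles
    return hops * per_hop


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(store_and_forward_latency((0, 0), (2, 1), payload_bytes=56))
```

The estimate ignores blocking time, which the simulator profiles per link and per packet; under background load the real latency would be higher.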

9 Example Application: Radar Resource Management using IRT
- Drive IRT spreadsheets and morphing scenarios with parameters from a notional radar resource manager
- GMTI Task Graph input to simulator
- FAT requirements modeled with statistically varying processor loads, message traffic
[Figure: GMTI task graph. Stage labels and annotations recovered from the slide:]
- Subband Analysis
- Time Delay & Equalization: 146 MFLOPs; P = C = T = (N_srg x N_ch x N_pri)
- Adaptive Beamforming: 42 MFLOPs; P = C = T = (N_ch x N_srg x N_pri)
- Pulse Compression: 112 MFLOPs; P = C = T = (N_bm x N_srg x N_pri)
- Doppler Filtering: 32.4 MFLOPs; P = C = T = (N_bm x N_srg x N_pri)
- STAP: 47 MFLOPs
- Subband Synthesis
- Data Combination
- Target Detection: 4.7 MFLOPs
- 3D Grouping
- Parameter Estimation: 189 MFLOPs
- Further data-size annotations in the figure, in slide order:
  - P = C = T = (N_dop x (N_bm * N_stage) x N_srg)
  - P = N_cnb x N_srg x N_dop, C = T = (N_cnb x N_srg x N_dop)
  - P = C = T = (N_cnb x N_srg x N_dop)
  - P = C = T = (N_cnb x N_rg x N_dop)
- Data Independent; N_sub

11 Example: GMTI Application Mapping
[Figure: GMTI task graph (Subband Analysis; Time Delay & Equalization, 146 MFLOPs; Adaptive Beamforming, 42 MFLOPs; Pulse Compression, 112 MFLOPs; Doppler Filtering, 32.4 MFLOPs; STAP, 47 MFLOPs; Subband Synthesis; Data Combination) mapped onto the tile array. Panels: (a) Clustered, (b) Fragmented]
- Interference from other applications is modeled by a random traffic generator
Parameter                    Values
GMTI resource requirements   128 tiles, 868 tiles
Free tiles                   Fragmented, Clustered
Mapping technique            Random, Heuristic
* Total 16 cases
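To make the clustered versus fragmented comparison concrete, the sketch below places a small task chain at random on two sets of free tiles and compares a simple communication cost. It is an illustration only: the cost metric (traffic times Manhattan hop count), the abbreviated task names, the traffic volumes, and the tile coordinates are all assumptions, and the deck's heuristic mapper is not reproduced because the slides do not describe it.

```python
import random
from typing import Dict, List, Tuple

Tile = Tuple[int, int]


def random_mapping(tasks: List[str], free_tiles: List[Tile],
                   rng: random.Random) -> Dict[str, Tile]:
    """Assign each task to a distinct free tile chosen uniformly at random."""
    if len(tasks) > len(free_tiles):
        raise ValueError("not enough free tiles for the task set")
    return dict(zip(tasks, rng.sample(free_tiles, len(tasks))))


def comm_cost(mapping: Dict[str, Tile],
              edges: Dict[Tuple[str, str], int]) -> int:
    """Sum of (traffic x Manhattan hop count) over all task-graph edges."""
    total = 0
    for (a, b), traffic in edges.items():
        (ax, ay), (bx, by) = mapping[a], mapping[b]
        total += traffic * (abs(ax - bx) + abs(ay - by))
    return total


if __name__ == "__main__":
    # Four GMTI stages in a chain; traffic volumes are made up.
    tasks = ["TDE", "ABF", "PC", "DF"]
    edges = {("TDE", "ABF"): 10, ("ABF", "PC"): 10, ("PC", "DF"): 10}

    clustered = [(0, 0), (1, 0), (0, 1), (1, 1)]    # one 2x2 block
    fragmented = [(0, 0), (7, 0), (0, 7), (7, 7)]   # corners of an 8x8 mesh

    rng = random.Random(0)
    for name, tiles in (("clustered", clustered), ("fragmented", fragmented)):
        mapping = random_mapping(tasks, tiles, rng)
        print(name, comm_cost(mapping, edges))
```

With these tile sets every edge spans at most 2 hops in the clustered case and at least 7 hops in the fragmented case, so the fragmented cost is higher whichever random placement is drawn.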

Tile GMTI Performance
With lower GMTI allocation, latency is:
- more sensitive to background network load when resources are fragmented
[Figure panels: Fragmented Resources, Clustered Resources]

Tile GMTI Performance
With higher GMTI allocation, latency is:
- less sensitive to background network load
- more sensitive to task mapping algorithm than fragmentation of the resources
[Figure panels: Heuristic Mapping, Random Mapping]

14 Run-time Improvement
- Start with random mapping
- Knowledge
  - Through run-time profiling
    - Communication amount at a network port
    - Waiting time per flit at a network port
    - Point-to-point communication between two tasks
    - Estimated flow graph
    - Routing path
- Two algorithms
  - Hot-spot remover
  - Genetic algorithm
- Network-oriented knowledge
  - Per-flit waiting time, set of p2p communications through the port
  - X-Y routing is assumed
  - Only top 100 hot-spots are kept
  - Flow-graph is not derived
  - Room for future improvement
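The network-oriented knowledge listed on this slide (per-flit waiting time and the set of point-to-point communications through each port, with only the top 100 hot-spots kept) could be held in a structure like the one sketched below. The class, function, and field names, as well as the `(x, y, direction)` port identifier, are hypothetical; the slides do not show how the knowledge store was actually implemented.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Port = Tuple[int, int, str]  # hypothetical port id: (tile x, tile y, direction)


@dataclass
class PortProfile:
    """Profile data accumulated for one network port."""
    flits: int = 0
    wait_cycles: int = 0
    comms: Set[Tuple[str, str]] = field(default_factory=set)  # (src task, dst task)

    @property
    def wait_per_flit(self) -> float:
        return self.wait_cycles / self.flits if self.flits else 0.0


def record(profiles: Dict[Port, PortProfile], port: Port,
           src_task: str, dst_task: str, flits: int, wait_cycles: int) -> None:
    """Add one profiled transfer through `port` to the knowledge store."""
    profile = profiles.setdefault(port, PortProfile())
    profile.flits += flits
    profile.wait_cycles += wait_cycles
    profile.comms.add((src_task, dst_task))


def top_hot_spots(profiles: Dict[Port, PortProfile],
                  keep: int = 100) -> List[Tuple[Port, PortProfile]]:
    """Return the `keep` ports with the highest per-flit waiting time."""
    return heapq.nlargest(keep, profiles.items(),
                          key=lambda item: item[1].wait_per_flit)


if __name__ == "__main__":
    profiles: Dict[Port, PortProfile] = {}
    record(profiles, (0, 0, "E"), "TDE", "ABF", flits=10, wait_cycles=50)
    record(profiles, (1, 0, "N"), "ABF", "PC", flits=20, wait_cycles=20)
    for port, profile in top_hot_spots(profiles, keep=1):
        print(port, profile.wait_per_flit, sorted(profile.comms))
```

Keeping the set of communications per port is what lets a remapping step pick a specific task pair to move away from a congested port.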

15 Knowledge Usage
- Hot-spot remover algorithm
  - Relocate a communication going through a hotspot
  - Knowledge is used to estimate whether the relocation would create another hotspot
  - If there is no corresponding information in the knowledge store, simply use the aggregated network load going through the hotspot
- Genetic algorithm
  - Previous mapping is included in the initial population
  - Fitness function is the inverse of the estimated end-to-end latency
  - End-to-end latency is estimated using the knowledge and a heuristic
  - Algorithm
    - Initial population
    - Calculate fitness values
    - Cross-over
    - Mutate
    - Repeat until pre-set condition
    - Choose a mapping with the highest fitness
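The genetic algorithm steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation. Several choices here are assumptions: a mapping is encoded as a permutation of the free tiles (task i runs on the tile at index i), the end-to-end latency estimate is replaced by a traffic-weighted Manhattan distance, the crossover is a simple prefix-plus-fill operator, and the population size, mutation rate, and fixed generation count stand in for the unspecified "pre-set condition".

```python
import random
from typing import Dict, List, Sequence, Tuple

Tile = Tuple[int, int]
Edges = Dict[Tuple[int, int], float]  # (src task index, dst task index) -> traffic


def estimated_latency(perm: Sequence[Tile], edges: Edges) -> float:
    """Stand-in estimate (assumption): traffic-weighted Manhattan distance."""
    return sum(w * (abs(perm[a][0] - perm[b][0]) + abs(perm[a][1] - perm[b][1]))
               for (a, b), w in edges.items())


def fitness(perm: Sequence[Tile], edges: Edges) -> float:
    """Inverse of the estimated latency, guarded against division by zero."""
    return 1.0 / max(estimated_latency(perm, edges), 1e-9)


def crossover(p1: Sequence[Tile], p2: Sequence[Tile], rng: random.Random) -> List[Tile]:
    """Keep a prefix of p1, then fill with the remaining tiles in p2's order."""
    cut = rng.randrange(1, len(p1))
    head = list(p1[:cut])
    used = set(head)
    return head + [tile for tile in p2 if tile not in used]


def mutate(perm: Sequence[Tile], rng: random.Random, rate: float = 0.2) -> List[Tile]:
    """With probability `rate`, swap the tiles of two positions."""
    perm = list(perm)
    if rng.random() < rate:
        i, j = rng.sample(range(len(perm)), 2)
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def genetic_remap(previous: Sequence[Tile], edges: Edges, generations: int = 50,
                  pop_size: int = 20, seed: int = 0) -> List[Tile]:
    rng = random.Random(seed)
    # Initial population: the previous mapping plus random permutations.
    population = [list(previous)]
    while len(population) < pop_size:
        population.append(rng.sample(list(previous), len(previous)))
    for _ in range(generations):  # pre-set condition: fixed generation count
        population.sort(key=lambda p: fitness(p, edges), reverse=True)
        parents = population[: pop_size // 2]  # fittest half survives unchanged
        children = [mutate(crossover(*rng.sample(parents, 2), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    # Choose the mapping with the highest fitness.
    return max(population, key=lambda p: fitness(p, edges))


if __name__ == "__main__":
    previous = [(0, 0), (3, 3), (0, 3), (3, 0), (1, 1), (2, 2)]
    edges = {(0, 1): 5.0, (1, 2): 5.0, (2, 3): 5.0}
    best = genetic_remap(previous, edges)
    print(best, estimated_latency(best, edges))
```

Because the previous mapping seeds the population and the fittest half is carried over unchanged each generation, the returned mapping is never worse than the starting one under this estimate.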

16 Results (128 core GMTI)

17 Results (868 core GMTI)

18 Outline
- Introduction
- Simulation Model
- Applications and Results
  - Mapping Performance
- Future Work and Conclusion

19 Future Work
- Run-time profiling technique
  - Low overhead
  - Detecting network load imbalance
  - Detecting communication patterns, dependencies
  - Detecting processor load imbalance
- Gradual morphing technique
  - Intelligent partial remapping of the application
    - Hot spot removal, load balancing
  - Adaptive changes of parallelism of the application
    - Adaptive to the status of free resources
- Run-time support for gradual morphing
  - Task migration
  - Dynamic changes of parallelism of the application
- Dynamic adaptation of code
  - Expression of dynamic changes of code at run-time
  - Dynamic code generation

20 Conclusion
- Large multi-tiled architectures will require intelligent run-time systems for resource allocation and dynamic mapping
  - Future applications will be complex and will need to respond to unexpected events
  - Future, large-scale multi-tiled architectures will require application sharing and arbitration of resources
- In this research, we have explored the effect of dynamic application variations on multi-tiled performance
  - Dynamic IRT with representative mode changes and load variations causes resource fragmentation over time
  - Application mappings must be done carefully in order to provide robust behavior
- Future work will involve the application of knowledge-based and machine learning algorithms to the profile-based dynamic resource management