Quadrisection-Based Task Mapping on Many-Core Processors for Energy-Efficient On-Chip Communication Nithin Michael, Yao Wang, G. Edward Suh and Ao Tang.

Presentation transcript:

Quadrisection-Based Task Mapping on Many-Core Processors for Energy-Efficient On-Chip Communication
Nithin Michael, Yao Wang, G. Edward Suh and Ao Tang
Cornell University
N. Michael, Y. Wang, G. E. Suh & A. Tang | Quadrisection-Based Task Mapping

Task Mapping Problem
Introduction | Problem Formulation | Quadrisection | Evaluation | Conclusions
• Task mapping for many-core processors
• Our focus: mapping communicating application tasks to cores to minimize communication power consumption
[Figure: DSP]

Assumptions
• Applications of interest consist of multiple tasks communicating via message passing, represented by a communication task graph (CTG)
• The number of tasks is less than the number of cores
• The network is assumed to be a mesh with dimension-ordered routing
[Figure: CTG for DSP]
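The assumptions above can be made concrete with a small sketch. The task names and rates in `ctg` are hypothetical, and `xy_route` is an illustrative helper that is not taken from the slides. It assumes the common X-then-Y form of dimension-ordered routing, under which the hop count between two mesh nodes equals their Manhattan distance.

```python
# Hypothetical CTG: (source task, destination task) -> communication rate.
ctg = {("src", "fft"): 4.0, ("fft", "filter"): 2.0}


def xy_route(src, dst):
    """Nodes visited from src to dst on a mesh, moving along X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hops(src, dst):
    """Number of links traversed by the dimension-ordered route."""
    return len(xy_route(src, dst)) - 1


route = xy_route((0, 0), (2, 1))
```

Because the route is fully determined by the two endpoints, the communication cost of a mapping depends only on where the tasks are placed.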

Problem Difficulty
• A(i, j): communication rate from task i to task j
• B(i, j): Manhattan distance between node i and node j of the NoC
• [Equation: cost function over A, B and p; not captured in the transcript]
• p(i), p(j): nodes to which task i and task j are mapped
• The cost represents the dynamic power consumption
• Task mapping is exactly the Quadratic Assignment Problem
  • NP-hard
  • Difficult to approximate
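The objective itself is missing from the transcript. Assuming it takes the standard Quadratic Assignment form, cost(p) = sum over all task pairs (i, j) of A(i, j) * B(p(i), p(j)), it can be evaluated as in the sketch below. The function names and the three-task example are illustrative assumptions, not data from the presentation.

```python
def manhattan(a, b):
    """B(a, b): Manhattan distance between mesh nodes a and b."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mapping_cost(A, p):
    """Assumed QAP-style cost: sum of A(i, j) * B(p(i), p(j)).

    A maps (task i, task j) to a communication rate.
    p maps each task to the (x, y) mesh node it is placed on.
    """
    return sum(rate * manhattan(p[i], p[j]) for (i, j), rate in A.items())


# Hypothetical three-task CTG placed on a 2-by-2 mesh.
A = {(0, 1): 5, (1, 2): 1, (0, 2): 2}
p = {0: (0, 0), 1: (1, 0), 2: (0, 1)}
cost = mapping_cost(A, p)  # 5*1 + 1*2 + 2*1 = 9
```

With n tasks and m nodes there are m!/(m-n)! candidate mappings, which is why exhaustive search is only feasible for very small instances.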

State of the Art
• NMAP: initial greedy map followed by pairwise swaps
  • Computation intensive, slow
• BMAP: clustering-based approach merging communicating tasks
  • Poor quality (high cost value)
• Bisection: recursively divide the task graph into equal halves to minimize the communication between them [1]
[1] K. Srinivasan and K. S. Chatha. A Technique for Low Energy Mapping and Routing in Network-on-Chip Architectures. ISLPED 2005.
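To illustrate the "pairwise swaps" idea mentioned for NMAP, here is a minimal local-search sketch. It is not NMAP itself: the greedy initial map is omitted, only swaps between placed tasks are tried, and the cost is assumed to be rate times Manhattan distance. All names and the four-task example are hypothetical.

```python
from itertools import combinations


def cost(A, p):
    """Assumed cost: communication rate times Manhattan distance, summed over edges."""
    return sum(r * (abs(p[i][0] - p[j][0]) + abs(p[i][1] - p[j][1]))
               for (i, j), r in A.items())


def swap_refine(A, p):
    """Keep swapping the nodes of two tasks while a swap strictly lowers the cost."""
    p = dict(p)  # work on a copy so the caller's mapping is untouched
    best = cost(A, p)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(list(p), 2):
            p[a], p[b] = p[b], p[a]
            c = cost(A, p)
            if c < best:
                best, improved = c, True
            else:
                p[a], p[b] = p[b], p[a]  # undo a swap that did not help
    return p, best


# Hypothetical chain of four tasks, badly placed on a 2-by-2 mesh.
A = {(0, 1): 3, (1, 2): 3, (2, 3): 3}
start = {0: (0, 0), 1: (1, 1), 2: (0, 1), 3: (1, 0)}
refined, refined_cost = swap_refine(A, start)
```

Every pass re-evaluates the cost for each task pair, which hints at why swap-based refinement becomes slow as the number of tasks grows.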

Can we improve Bisection?
• Key observation: mapping is a 2D problem
• Bisection uses vertical and horizontal min-cuts based on the Fiduccia-Mattheyses (FM) algorithm
• FM is applied in 1D, so the derived local optimum does not consider the 2D mapping
[Figure: FM]

Quadrisection
• Main idea: apply the FM algorithm to the 2D mapping
  • Explores a different part of the design space
  • The local optimum considers the 2D mapping
• Further optimize subpanel placement by brute-force search
[Figure: FM; 1 hop, 2 hops]
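The brute-force placement step can be sketched as follows, under assumptions the slides do not spell out: the tasks are already split into four groups, `traffic` holds the total rate between each pair of groups, and adjacent quadrants are 1 hop apart while diagonal quadrants are 2 hops apart. The function name and the example traffic values are hypothetical.

```python
from itertools import permutations

# The four quadrants of a panel, as (column, row) positions.
QUADRANTS = [(0, 0), (1, 0), (0, 1), (1, 1)]


def place_groups(traffic):
    """Try all 24 group-to-quadrant assignments and keep the cheapest one.

    traffic maps (group g, group h) to the total rate between them,
    with groups numbered 0..3.
    """
    best_cost, best_place = None, None
    for perm in permutations(QUADRANTS):
        c = sum(rate * (abs(perm[g][0] - perm[h][0]) + abs(perm[g][1] - perm[h][1]))
                for (g, h), rate in traffic.items())
        if best_cost is None or c < best_cost:
            best_cost, best_place = c, perm
    return best_cost, best_place


# Hypothetical inter-group traffic: a heavy chain 0-1-2-3 plus one light edge.
traffic = {(0, 1): 8, (1, 2): 8, (2, 3): 8, (0, 2): 1}
best_cost, placement = place_groups(traffic)
```

In this example the heavy edges end up on 1-hop neighbors and only the light edge crosses the 2-hop diagonal. With four groups the search space is only 4! = 24 assignments, so exhaustive search is cheap at this level.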

Evaluation Methodology
• Evaluations were performed numerically and using simulation
• Performance was evaluated on mesh networks for
  • Synthetically generated random task graphs
  • Multimedia benchmarks
• Simulations were performed using a cycle-level simulator, and power consumption was evaluated using Orion2

Cost: Synthetic Task Graph
• 100 randomly generated CTGs were mapped to a 4-by-4 mesh network

Power: Multimedia Benchmarks
[Figure: Communication Power]

Execution Time
• Quadrisection's runtime is comparable with previous approaches

Conclusions
• Quadrisection outperforms commonly used mapping approaches by exploiting the 2D structure
• The run-time of Quadrisection is comparable to the state of the art
• A useful addition to the NoC designer's toolkit

Cost: Multimedia Benchmarks
[Figure: Cost]