Low Contention Mapping of RT Tasks onto a TilePro 64 Core Processor
Lei Cui
Outline: 1 Background Introduction (why); 2 Goal; 3 What; 4 How; 5 Experimental Results; 6 Advantages & Limitations; 7 Significance & Improvement

Related Terms & Concepts: predictability; TilePro 64-core processor; contention; static timing analysis; NoC (network-on-chip); IPC (inter-process communication); full-duplex communication; jitter; hyper-period

1 Background Introduction (why) Predictable task execution is essential in an RT system: for RT tasks, an upper bound on execution time can be determined via static timing analysis. However, this analysis may produce unsafe underestimates when the underlying communication paths are not determined, that is, when data from multiple sources share part of a routing path in the NoC and contention results. Contention analysis is therefore required to guarantee safe and reliable bounds. The paper uses a multi-core architecture and maps tasks to cores in such a way that contention is minimized. In addition, the fewer cores there are, the more likely it is that IPC incurs overhead. The causal chain is: contention causes latency, latency leads to unsafe underestimation, and underestimation leads to unpredictability.

1 Background Introduction (cont.)
Drawbacks of previous work:
1) Exhaustive approaches do not scale beyond small NoC mesh sizes, as they can take days to solve mapping layouts.
2) Previous work viewed communication as temporally stateless, which limited the amount of communication that could feasibly be solved.
3) It also produced overly conservative solutions, in that any potential common message route was counted as contention.
Improvement:
1) Separate temporally disjoint messages when analyzing link contention scenarios, thus increasing communication predictability.
Example: two messages, 3 → 8 and 4 → 2, are sent at the same time.
Effect: contention arises on the link 4 → 5, which causes delay, then latency, then missed deadlines, then unbounded time, then unpredictability, and finally loss of RT behavior.
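The link-sharing example on this slide can be reproduced with a small sketch. It assumes a 3x3 mesh with cores numbered row by row (0 to 8) and dimension-order routing that travels along X first and then along Y; neither assumption is stated on the slide, and the function names are hypothetical.

```python
def xy_route(src, dst, width):
    """Return the directed links (from_core, to_core) used from src to dst.

    Assumption: cores are numbered row-major and messages travel along X
    first, then along Y (dimension-order routing).
    """
    x, y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    links = []
    while x != dst_x:  # horizontal hops
        nxt = x + (1 if dst_x > x else -1)
        links.append((y * width + x, y * width + nxt))
        x = nxt
    while y != dst_y:  # vertical hops
        nxt = y + (1 if dst_y > y else -1)
        links.append((y * width + x, nxt * width + x))
        y = nxt
    return links


def shared_links(route_a, route_b):
    """Directed links that appear in both routes (candidates for contention)."""
    return sorted(set(route_a) & set(route_b))


WIDTH = 3  # assumed mesh width for the slide's example
route_a = xy_route(3, 8, WIDTH)
route_b = xy_route(4, 2, WIDTH)
contended = shared_links(route_a, route_b)
print(contended)
```

Under these assumptions the only link used by both messages is 4 → 5, which matches the contended link named in the example. Links are treated as directed, so traffic in opposite directions on a full-duplex link does not count as shared.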

2 Goal Increase the predictability of RT tasks on NoC architectures. Provide models and solutions that lower or minimize contention during communication.

3 What (Contributions)
1) Exhaustive Solver Model: exhaustively maps RT tasks onto cores to minimize contention and improve predictability.
2) SBTF: maps communication traces into time frames so that temporally disjoint communication is analyzed separately.
3) Heuristic Model (HSolver): rapidly discovers low-contention solutions.

4 How – SBTF (Software-Based Temporal Framing): Temporal Framing
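The idea of temporal framing can be illustrated with a minimal sketch. It is not the paper's SBTF implementation: the message format `(send_time, src, dst)`, the fixed `frame_len`, and the rule that a message belongs only to the frame containing its send time are all simplifying assumptions made here.

```python
from collections import defaultdict
from itertools import combinations


def frame_messages(messages, frame_len):
    """Group messages by the time frame that contains their send time."""
    frames = defaultdict(list)
    for send_time, src, dst in messages:
        frames[send_time // frame_len].append((src, dst))
    return dict(frames)


def pairs_to_analyze(frames):
    """Only messages inside the same frame are checked against each other."""
    return [pair for msgs in frames.values() for pair in combinations(msgs, 2)]


# Hypothetical trace: (send_time, src_core, dst_core)
messages = [(0, 3, 8), (2, 4, 2), (40, 4, 2)]
frames = frame_messages(messages, frame_len=10)
pairs = pairs_to_analyze(frames)
print(frames)
print(pairs)
```

Without framing, all three message pairs would be treated as potential contention. With framing, only the two messages sent in frame 0 are compared, because the third message is temporally disjoint from them.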

4 How – Exhaustive Solver Model

4 How – Exhaustive Solver Model (cont.) For example:
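An exhaustive mapping search can be sketched as follows. This is an illustration rather than the solver model from the paper: the cost function (number of directed links shared by each pair of messages), the X-then-Y routing, the row-major core numbering, and all names are assumptions.

```python
from itertools import combinations, permutations


def xy_route(src, dst, width):
    """Directed links from src to dst, X first then Y (assumed routing)."""
    x, y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    links = []
    while x != dst_x:
        nxt = x + (1 if dst_x > x else -1)
        links.append((y * width + x, y * width + nxt))
        x = nxt
    while y != dst_y:
        nxt = y + (1 if dst_y > y else -1)
        links.append((y * width + x, nxt * width + x))
        y = nxt
    return links


def contention_cost(mapping, messages, width):
    """Assumed cost: shared directed links summed over all message pairs."""
    routes = [set(xy_route(mapping[s], mapping[r], width)) for s, r in messages]
    return sum(len(a & b) for a, b in combinations(routes, 2))


def exhaustive_map(tasks, messages, width, height):
    """Try every placement of tasks on distinct cores and keep the cheapest."""
    best_cost, best_mapping = None, None
    for placement in permutations(range(width * height), len(tasks)):
        mapping = dict(zip(tasks, placement))
        cost = contention_cost(mapping, messages, width)
        if best_cost is None or cost < best_cost:
            best_cost, best_mapping = cost, mapping
    return best_cost, best_mapping


tasks = [2, 3, 4, 8]            # hypothetical task ids
messages = [(3, 8), (4, 2)]     # (sender task, receiver task)
naive = {t: t for t in tasks}   # task i placed on core i
naive_cost = contention_cost(naive, messages, 3)
best_cost, best_mapping = exhaustive_map(tasks, messages, 3, 3)
print(naive_cost, best_cost, best_mapping)
```

The number of placements grows factorially with the number of tasks and cores, which is consistent with the scaling drawback noted for exhaustive approaches.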

4 How – Heuristic Model (HSolver)

4 How – Heuristic Model (HSolver, cont.)
Example: Maximum Cross Chat First (TMH)
Degree(8) = 4, Degree(6) = 4 ==> tasks 8 and 6 map to empty cores (Group 1)
Degree(3) = 3, Degree(4) = 3 ==> tasks 3 and 4 map to empty cores (Group 2)
Degree(7) = 2, Degree(1) = 2 ==> tasks 7 and 1 map to empty cores (Group 3)
Degree(5) = 1, Degree(2) = 1 ==> tasks 5 and 2 map to empty cores (Group 4)
Degree(0) = 0 ==> task 0 maps to an empty core (Group 5)
Task scheduling sequence: 8, 6, (6, 8), 3, 4, (4, 3), 7, 1, (1, 7), 5, 2, (2, 5), 0
Final chosen sequence: 8, 6, 3, 4, 7, 1, 5, 2, 0
Maximum Cross Chat First (CMH)
[Table: Task, Core]
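The task ordering in this example can be reproduced with a short sketch. It takes the degree values from the slide as given and simply orders tasks by descending degree, keeping the slide's order within each group; how HSolver computes degrees and breaks ties internally is not shown in the transcript, so that part is an assumption.

```python
from itertools import groupby

# Degree of each task as listed on the slide, in the order the slide lists them.
degree = {8: 4, 6: 4, 3: 3, 4: 3, 7: 2, 1: 2, 5: 1, 2: 1, 0: 0}

# sorted() is stable, so tasks with equal degree keep their listed order.
order = sorted(degree, key=lambda task: -degree[task])

# Consecutive tasks with the same degree form one group.
groups = [list(members) for _, members in groupby(order, key=degree.get)]

print(order)
print(groups)
```

The resulting order is the final chosen sequence from the slide, and the groups correspond to Groups 1 through 5.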

5 Experimental Result (Ex 1) The 1st experiment compares the minimum-cost solutions of each solver as the complexity of the system increases. It evaluates the minimum aggregate cost across 100 randomly generated task sets for naive, heuristic, and exhaustive mappings as the NoC size increases along with a linear increase in the number of messages.

5 Experimental Result (Ex 2) The 2nd experiment evaluates the HSolver approach to determine the rate at which each heuristic generated the low-cost solution. The left figure shows the core selection strategies and the percentage of use of each during heuristic solving; the effectiveness of the core strategies varies significantly. Overall, minimizing the distance between frequently communicating cores is the most beneficial heuristic. The right figure, for task selection strategies, correlates well with these results: two selection strategies account for 98% of the low-cost solutions. The most effective solution is generally obtained by selecting tasks by Maximum Cross-Chat relative to the currently mapped tasks.
[Figures: Percent Use of Core Selection Strategies (left); Percent Use of Task Selection Strategies (right)]

5 Experimental Result (Ex 3) The 3rd experiment assesses the impact of link contention on communication jitter. The figure shows that any single contended link can have a significant impact on the standard deviation of transfer latencies. The X-axis represents the 10 randomly generated task sets, each of which contains 200 messages within its hyper-period; the Y-axis represents the standard deviation, in clock cycles, of each task set for the three mapping approaches. The table shows the timing results for each configuration evaluated in this experiment. All results determined by the heuristic approach converged within a second. Using the exhaustive solver, convergence can take up to 70 minutes for solutions with contention.
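Jitter in this experiment is reported as the standard deviation of transfer latencies. The sketch below shows how a single delayed transfer raises that metric; the latency values are invented for illustration and are not measurements from the experiment.

```python
from statistics import pstdev

# Hypothetical transfer latencies in clock cycles (not measured data).
uncontended = [50, 50, 50, 50, 50]      # every transfer takes the same time
one_contended = [50, 50, 50, 50, 90]    # one transfer is delayed by contention

jitter_uncontended = pstdev(uncontended)
jitter_contended = pstdev(one_contended)

print(jitter_uncontended, jitter_contended)
```

With identical latencies the standard deviation is zero, while a single delayed transfer is enough to make it clearly nonzero.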

5 Experimental Result (Ex 4) The 4th experiment illustrates the impact of unavoidable contention on real-time predictability. It shows the worst case experienced over multiple runs and emphasises the significant impact that contention can have on bounding WCET. The figures depict the cost of sends and receives for one-to-one and two-to-one pairings of senders and receivers.

6 Focus & Improvement
Focus:
1) NoC architecture with static routing and no alternate-path routing
2) Homogeneous architecture and resource mapping to reduce overhead
3) Hard RT systems, considering communication first
4) Predictability for RT systems rather than power, using currently available architectures instead of resorting to simulation
5) Reduction of contention to increase predictability
6) Implemented on top of an architecture that does not provide contention avoidance at the hardware level
7) Software model allows variable frame sizing to avoid impeding performance in systems with little contention
Improvement:
1) The exhaustive solver determines the optimal mapping for solvable NoCs.
2) HSolver quickly generates low-contention solutions for heavily contended NoCs.
3) HSolver can reduce aggregate contention by up to 70% while reducing jitter by up to 40%.

7 Significance
1) The first work to consider IPC for worst-case (WC) time frames to simplify analysis, and to measure the impact on actual hardware for NoC-based real-time multi-core systems.
2) The first work to address predictability of NoC communication by framing messages into temporal windows for real-time tasks.

Question Experiment 3 Experiment 4