Presenter: Zong Ze-Huang. Fast and Accurate Resource Conflict Simulation for Performance Analysis of Multi-Core Systems. Stattelmann, S.; Bringmann, O.

Presentation transcript:

Presenter: Zong Ze-Huang
Fast and Accurate Resource Conflict Simulation for Performance Analysis of Multi-Core Systems
Stattelmann, S.; Bringmann, O.; Rosenstiel, W.
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011

This work presents a SystemC-based simulation approach for fast performance analysis of parallel software components, using source code annotated with low-level timing properties. In contrast to other source-level approaches for performance analysis, timing attributes obtained from binary code can be annotated even if compiler optimizations are used, without requiring changes in the compiler. To consider concurrent accesses to shared resources like caches accurately during a source-level simulation, an extension of the SystemC TLM-2.0 standard for reducing the necessary synchronization overhead is proposed as well. This enables the simulation of low-level timing effects without performing a full-fledged instruction set simulation and at speeds close to pure native execution.

What is the Problem
- Estimating software performance at an early design stage is often crucial.
- Estimating the performance of software-intensive systems is a complex task.
  - An instruction set simulator (ISS) can be used to estimate the performance of binary code, but it is slow.
  - In multi-core systems, software components interact through concurrent accesses to shared resources.
Proposed methods
- A fast and accurate approach for system-level performance analysis.
  - Execution times are estimated without using an ISS: software components are annotated with timing properties at the source level.
- The Quantum Giver synchronization approach is presented.
- An extension of the SystemC TLM-2.0 standard is described which reduces the synchronization overhead.

Related work
- Cited: [11]; [12], [13]; [7], [14], [15]
  - Dynamic performance evaluation using source code.
  - Compiler optimizations are supported through the use of modified compiler tool chains.
  - A cache simulation developed specifically for non-functional simulation.
- [17]: The concept of Result-Oriented Modeling has similarities with the Quantum Giver approach.
- [19]: Increases the simulation performance of TLM-2.0 models.
This paper
- Provides solutions for an efficient temporally decoupled simulation of shared caches.
- Reduces additional context switches.

Loosely-timed (TLM-LT) is one of the TLM-2.0 coding styles: “fast”.
- Suited to early software development.
- The other coding style is approximately-timed (TLM-AT): “accurate”.
Temporally decoupled simulation is used to increase simulation performance.
- Simulation parts that do not interact frequently with the surrounding environment may run ahead of the current simulation time for a short amount of time.
- This avoids unnecessary kernel synchronization points and context switches.
- As soon as the local time offset of a temporally decoupled process reaches the global time quantum, the process must synchronize its local time with the global simulation time by calling wait().
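The quantum rule can be illustrated with a small model. This is a minimal sketch in plain Python, not SystemC code: the class `DecoupledProcess`, its fields, and the `run` helper are hypothetical names chosen for illustration, and `sync()` only stands in for a call to wait() in a real kernel.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecoupledProcess:
    """Toy model of a temporally decoupled process (assumed names, not SystemC)."""
    quantum: int           # global time quantum, in ns
    global_time: int = 0   # simulation time at the last synchronization
    local_offset: int = 0  # how far the process has run ahead of global_time
    sync_count: int = 0    # number of kernel synchronizations performed

    def consume(self, delay: int) -> None:
        """Advance local time; synchronize only once the quantum is reached."""
        self.local_offset += delay
        if self.local_offset >= self.quantum:
            self.sync()

    def sync(self) -> None:
        """Stand-in for wait(): fold the local offset into the global time."""
        self.global_time += self.local_offset
        self.local_offset = 0
        self.sync_count += 1


def run(quantum: int, delays: List[int]) -> DecoupledProcess:
    """Run one process over a list of annotated delays."""
    proc = DecoupledProcess(quantum=quantum)
    for delay in delays:
        proc.consume(delay)
    return proc


decoupled = run(quantum=50, delays=[10] * 10)  # synchronizes every 50 ns
lock_step = run(quantum=10, delays=[10] * 10)  # synchronizes on every delay
```

Both runs end at the same simulated time, but the decoupled run reaches it with far fewer synchronizations, which is where the speed-up of the loosely-timed style comes from.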


The TLM-LT coding style cannot replace a synchronization mechanism for resolving data dependencies between processes or accesses to shared resources.

Consequences of lacking a synchronization mechanism when temporal decoupling is used:
- A process may read stale data, or newly written data can be overwritten by an earlier write from another process that is scheduled afterwards. For instance, when an instruction or data cache is simulated, the order of accesses can determine whether an access is a cache hit or a cache miss.
- In the worst case, excessive synchronization will degrade the TLM-LT simulation performance to the level of TLM-AT.
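The effect of access order on hit/miss results can be reproduced with a toy cache. The sketch below is an assumption-laden illustration in plain Python: the direct-mapped cache, the addresses, and the timestamps are invented for the example and do not come from the paper.

```python
def simulate(accesses, num_lines=4):
    """Replay (time, process, address) accesses, in the order given, through
    a direct-mapped cache. Returns a list of (time, process, address, hit)."""
    lines = {}
    results = []
    for time, proc, addr in accesses:
        index = addr % num_lines          # cache line selected by the address
        hit = lines.get(index) == addr    # hit only if the same address is resident
        lines[index] = addr               # fill or replace the line
        results.append((time, proc, addr, hit))
    return results


# Two processes whose addresses map to the same cache line.
p0 = [(10, "P0", 0x10), (30, "P0", 0x10)]
p1 = [(20, "P1", 0x20)]

# Temporally decoupled execution: P0 runs its whole quantum before P1.
decoupled = simulate(p0 + p1)
# Reference behavior: accesses replayed in timestamp order.
in_order = simulate(sorted(p0 + p1))

decoupled_hits = sum(hit for *_, hit in decoupled)
in_order_hits = sum(hit for *_, hit in in_order)
```

In execution order, the second access of P0 is counted as a hit; in timestamp order, the access of P1 at time 20 evicts the line first, so the same access becomes a miss.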

The Quantum Giver synchronization approach works in three phases:
1) Simulation phase
  - Processes are simulated using temporal decoupling.
  - All transactions issued by an initiator are completed immediately.
  - After the initiators have reached the maximal local time offset, the synchronization phase is executed.
2) Synchronization phase
  - All target components order the transactions they have received in the simulation phase.
  - They detect any changes in the previously predicted time for transactions due to conflicts.
  - These changes are then broadcast to all other target components, possibly triggering further changes until all components have reached a stable state.
3) Scheduling phase
  - The Quantum Giver creates SystemC events to wake up the respective processes.
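The three phases can be sketched for a single shared cache. This is only an illustration under stated assumptions, not the authors' implementation: `SharedCache`, `schedule`, the latencies, and the addresses are hypothetical, there is only one target (so no broadcast between targets), and timestamps of later transactions are not shifted by earlier corrections.

```python
HIT_NS, MISS_NS = 1, 10  # assumed latencies, for illustration only


class SharedCache:
    """Toy direct-mapped cache target that logs transactions per quantum."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.committed = {}    # cache state agreed at the last synchronization
        self.speculative = {}  # state seen while initiators run ahead
        self.log = []          # (time, initiator, addr, predicted_latency)

    def _access(self, lines, addr):
        index = addr % self.num_lines
        hit = lines.get(index) == addr
        lines[index] = addr
        return HIT_NS if hit else MISS_NS

    def begin_quantum(self):
        self.speculative = dict(self.committed)
        self.log = []

    def transport(self, time, initiator, addr):
        """Simulation phase: complete immediately with a predicted latency."""
        latency = self._access(self.speculative, addr)
        self.log.append((time, initiator, addr, latency))
        return latency

    def synchronize(self):
        """Synchronization phase: replay the log in timestamp order and
        return the latency correction accumulated per initiator."""
        lines = dict(self.committed)
        corrections = {}
        for time, initiator, addr, predicted in sorted(self.log):
            actual = self._access(lines, addr)
            corrections[initiator] = corrections.get(initiator, 0) + actual - predicted
        self.committed = lines
        return corrections


def schedule(quantum_end, corrections):
    """Scheduling phase: compute the wake-up time of each initiator."""
    return {name: quantum_end + delta for name, delta in corrections.items()}


cache = SharedCache()
cache.begin_quantum()
for t in (10, 30):                 # P0 runs its whole quantum first
    cache.transport(t, "P0", 0x10)
cache.transport(20, "P1", 0x20)    # P1 runs afterwards, with an earlier timestamp
corrections = cache.synchronize()
wake_up = schedule(100, corrections)
```

Here the second access of P0 was predicted as a hit, but the replay shows that P1 evicted the line first, so P0 is woken 9 ns later than the end of the quantum while P1 needs no correction.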


Using a mapping between the source code and the binary code, the source code can be annotated with information about timing behavior before it is used in the simulation model. If an optimizing compiler is used, the compiler-generated debug information might be incorrect.
- Inconsistencies in the compiler-generated debug information must be eliminated.
- The relation between basic blocks in the binary code and source code lines is reconstructed.
The commercial tool AbsInt aiT was integrated into the analysis flow to produce a binary-level control flow graph annotated with execution times.
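What annotated source code looks like can be sketched as follows. This is a hedged illustration in plain Python: the block names, the cycle counts in `BLOCK_CYCLES`, and the `CycleCounter` class are hypothetical stand-ins for values and instrumentation that a binary-level timing analysis and annotation flow would provide.

```python
# Hypothetical cycle counts per basic block, standing in for execution
# times obtained from a binary-level control flow graph.
BLOCK_CYCLES = {"entry": 4, "loop_body": 7, "exit": 2}


class CycleCounter:
    """Accumulates the annotated execution time during native execution."""

    def __init__(self):
        self.cycles = 0

    def consume(self, block):
        self.cycles += BLOCK_CYCLES[block]


def annotated_sum(values, counter):
    """Functional code with one timing annotation per basic block."""
    counter.consume("entry")
    total = 0
    for value in values:
        counter.consume("loop_body")
        total += value
    counter.consume("exit")
    return total


counter = CycleCounter()
result = annotated_sum([1, 2, 3], counter)
```

The function runs natively and computes its normal result, while the counter accumulates 4 + 3 * 7 + 2 = 27 cycles, which is the kind of timing estimate that is obtained without an instruction set simulator.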

Analysis and Instrumentation Work Flow

Three synchronization approaches were used to test speed and accuracy:
1. No synchronization: each access to the instruction cache was executed directly during TLM-LT simulation.
2. Synchronization using the Quantum Giver approach.
3. Explicit synchronization using calls to wait() before each cache access, which corresponds to TLM-AT.

The simulation performance is improved by about 25% compared with lock-step simulation.

The estimates of the Quantum Giver synchronization approach are very close to those reported by lock-step simulation, and more accurate than those of simulation without synchronization.