STATIC CACHE PARTITIONING ROBUSTNESS ANALYSIS FOR EMBEDDED ON-CHIP MULTI-PROCESSORS Anca M. Molnos, Marc J.M. Heijligers, Jos T.J. van Eijndhoven NXP.

Presentation transcript:

STATIC CACHE PARTITIONING ROBUSTNESS ANALYSIS FOR EMBEDDED ON-CHIP MULTI-PROCESSORS. Anca M. Molnos, Marc J.M. Heijligers, Jos T.J. van Eijndhoven (NXP Semiconductors, HTC); Sorin D. Cotofana (Technical University of Delft)

Outline: Introduction; Cache Partitioning; Robustness Evaluation Methods; Experiment Results; Conclusion

Introduction: Media applications place high demands on computation and memory bandwidth. Hierarchical caches are a common way to alleviate the data-availability problem. Robustness is one of the main required properties in the media application domain.

Associativity-Based Partitioning: Each task gets a number of ways (columns) from every set of the cache. On a cache miss, a task can evict lines only from its own ways. This reduces the number of misses and speeds up the application. The number of allocable resources is limited to the number of ways in a set.
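
The way-based scheme can be illustrated with a small simulation. This is a minimal sketch and not the implementation evaluated in the presentation: the class name `WayPartitionedCache`, the LRU replacement policy, the 64-byte line size, and the contiguous assignment of ways to tasks are all assumptions made for illustration. Lookups search every way of a set, while victim selection is restricted to the ways owned by the requesting task.

```python
class WayPartitionedCache:
    """Set-associative cache in which a task may evict only from its own ways.

    Hypothetical model: LRU replacement and contiguous way assignment are
    assumptions, not details taken from the presentation.
    """

    def __init__(self, num_sets, ways_per_task, line_size=64):
        self.num_sets = num_sets
        self.line_size = line_size
        # Give each task a contiguous group of way indices,
        # e.g. {"A": [0, 1], "B": [2, 3]} for two ways each.
        self.task_ways = {}
        next_way = 0
        for task, count in ways_per_task.items():
            self.task_ways[task] = list(range(next_way, next_way + count))
            next_way += count
        self.num_ways = next_way
        # tags[s][w] is the tag stored in way w of set s (None if empty).
        self.tags = [[None] * self.num_ways for _ in range(num_sets)]
        # last_use[s][w] is a logical timestamp used to pick the LRU victim.
        self.last_use = [[0] * self.num_ways for _ in range(num_sets)]
        self.clock = 0

    def access(self, task, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        self.clock += 1
        line = address // self.line_size
        s = line % self.num_sets
        tag = line // self.num_sets
        # A hit may occur in any way of the set.
        for w in range(self.num_ways):
            if self.tags[s][w] == tag:
                self.last_use[s][w] = self.clock
                return True
        # Miss: the victim is the least recently used way among the
        # requesting task's own ways, so other tasks' lines are untouched.
        victim = min(self.task_ways[task], key=lambda w: self.last_use[s][w])
        self.tags[s][victim] = tag
        self.last_use[s][victim] = self.clock
        return False
```

With one set and two ways per task, a task that streams through many lines cannot displace the lines of the other task, which is the isolation property the slide describes. The granularity limit is also visible: a 4-way cache can be split among at most four tasks.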

Set-Based Partitioning: Each task gets a different number of sets from the cache. This is equivalent to partitioning the address space. The number of allocable resources is large. The implementation is more intrusive into the cache organization.
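
Set-based partitioning amounts to changing how an address is mapped to a set index. The sketch below shows one possible mapping; the helper name `make_set_mapper`, the contiguous set ranges, the modulo indexing, and the 64-byte line size are assumptions for illustration and are not taken from the presentation.

```python
def make_set_mapper(sets_per_task, line_size=64):
    """Build a function that maps (task, address) to a cache set index.

    Hypothetical scheme: each task owns a contiguous range of sets and
    its addresses are folded into that range with a modulo.
    """
    ranges = {}
    base = 0
    for task, count in sets_per_task.items():
        ranges[task] = (base, count)  # (first set, number of sets)
        base += count

    def set_index(task, address):
        first, count = ranges[task]
        line = address // line_size
        # Index only within the task's own range of sets.
        return first + line % count

    return set_index
```

For example, `make_set_mapper({"A": 6, "B": 2})` sends every access of task A to sets 0 to 5 and every access of task B to sets 6 and 7. Because a cache usually has far more sets than ways, this gives finer allocation granularity than way-based partitioning, at the cost of modifying the index computation inside the cache.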

Internal Robustness: Internal variations in task performance are due to task switching. The L2 sensitivity function is used for internal robustness analysis. Two functions are defined: a task sensitivity function and an application sensitivity function.

External Robustness (1/2): External variations are due to different input data sets.

External Robustness (2/2): A stability matrix is used for external robustness analysis. Two measures are defined: the application's stability for a particular input and the overall application stability.

Experiment Environment: Multiprocessor platform: –4 Trimedia processors –Private instruction and data caches –Unified 4-way set-associative L2 cache. Media workload: –MediaBench

Experiment Results: Internal Robustness (1/2)

Experiment Results: Internal Robustness (2/2)

Experiment Results: External Robustness

Conclusion: Static cache partitioning is quite robust, with an average sensitivity of 4% and an average stability of 92%. Future work: dynamic cache partitioning.