2013/06/10 Yun-Chung Yang. Kandemir, M., Yemliha, T., Kultursay, E. (Pennsylvania State Univ., University Park, PA, USA). Design Automation Conference (DAC), 2013.

A Helper Thread Based Dynamic Cache Partitioning Scheme for Multithreaded Applications
Kandemir, M., Yemliha, T., Kultursay, E. (Pennsylvania State Univ., University Park, PA, USA)
Design Automation Conference (DAC), 2013, 50th ACM/EDAC/IEEE, pp. 954–959
Presented 2013/06/10 by Yun-Chung Yang

Outline
 Abstract
 Related Work
 Motivation
 Difference between inter- and intra-application partitioning
 Proposed Method
 Experimental Results
 Conclusion

Focusing on the problem of how to partition the cache space given to a multithreaded application across its threads, the authors: (1) show that different threads of a multithreaded application can have different cache space requirements; (2) propose a fully automated, dynamic, intra-application cache partitioning scheme targeting emerging multicores with multilayer cache hierarchies; (3) present a comprehensive experimental analysis of the proposed scheme; and (4) show average improvements of 17.1% on SPECOMP and 18.6% on PARSEC.

Related work on resource management:
 Off-chip bandwidth [3, 10, 13]
 Processor cores [6]
 Shared cache, at application granularity [5, 4, 8, 11, 12, 17, 18, 20]
 Intra-application shared cache [16]
This paper improves on the shared-cache layer of the problem, at intra-application granularity.

Motivation: the authors ran facesim (PARSEC) and art (SPECOMP), evaluated six schemes, and recorded the Average Memory Access Time (AMAT):
 No-partition
 Uniform
 Nonuniform
 Nonuniform-L2
 Nonuniform-L3
 Dynamic
Dynamic outperforms the rest:
 Dividing the application into fixed epochs and repartitioning each epoch performs best.

Inter- and intra-application cache partitioning differ in both objective and implementation:
 Intra-application cache partitioning tries to minimize the latency of the slowest thread — handled by a runtime system or dynamic compiler.
 Inter-application cache partitioning tries to optimize workload throughput — an OS problem.

The dynamic partitioning system is built around a helper thread, whose main responsibility is to partition the cache space allocated to the application so as to maximize its performance. Its components:
 System Interfacing
 Performance Monitoring
 Performance Modeling

Each OS epoch is composed of many application epochs, and each application epoch is divided into five phases:
 Performance Monitoring
 Performance Modeling
 Resource Partitioning
 System Interfacing
 Application Execution
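The five phases above can be sketched as one iteration of the helper thread's loop. This is a minimal, self-contained illustration, not the paper's implementation: the "model" here is simply the heuristic of moving one cache way from the fastest thread to the slowest one, and all names are hypothetical.

```python
def run_epoch(amat, partition, total_ways):
    """One application epoch of the helper thread (illustrative sketch).

    amat:      per-thread AMAT measured this epoch (Performance Monitoring)
    partition: per-thread way counts (current allocation)
    """
    # Performance Modeling (toy version): assume the slowest thread
    # benefits from a way taken from the fastest thread.
    slow = max(amat, key=amat.get)
    fast = min(amat, key=amat.get)
    # Resource Partitioning: move one way, keeping the total fixed.
    if slow != fast and partition[fast] > 1:
        partition[fast] -= 1
        partition[slow] += 1
    # System Interfacing: a real system would now push `partition` to the
    # OS via a system call; Application Execution then resumes.
    assert sum(partition.values()) == total_ways
    return partition
```

Repeating this every epoch is what lets the scheme track phase changes in the threads' cache demands.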

Average Memory Access Time is used as the measure of a thread's cache performance. AMAT:
 The ratio of total cycles spent on memory instructions to the total number of instructions
 Depends on the cache partition size
 Takes the different levels of cache into account
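The slide defines AMAT as memory cycles over instruction count; the "different levels of cache" point it makes can be illustrated with the standard multilevel decomposition. Note this textbook form is an illustration of the dependence on per-level hit rates, not necessarily the paper's exact metric:

```python
def amat_multilevel(t_l2, m_l2, t_l3, m_l3, t_mem):
    """Standard multilevel AMAT decomposition (illustrative).

    t_l2, t_l3: L2/L3 hit times in cycles
    m_l2, m_l3: L2/L3 miss rates (fractions)
    t_mem:      main-memory latency in cycles
    """
    # Each L2 miss pays the L3 access; each L3 miss pays main memory.
    return t_l2 + m_l2 * (t_l3 + m_l3 * t_mem)
```

Because the miss rates depend on how many L2 and L3 ways a thread receives, AMAT is a direct function of the cache partition size, which is why the helper thread monitors it.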

Performance Modeling needs to predict the impact of increasing or decreasing the cache space given to a thread. Each thread is expressed as a 3D surface:
 X and Y axes are the cache space allocations from L2 and L3, respectively
 For thread i, the measured points d(s_L2, s_L3) are used to build a dynamic model of thread i
Purpose: predict the performance of a thread under allocations that have not yet been tried.
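A per-thread model of this kind can be sketched as a table of measured (L2 ways, L3 ways) → AMAT points with a prediction rule for unmeasured allocations. The nearest-neighbor lookup used here is an assumption for illustration; the paper builds its own dynamic model from the d(s_L2, s_L3) samples.

```python
class ThreadModel:
    """Sampled 3D performance surface for one thread (illustrative)."""

    def __init__(self):
        self.samples = {}  # (s_l2, s_l3) -> measured AMAT in cycles

    def record(self, s_l2, s_l3, amat):
        # One point d(s_L2, s_L3) on the thread's surface.
        self.samples[(s_l2, s_l3)] = amat

    def predict(self, s_l2, s_l3):
        # Return the measured value if available, otherwise the value at
        # the nearest measured allocation (Manhattan distance in ways).
        if (s_l2, s_l3) in self.samples:
            return self.samples[(s_l2, s_l3)]
        nearest = min(self.samples,
                      key=lambda p: abs(p[0] - s_l2) + abs(p[1] - s_l3))
        return self.samples[nearest]
```

The partitioner can then query `predict` for candidate allocations before committing ways to a thread.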

For the i-th L2 cache, q_L2,i denotes the total number of cache ways allocated to this application. These q_L2,i ways are shared by m_L2,i threads (indexed from 0 to m_L2,i), and the number of ways allocated to the k-th thread is denoted s_L2,i(k); the per-thread allocations must sum to q_L2,i.

P[t] denotes the cache resources (the numbers of ways in L2 and L3) held by thread t.

New partition information is delivered to the OS using a system call, and a new instruction is added to the ISA with the fields:
 COID — core ID
 CLVL — cache level
 CAID — cache ID
 W — 64-bit-wide way allocation
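The operands of such an instruction could be packed into a single word, for example as follows. Only the 64-bit width of W comes from the slide; the 8-bit widths chosen for COID, CLVL, and CAID, and the field order, are assumptions for illustration.

```python
def encode_partition_op(coid, clvl, caid, w):
    """Pack the partition-update operands into one integer (hypothetical
    encoding; field widths other than W's 64 bits are assumptions)."""
    assert 0 <= coid < 256 and 0 <= clvl < 256 and 0 <= caid < 256
    assert 0 <= w < (1 << 64)      # W: 64-bit-wide way-allocation bitmask
    word = coid                     # COID: core ID (8 bits assumed)
    word = (word << 8) | clvl       # CLVL: cache level, e.g. 2 = L2, 3 = L3
    word = (word << 8) | caid       # CAID: cache ID within that level
    word = (word << 64) | w         # W: one bit per way granted
    return word
```

Treating W as a per-way bitmask mirrors how way-based partitioning hardware is typically configured: each set bit grants the named core one way of the named cache.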

Experiments: the experimental environment, and a comparison against the other schemes on two metrics:
 Average Memory Access Time — the main target of the performance monitoring
 Execution Cycles

Simics and GEMS are used to model the multicore architecture. SPECOMP and PARSEC applications are run, with 120 million instructions as the application epoch.

Eight schemes were evaluated, recording average memory access time:
 No-partition
 Uniform — ways divided as evenly as possible across cores
 Static Best — the best static partition, found through exhaustive search
 Dynamic — the proposed method
 Dynamic-L2 — partitions only L2
 Dynamic-L3 — partitions only L3
 L2+L3 — a separate performance model for each cache level
 Ideal — the optimal strategy


The per-thread results show that the scheme balances the data access latencies of different threads: as execution proceeds, all threads converge to an AMAT of about 8 cycles.

Conclusion: an intra-application cache partitioning scheme for multithreaded applications, with a dynamic model able to partition the cache at multiple layers; average improvements of 17.1% on SPECOMP and 18.6% on PARSEC.
My comments:
 Reminds me of the importance of software/hardware cooperation.
 Thread management is a central issue in CMPs.