Analytic Evaluation of Shared-Memory Systems with ILP Processors


Analytic Evaluation of Shared-Memory Systems with ILP Processors D.J. Sorin, V.S. Pai, S.V. Adve, M.K. Vernon, D.A. Wood Presented by Bogdan Romanescu

Introduction
- Motivation: simulating shared-memory systems with ILP processors takes painfully long.
- Hypothesis: the system can be described by a set of equations whose simple parameters capture the system details.
- Method: view memory as a system of queues and delay centers.
- Metric: processor throughput.

System under test
- Cache-coherent shared-memory multiprocessor
- Mesh interconnection
- Processor: multiple issue, out-of-order scheduling, non-blocking loads, speculative execution
- L1 and L2 caches: state tracking, miss status holding registers (MSHRs)
- Interleaved memory and directory

Model parameters
- Architecture parameters: number of nodes; number of MSHRs; NI, bus, and switch occupancies
- Application parameters:
  - ILP parameters: [symbol missing], CV, fsynch-write, fM
  - Directory coherence parameters: Pread, Pwrite, Pupgrade, Pwb, PL|x, PM|x,y, P3hop|x&not-memory, H, X

Estimating parameters
- Non-ILP-dependent parameters: fast simulators for multiprocessors with single-issue, in-order processors
- ILP-dependent parameters: FastILP simulator
  - Timestamping
  - Division into “eras”
  - Trace-driven simulation

Analytical model
- Output measure: system throughput (IPC) as a function of the input parameters and the system architecture
- The solution iterates between two models:
  - Synchronous blocking (SB) model: the processor is stalled by loads and read-modify-writes
  - MSHR blocking (MB) model: the processor is stalled because the MSHRs are full
- Mean value analysis (MVA) equations are used to compute the delays
- Synchronization (locks and barriers) is accounted for separately
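As an illustration of what "MVA equations used for computing delay" means, here is a minimal sketch of textbook exact mean value analysis for a closed, single-class network of queueing centers plus a delay center. It is not the customized approximate MVA from the paper. The function name `mva`, its parameters, and the mapping in the usage note (customers as outstanding memory requests, delay as compute time between requests) are assumptions made for this example.

```python
def mva(service_demands, delay, n_customers):
    """Exact single-class MVA for a closed queueing network (textbook form).

    service_demands: mean service demand at each queueing center
    delay:           total mean time spent at delay centers (no queueing)
    n_customers:     number of customers circulating in the network
    Returns (throughput, residence_times, queue_lengths).
    """
    queue = [0.0] * len(service_demands)   # mean queue lengths with 0 customers
    residence = list(service_demands)
    throughput = 0.0
    for n in range(1, n_customers + 1):
        # Arrival theorem: an arriving customer sees the queues as they
        # would be in the same network with n - 1 customers.
        residence = [d * (1.0 + q) for d, q in zip(service_demands, queue)]
        # Little's law applied to the whole network.
        throughput = n / (delay + sum(residence))
        # Little's law applied to each queueing center.
        queue = [throughput * r for r in residence]
    return throughput, residence, queue


if __name__ == "__main__":
    # Hypothetical numbers: two queueing centers (e.g. a network interface
    # and a memory module), 2 time units of delay, 2 customers.
    x, r, q = mva([1.0, 1.0], 2.0, 2)
    print(x, r, q)
```

In this reading, raising `n_customers` plays the role of allowing more requests in flight: throughput rises until the busiest center saturates, which is the kind of trade-off the model is meant to expose.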

Equations
- Average round-trip time in the SB model
- Total average residence time at the NI out queue
- Total mean delay for each type of synchronous transaction at the local NI
- Utilization of the local NI queue
- Average waiting time at the local NI queue due to traffic from remote nodes

Model validations
- Better approximation for the residual life
- Account for significant fsynch-write
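The slides do not reproduce the residual-life expression, so the sketch below uses the standard renewal-theory result rather than the paper's interpolated approximation: a request arriving at a busy server sees, on average, a remaining service time of S(1 + CV²)/2, where S is the mean service time and CV its coefficient of variation. The function names and the simple waiting-time decomposition are assumptions made for illustration.

```python
def mean_residual_life(mean_service, cv):
    """Mean remaining service time seen by a random arrival at a busy server.

    Standard renewal-theory result: E[S^2] / (2 E[S]) = S * (1 + CV^2) / 2.
    """
    return 0.5 * mean_service * (1.0 + cv ** 2)


def waiting_time(utilization, mean_service, cv, mean_waiting_in_queue):
    """Mean wait before service at a first-come, first-served queue.

    Full service times of the requests already waiting, plus the residual
    life of the request in service, weighted by the chance the server is busy.
    """
    return (mean_waiting_in_queue * mean_service
            + utilization * mean_residual_life(mean_service, cv))


if __name__ == "__main__":
    # Deterministic service (CV = 0) leaves half a service time on average;
    # exponential service (CV = 1) leaves a full one.
    print(mean_residual_life(4.0, 0.0), mean_residual_life(4.0, 1.0))
```

The gap between the CV = 0 and CV = 1 cases shows why the choice of residual-life approximation matters for model accuracy.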

Applications
- Insights into application behavior:
  - fM: ability to exploit ILP to overlap read memory requests
  - CV: degree of burstiness
- Evaluating the impact of the number of MSHRs
- Benefits of coupled versus decoupled memory and directories
- Analyzing the impact of programmable coherence controllers
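To make "CV: degree of burstiness" concrete, the following sketch computes the coefficient of variation of the gaps between consecutive memory requests. The timestamps are hypothetical and the function name is an assumption; the way the presented work actually measures this parameter (via the FastILP simulator) is not reproduced here.

```python
from statistics import mean, pstdev


def coefficient_of_variation(request_times):
    """CV of the gaps between consecutive request timestamps.

    0 means perfectly regular requests; values above 1 indicate traffic
    that is burstier than a Poisson stream.
    """
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]
    return pstdev(gaps) / mean(gaps)


if __name__ == "__main__":
    regular = [0, 10, 20, 30]   # evenly spaced requests
    bursty = [0, 1, 2, 30]      # a short burst followed by a long gap
    print(coefficient_of_variation(regular))
    print(coefficient_of_variation(bursty))
```

Both example streams issue the same number of requests over the same interval, so they share a mean gap; only the CV tells them apart.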

Questions
- Is “mean time” a representative measure? How misleading can it be?
- Residual life: even with interpolation, is it accurate enough?
- Why do the errors go up even after applying the two accuracy-increasing observations?