Analytic Evaluation of Shared-Memory Systems with ILP Processors

Analytic Evaluation of Shared-Memory Systems with ILP Processors
D.J. Sorin, V.S. Pai, S.V. Adve, M.K. Vernon, D.A. Wood
Presented by Bogdan Romanescu

Introduction
  Motivation: Simulating shared-memory systems with ILP processors takes painfully long
  Hypothesis: It is possible to describe the system with a set of equations whose simple parameters capture system details
  Method: View memory as a system of queues and delay centers
  Metric: Processor throughput

System under test
  Cache-coherent shared-memory multiprocessor
  Mesh interconnection
  Processor
    multiple issue
    out-of-order scheduling
    non-blocking loads
    speculative execution
  L1 and L2 $
    state tracking miss status holding registers (MSHR)
  Interleaved memory and directory

Model parameters
  Architecture parameters
    number of nodes
    number of MSHRs
    NI, bus and switch occupancies
  Application parameters
    ILP parameters: [symbol missing in source], CV, fsynch-write, fM
    Directory coherence parameters: Pread, Pwrite, Pupgrade, Pwb, PL|x, PM|x,y, P3hop|x&not-memory, H, X

Estimating parameters
  Non-ILP dependent: fast simulators for multiprocessors with single-issue, in-order processors
  ILP dependent: FastILP simulator
    Timestamping
    “Eras” division
    Trace-driven simulations

Analytical model
  Output measure: system throughput (IPC) as f(input parameters, system architecture)
  Iterations between 2 models
    Synchronous blocking model (SB): processor stalled due to load and read-modify-write
    MSHR blocking model (MB): processor stalled due to MSHRs full
  MVA equations used for computing delay
  Synchronizations accounted for separately (locks and barriers)
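The slide only names MVA (mean value analysis) as the technique for computing delays. As a rough illustration, the following is a minimal sketch of the textbook exact single-class MVA recursion for a closed network of queueing centers plus a delay center. It is not the model from the talk, which iterates between the SB and MB submodels; the function name `mva` and its parameters are hypothetical.

```python
def mva(service_times, delay_time, num_customers):
    """Textbook exact single-class MVA for a closed queueing network.

    service_times: mean service time at each queueing center (hypothetical inputs)
    delay_time:    total mean time spent at delay centers (no queueing there)
    num_customers: number of customers circulating in the network
    Returns (throughput, residence_times, queue_lengths).
    """
    queue_lengths = [0.0] * len(service_times)
    residence = [0.0] * len(service_times)
    throughput = 0.0
    for n in range(1, num_customers + 1):
        # Arrival theorem: an arriving customer sees the mean queue length
        # of the same network with one customer fewer.
        residence = [s * (1.0 + q) for s, q in zip(service_times, queue_lengths)]
        # Little's law applied to the whole network.
        throughput = n / (delay_time + sum(residence))
        # Little's law applied to each queueing center.
        queue_lengths = [throughput * r for r in residence]
    return throughput, residence, queue_lengths


if __name__ == "__main__":
    # Illustrative numbers only: three queueing centers and one delay center.
    x, r, q = mva([2.0, 1.0, 4.0], delay_time=20.0, num_customers=8)
    print("throughput:", x)
    print("residence times:", r)
    print("queue lengths:", q)
```

The recursion needs only mean service times and the customer population, which is why a model of this kind can be evaluated far faster than a detailed simulation.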

Equations
  Average round-trip time SB
  Total average residence time at NI out queue
  Total mean delay for each type of synchronous transaction at local NI
  Utilization of local NI queue
  Average waiting time at local NI queue due to traffic from remote nodes

Model validations
  Better approximation for the residual life
  Account for significant fsynch-write
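For readers unfamiliar with residual life, here is a minimal sketch of the standard renewal-theory result and how it commonly enters a FCFS waiting-time estimate. This is a textbook form, not necessarily the approximation used in the talk; the function names and the sample numbers are hypothetical.

```python
def mean_residual_life(mean_service, cv):
    """Mean remaining service time seen by a random arrival that finds the
    server busy: E[S^2] / (2 E[S]) = mean * (1 + cv^2) / 2."""
    return mean_service * (1.0 + cv ** 2) / 2.0


def fcfs_waiting_time(mean_service, cv, queue_length, utilization):
    """Textbook-style FCFS waiting-time estimate for general service times.

    queue_length counts customers at the center including the one in service.
    Customers still waiting each need a full service time; the customer in
    service (present with probability `utilization`) needs only its residual life.
    """
    waiting_customers = queue_length - utilization
    return (waiting_customers * mean_service
            + utilization * mean_residual_life(mean_service, cv))


if __name__ == "__main__":
    # Illustrative numbers only: deterministic vs. exponential-like service.
    print(mean_residual_life(10.0, cv=0.0))
    print(mean_residual_life(10.0, cv=1.0))
    print(fcfs_waiting_time(10.0, cv=1.0, queue_length=2.0, utilization=0.5))
```

With cv = 1 the estimate reduces to queue length times mean service time, the familiar exponential case; larger cv values increase the waiting time.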

Applications
  Insights into application behavior
    fM: ability to exploit ILP to overlap read memory requests
    CV: degree of burstiness
  Evaluation of the impact of the number of MSHRs
  Benefits of coupled/decoupled memory and directories
  Analysis of the impact of programmable coherence controllers
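To make "CV: degree of burstiness" concrete, the sketch below assumes that CV denotes the coefficient of variation of the time between memory requests (the slide does not define it). The helper name `interarrival_stats` and both timestamp lists are made up for illustration.

```python
import statistics


def interarrival_stats(timestamps):
    """Return (mean gap, coefficient of variation) of the gaps between
    consecutive request timestamps."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    mean_gap = statistics.fmean(gaps)
    cv = statistics.pstdev(gaps) / mean_gap
    return mean_gap, cv


if __name__ == "__main__":
    # Hypothetical request streams with the same mean gap.
    regular = [0, 10, 20, 30, 40]  # evenly spaced requests
    bursty = [0, 1, 2, 3, 40]      # a burst followed by a long pause
    print("regular:", interarrival_stats(regular))
    print("bursty: ", interarrival_stats(bursty))
```

Both streams issue requests at the same average rate, but the bursty one has a much larger CV, which is the property the parameter is meant to capture.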

Questions
  Is “mean time” a representative measure? How misleading can it be?
  Residual life: even with interpolation, accurate enough?
  Why are the errors going up even after using the 2 accuracy-increasing observations?