Analytic Evaluation of Shared-Memory Systems with ILP Processors
Daniel J. Sorin, Vijay S. Pai, Sarita V. Adve, Mary K. Vernon, and David A. Wood

Presentation transcript:

Analytic Evaluation of Shared-Memory Systems with ILP Processors Daniel J. Sorin, Vijay S. Pai, Sarita V. Adve, Mary K. Vernon, and David A. Wood Presented by Vijeta Johri 17 March 2004

Motivation
An architectural simulator for shared-memory systems with processors that aggressively exploit ILP requires hours to simulate a few seconds of real execution
Goal: develop an analytical model that produces results similar to the simulator's in seconds
–Tractable system of equations
–Small set of simple input parameters
Previous models assume a fixed number of outstanding memory requests

Cache-coherent, release-consistent, shared-memory multiprocessor
–MSI directory-based protocol
Mesh interconnection network
–Wormhole routing
–Separate reply & request networks
L1 cache
–Write-through, multiported & nonblocking
L2 cache
–Write-back, write-allocate & nonblocking
MSHRs track the status of all outstanding misses
–Misses to the same line are coalesced
Instructions retire from the instruction window after completing execution
Directory accesses are overlapped with memory accesses
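The MSHR coalescing mentioned above can be sketched as follows. This is an illustrative model only (the class, method names, and the 64-byte line size are assumptions, not the paper's hardware description): a miss to a cache line that already has an outstanding fetch is merged into the existing entry, and the processor stalls only when a miss to a new line finds every MSHR occupied.

```python
class MSHRFile:
    """Toy model of miss status holding registers with coalescing."""

    def __init__(self, num_entries, line_size=64):
        self.num_entries = num_entries
        self.line_size = line_size
        self.entries = {}  # line address -> list of pending request ids

    def issue_miss(self, addr, req_id):
        """Returns 'coalesced', 'allocated', or 'stall' (all MSHRs full)."""
        line = addr // self.line_size
        if line in self.entries:           # outstanding miss to the same line
            self.entries[line].append(req_id)
            return "coalesced"
        if len(self.entries) == self.num_entries:
            return "stall"                 # processor must block
        self.entries[line] = [req_id]
        return "allocated"

    def reply(self, addr):
        """A memory reply frees the entry; all coalesced requests complete."""
        return self.entries.pop(addr // self.line_size, [])
```

The "stall" case is exactly the MSHR-full blocking that the MB submodel later accounts for.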

Model Parameters
System architecture parameters
Application parameters
–Performance-determining characteristics
–Insensitive to architectural parameters
–ILP parameters capture the processor's ability to overlap multiple memory requests
–Homogeneous applications: each processor has the same value for each parameter, and memory requests are equally distributed

Fast Application Parameter Estimation
Parameters that do not depend on ILP can be measured using fast multiprocessor simulators
FastILP estimates the ILP parameters
–Achieves high performance by abstracting the ILP processor & memory system
–Speeds up processor simulation by computing a completion timestamp for each instruction
–Non-memory instructions: timestamp determined by source-register and functional-unit availability
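The timestamp rule for non-memory instructions can be sketched as below. This is not the actual FastILP code, just a minimal illustration of the stated rule: an instruction starts at the later of its source registers' ready times and its functional unit's availability, and completes after its latency.

```python
def complete_nonmem(srcs, reg_ready, fu_free, latency):
    """Completion timestamp of a non-memory instruction.

    srcs:      names of the source registers
    reg_ready: dict mapping register name -> cycle its value is ready
    fu_free:   cycle the required functional unit becomes available
    latency:   execution latency of the instruction in cycles
    """
    start = max([fu_free] + [reg_ready[r] for r in srcs])
    return start + latency
```

For example, an add whose sources are ready at cycles 5 and 9 on an ALU free at cycle 7 starts at cycle 9, so with a 1-cycle latency it completes at cycle 10.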

Fast Application Parameter Estimation
FastILP also speeds up memory system simulation
–Does not explicitly simulate any part of the memory system beyond the cache hierarchy
–Divides simulated time into eras: an era starts when one or more memory replies unblock the processor and ends when the processor blocks again waiting for a memory reply
–Allows timestamp computation for loads & stores
Estimation of τ, CV_τ & f_M
–Trace-driven simulation of one processor (sufficient for homogeneous applications)
–Traces generated by a trace-generation utility
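Once per-era processor-busy intervals are available, the first two parameters reduce to simple sample statistics. The sketch below assumes τ is the mean processor time per request and CV_τ its coefficient of variation; this interpretation is ours, and the paper's exact definitions may differ.

```python
import statistics

def tau_stats(busy_intervals):
    """Mean think time tau and its coefficient of variation CV_tau,
    computed from a list of processor-busy intervals (in cycles)."""
    tau = statistics.mean(busy_intervals)
    cv = statistics.pstdev(busy_intervals) / tau  # population std dev / mean
    return tau, cv
```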

Analytic Model
System throughput is computed as a function of the dynamically changing number of outstanding memory requests before the processor stalls
Synchronous blocking (SB) submodel
–Computes the fraction of time the processor is stalled by a load or read-modify-write instruction until its data returns
–Number of customers per processor is the maximum number of read requests the processor can issue before blocking
MSHR blocking (MB) submodel
–Computes the additional fraction of time the processor is stalled because the MSHRs are full
–Number of customers per processor = number of MSHRs
–Also computes the blocking time when all MSHRs are occupied by writes

Analytic Model
Mean time each customer occupies the processor
–τ_MB = τ / U_SB, where U_SB is the fraction of time the processor is not stalled in the SB submodel
–τ_SB = τ / U_MB
If M = M_hw, all stalls are due to full MSHRs, and customers represent both reads & writes
Overall throughput is a weighted sum of the submodel throughputs, weighted by the frequency with which each number of outstanding requests is observed for the number of MSHRs in the system
A separate queueing model is used for locks
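The weighted-sum step above can be sketched as follows: given the observed frequency f(M) of each number of outstanding requests M and the submodel throughput X(M) at that level, the overall throughput is their weighted sum. The frequencies and throughput values below are made-up placeholders, not results from the paper.

```python
def overall_throughput(freqs, throughputs):
    """Weighted sum of per-M throughputs.

    freqs:       dict M -> observed frequency of M outstanding requests
                 (frequencies must sum to 1)
    throughputs: dict M -> model throughput with M outstanding requests
    """
    assert abs(sum(freqs.values()) - 1.0) < 1e-9, "frequencies must sum to 1"
    return sum(f * throughputs[m] for m, f in freqs.items())
```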

Model Equations
The SB and MB submodels use a set of customized MVA equations to compute the mean delay for customers at the processor, local and remote memory buses, directories & network interfaces
–R_SB = average round-trip time in the SB submodel
–R_j^synch = total average residence time for a read at memory system resource j
–Z = total fixed delay for a read request
Read transactions retired per unit time = M / R_SB
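To make the MVA style concrete, here is textbook single-class exact Mean Value Analysis for a closed network with M customers, a fixed delay Z, and queueing centers with service demands D[j]. The paper's equations are customized and multiclass, so this is only the underlying pattern, not its actual model: residence times inflate service demands by the queue length seen on arrival, and throughput follows from Little's law.

```python
def mva(M, Z, demands):
    """Exact MVA for a closed single-class network.

    M:       number of customers
    Z:       fixed (think/delay) time per round trip
    demands: list of service demands D[j] at each queueing center
    Returns (throughput X, mean round-trip time R_total).
    """
    n = [0.0] * len(demands)              # mean queue length at each center
    for m in range(1, M + 1):
        # residence time = demand * (1 + queue length seen on arrival)
        R = [d * (1.0 + n_j) for d, n_j in zip(demands, n)]
        X = m / (Z + sum(R))              # Little's law over the whole cycle
        n = [X * r for r in R]            # Little's law per center
    return X, Z + sum(R)
```

With a single center of demand 1.0 and one customer, the round trip is 1.0 and the throughput is 1.0; adding customers saturates the center at X = 1.0, mirroring the "M / R_SB" relation on this slide.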

Model Equations –NIout denotes queues that are used to transmit message into switching n/w and is sum over all such queues of the total mean delay for each type of synchronous transaction y that can visit the the queues as request r or data d message –For computing average number of visits that a synchronous type x message from a type y transaction makes to the local NI, is multiplied by the sum of the average waiting and service times at the queue

Model Equations –The utilization of the local outgoing NI queue by type x messages of a type y transaction is equal to the average total number of visits for these messages (per round trip in the SB model) times their service time divided by the average round trip time of the transactions in the SB model –The average waiting time at the outgoing local NI queue due to traffic from remote nodes is equal to the sum over all transaction types of the synchronous and asynchronous traffic generated remotely that is seen at the queue

Model Equations –Waiting time at queue q of the local NI by remote traffic of type y that is either synchronous or asynchronous is equal to the sum over all message types x of the total number of remote customers times the waiting time that their type y transactions cause on local queue q This waiting time is equal to the time that a customer would wait for those customers in the queue plus the time that the customer would wait for the customer in service

Results
FastILP generates parameters that lead to less than 12% error in throughput relative to RSIM-generated parameters
The model estimates application throughputs ranging from 0.88 to 2.93 instructions retired per cycle with under 10% relative error, in only a few seconds compared with the hours required by detailed simulators