Time Parallel Simulations I: Problem-Specific Approach to Create Massively Parallel Simulations

Outline
– Introduction
  – Space-Time Simulation
– Time Parallel Simulation
– Fix-up Computations
– Example: Parallel Cache Simulation

Space-Time Framework

A simulation computation can be viewed as computing the state of the physical processes in the system being modeled over simulated time.

Algorithm:
1. Partition the space-time region into non-overlapping regions.
2. Assign each region to a logical process (LP).
3. Each LP computes the state of the physical system for its region, using inputs from other regions and producing new outputs to those regions.
4. Repeat step 3 until a fixed point is reached.

[Figure: two space-time diagrams (physical processes vs. simulated time). Left: an arbitrary region in the space-time diagram together with its inputs. Right: a space-parallel decomposition (e.g., Time Warp), in which each of LP 1 through LP 6 covers one physical process across all of simulated time.]

Time Parallel Simulation

Basic idea:
– divide the simulated time axis into non-overlapping intervals
– each processor computes the sample path of the interval assigned to it

Observation: the simulation computation is a sample path through the set of possible states across simulated time.

[Figure: possible system states vs. simulated time; the time axis is divided into five intervals, one per processor (processors 1 through 5).]

Key question: what is the initial state of each interval (processor)?

Time Parallel Simulation: Relaxation Approach

1. Guess the initial state of each interval (processor).
2. Each processor computes the sample path of its interval.
3. Using the final state of the previous interval as the initial state, "fix up" the sample path.
4. Repeat step 3 until a fixed point is reached.

Benefit: massively parallel execution.
Liabilities: cost of the "fix-up" computation; convergence may be slow (worst case, N iterations for N processors); the state may be complex.

[Figure: the same states-vs-time diagram, with each processor's sample path recomputed from its predecessor's final state.]
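A minimal Python sketch of the relaxation loop may make the control flow concrete. All names here (relaxation, run_interval, true_initial_state) are illustrative choices, not from the slides, and a real implementation would run the intervals on separate processors rather than in a sequential loop.

    # run_interval(initial_state, segment) -> result tuple whose first
    # element is the interval's final state.
    def relaxation(segments, run_interval, true_initial_state):
        """Fix-up passes until every interval's initial state stops changing."""
        n = len(segments)
        # Step 1: guess each interval's initial state (here: the true
        # initial state of the whole run, e.g. an empty cache).
        initial = [true_initial_state() for _ in range(n)]
        # Step 2: every interval computes its sample path (one per processor).
        results = [run_interval(s0, seg) for s0, seg in zip(initial, segments)]
        while True:
            # Step 3: interval i restarts from interval i-1's final state;
            # interval 0 always starts from the true initial state.
            fixed_up = [true_initial_state()] + [r[0] for r in results[:-1]]
            if fixed_up == initial:      # step 4: fixed point reached
                return results
            initial = fixed_up
            # (Only intervals whose initial state changed actually need to
            # be re-simulated; this sketch simply redoes them all.)
            results = [run_interval(s0, seg) for s0, seg in zip(initial, segments)]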

Example: Cache Memory

The cache holds a subset of the entire memory:
– memory is organized as blocks
– hit: the referenced block is in the cache
– miss: the referenced block is not in the cache
– the cache has multiple sets, where each set holds some number of blocks (e.g., 4); here, focus on cache references to a single set

The replacement policy determines which block (of a set) to delete to make room when the requested data is not in the cache (a miss):
– LRU: delete the least recently used block (of the set) from the cache

Implementation: least recently used (LRU) stack
– the stack contains memory block addresses (block numbers)
– for each memory reference in the input (memory reference trace):
  – if the referenced address is in the stack (hit), move it to the top of the stack
  – if it is not in the stack (miss), place the address on top of the stack, deleting the address at the bottom
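The LRU stack fits in a few lines of Python; here is a hedged sketch for a single set (simulate_lru, set_size, and initial_stack are made-up names):

    def simulate_lru(trace, set_size, initial_stack=None):
        """Replay a block-reference trace under LRU replacement.

        Returns (final_stack, hits, misses); the stack lists block numbers,
        most recently used first, and is the cache state after the trace.
        """
        stack = list(initial_stack or [])
        hits = misses = 0
        for block in trace:
            if block in stack:          # hit: move the block to the top
                hits += 1
                stack.remove(block)
            else:                       # miss: evict the bottom entry if full
                misses += 1
                if len(stack) == set_size:
                    stack.pop()
            stack.insert(0, block)      # referenced block is now most recent
        return stack, hits, misses

    stack, hits, misses = simulate_lru([1, 2, 3, 1, 4, 2], set_size=4)
    print(hits, misses, stack)          # 2 4 [2, 4, 1, 3]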

Example: Trace-Driven Cache Simulation

Given a sequence of references to blocks in memory, determine the number of hits and misses using LRU replacement.

[Figure: the reference trace and LRU stack contents, split across processors 1–3. First iteration: each processor assumes an initially empty stack. Second iteration: processor i uses the final state of processor i-1 as its initial state (processor 1 is idle); as soon as a recomputed stack matches the first iteration's state at the same point ("match!"), the fix-up can stop: done.]
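The slide's actual address values did not survive the transcript, so the trace below is made up; it simply exercises the two sketches above end to end:

    trace = [3, 1, 4, 1, 5, 2, 6, 5, 3, 1, 4, 2]   # illustrative values only
    n, procs, set_size = len(trace), 3, 4
    segments = [trace[i * n // procs:(i + 1) * n // procs] for i in range(procs)]

    def run_interval(initial_stack, segment):
        return simulate_lru(segment, set_size, initial_stack)

    results = relaxation(segments, run_interval, true_initial_state=list)
    print(sum(r[1] for r in results), "hits,",
          sum(r[2] for r in results), "misses")    # 2 hits, 10 misses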

Parallel Cache Simulation

Time parallel simulation works well here because the final state of the cache for a time segment usually does not depend on the initial state of the cache at the start of that segment:
– LRU: the state of the LRU stack becomes independent of the initial state once memory references have been made to four different blocks (if the set size is four); blocks referenced before that point are no longer retained in the LRU stack
– if one assumes an empty cache at the start of each time segment, the first-round simulation yields an upper bound on the number of misses for the entire simulation
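This warm-up property is easy to check with the simulate_lru sketch above (the trace values are again made up for illustration): once four distinct blocks have been referenced, the final stack no longer depends on the initial stack.

    tail = [7, 8, 9, 10, 8]             # touches four distinct blocks
    a = simulate_lru(tail, 4, initial_stack=[])[0]
    b = simulate_lru(tail, 4, initial_stack=[1, 2, 3, 4])[0]
    assert a == b == [8, 10, 9, 7]      # identical final state either way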

Summary
– The space-time abstraction provides another view of parallel simulation.
– Time parallel simulation:
  – potential for massively parallel computations
  – the central issue is determining the initial state of each time segment
– Simulation of LRU caches is well suited to time parallel simulation techniques.