L.N. Bhuyan, adapted from Patterson's slides (©UCB)

DAP Spr.‘98 ©UCB 1 Lecture 18: Review

DAP Spr.‘98 ©UCB 2 Cache Organization
(1) How do you know if something is in the cache? (2) If it is in the cache, how do you find it?
The answers to (1) and (2) depend on the type, or organization, of the cache:
Direct mapped cache: each memory address is associated with one possible block within the cache
–Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
Fully associative cache: a block can be placed anywhere, but the design is more complex
N-way set associative cache: N cache blocks for each cache index
–Like having N direct mapped caches operating in parallel
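To make the placement alternatives concrete, here is a minimal C sketch of how a direct-mapped lookup splits an address into tag, index, and offset. The cache geometry (32-byte blocks, 128 blocks) and the printed example address are assumptions chosen only for illustration, not parameters from the slides.

#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 32-byte blocks, 128 blocks (direct mapped). */
#define BLOCK_BYTES 32u
#define NUM_BLOCKS  128u
#define OFFSET_BITS 5u            /* log2(BLOCK_BYTES) */
#define INDEX_BITS  7u            /* log2(NUM_BLOCKS)  */

int main(void) {
    uint32_t addr = 0x12345678;   /* example address, arbitrary */

    uint32_t offset = addr & (BLOCK_BYTES - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* A hit means: the block at `index` is valid and its stored tag == tag.
     * An N-way set-associative cache would compare against N tags in the
     * set selected by `index`; a fully associative cache has no index and
     * compares against every tag in the cache. */
    printf("addr=0x%08x  tag=0x%x  index=%u  offset=%u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}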

DAP Spr.‘98 ©UCB 3 Review: Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
–Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level? (Block identification)
–Tag/Block
Q3: Which block should be replaced on a miss? (Block replacement)
–Random, LRU
Q4: What happens on a write? (Write strategy)
–Write Back or Write Through (with Write Buffer)

DAP Spr.‘98 ©UCB 4 Review: Cache Performance
CPUtime = Instruction Count x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time
To improve cache performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache
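As a quick sanity check on the formula above, here is a minimal C sketch that plugs in numbers; every parameter value is made up purely for illustration.

#include <stdio.h>

/* CPUtime = IC x (CPI_execution + misses per instruction x miss penalty)
 *           x clock cycle time                                           */
int main(void) {
    double ic            = 1e9;    /* instruction count (assumed) */
    double cpi_execution = 1.2;    /* base CPI without memory stalls (assumed) */
    double mem_per_instr = 1.3;    /* memory accesses per instruction (assumed) */
    double miss_rate     = 0.02;   /* cache miss rate (assumed) */
    double miss_penalty  = 50.0;   /* miss penalty in clock cycles (assumed) */
    double cycle_time    = 1e-9;   /* clock cycle time in seconds (assumed) */

    double misses_per_instr = mem_per_instr * miss_rate;
    double cpu_time = ic * (cpi_execution + misses_per_instr * miss_penalty)
                         * cycle_time;

    printf("misses per instruction: %.3f\n", misses_per_instr);
    printf("CPU time: %.3f s\n", cpu_time);
    return 0;
}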

DAP Spr.‘98 ©UCB 5 Cache Optimization Summary
Technique                           MR  MP  HT  Complexity
Larger Block Size                   +   –       0
Higher Associativity                +       –   1
Victim Caches                       +           2
Pseudo-Associative Caches           +           2
HW Prefetching of Instr/Data        +           2
Compiler Controlled Prefetching     +           3
Compiler Reduce Misses              +           0
Priority to Read Misses                 +       1
Subblock Placement                      +   +   1
Early Restart & Critical Word 1st       +       2
Non-Blocking Caches                     +       3
Second Level Caches                     +       2
(MR = miss rate, MP = miss penalty, HT = hit time)

DAP Spr.‘98 ©UCB 6 Virtual Memory
Idea 1: Many programs share DRAM memory, so that context switches can occur
Idea 2: Allow a program to be written without memory constraints – the program can exceed the size of main memory
Idea 3: Relocation – parts of the program can be placed at different locations in memory instead of in one big contiguous chunk
Virtual memory:
(1) DRAM memory holds many programs running at the same time (processes)
(2) uses DRAM memory as a kind of "cache" for disk

DAP Spr.‘98 ©UCB 7 Translation Look-Aside Buffers
TLB is usually small, typically 32-4,096 entries
Like any other cache, the TLB can be fully associative, set associative, or direct mapped
(Figure: the processor presents a virtual address to the TLB; on a hit the physical address goes to the cache and, on a cache miss, to main memory, and data is returned; on a TLB miss the page table is consulted; a page fault or protection violation traps to the OS fault handler, which brings the page in from disk.)
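The sketch below shows one way a direct-mapped TLB lookup could work in C; the page size, TLB size, structure fields, and the installed mapping are all assumptions for illustration, not values from the slides.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parameters: 4 KB pages, 64-entry direct-mapped TLB. */
#define PAGE_BITS   12
#define TLB_ENTRIES 64

typedef struct {
    bool     valid;
    uint64_t vpn;   /* virtual page number (tag) */
    uint64_t ppn;   /* physical page number */
} TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and fills *paddr; on a miss the caller would
 * walk the page table (and possibly trap to the OS on a page fault). */
bool tlb_translate(uint64_t vaddr, uint64_t *paddr) {
    uint64_t vpn    = vaddr >> PAGE_BITS;
    uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    TlbEntry *e     = &tlb[vpn % TLB_ENTRIES];   /* direct-mapped index */

    if (e->valid && e->vpn == vpn) {
        *paddr = (e->ppn << PAGE_BITS) | offset;
        return true;                              /* TLB hit */
    }
    return false;                                 /* TLB miss */
}

int main(void) {
    /* Install one hypothetical mapping: VPN 0x12345 -> PPN 0x00042. */
    tlb[0x12345 % TLB_ENTRIES] = (TlbEntry){ true, 0x12345, 0x00042 };

    uint64_t paddr, vaddr = (0x12345ull << PAGE_BITS) | 0x0abc;
    if (tlb_translate(vaddr, &paddr))
        printf("hit:  0x%llx -> 0x%llx\n",
               (unsigned long long)vaddr, (unsigned long long)paddr);
    else
        printf("miss: 0x%llx\n", (unsigned long long)vaddr);
    return 0;
}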

DAP Spr.‘98 ©UCB 8

DAP Spr.‘98 ©UCB 9 Classification of Computer Systems
Flynn's Classification:
SISD (Single Instruction Single Data)
–Uniprocessors
MISD (Multiple Instruction Single Data)
–???; multiple processors on a single data stream
SIMD (Single Instruction Multiple Data)
–Examples: Illiac-IV, CM-2
»Simple programming model
»Low overhead
»Flexibility
»All custom integrated circuits
–(Phrase reused by Intel marketing for media instructions ~ vector)
MIMD (Multiple Instruction Multiple Data)
–Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
»Flexible
»Use off-the-shelf micros
MIMD is the current winner: the major design emphasis is on MIMD machines with <= 128 processors

DAP Spr.‘98 ©UCB 10 Communication Models
Shared memory
–Processors communicate through a shared address space
–Easy on small-scale machines
–Advantages:
»Model of choice for uniprocessors and small-scale MPs
»Ease of programming
»Lower latency
»Easier to use hardware-controlled caching
Message passing
–Processors have private memories and communicate via explicit messages and protocol software
–Advantages:
»Less hardware, easier to design and scale
»Focuses attention on costly non-local operations
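As an illustration of the shared-memory model, here is a minimal C example (my choice, not from the slides) in which POSIX threads communicate through ordinary loads and stores to a shared variable protected by a lock; under the message-passing model the same exchange would instead use explicit send/receive operations between private memories (e.g., MPI_Send/MPI_Recv).

#include <pthread.h>
#include <stdio.h>

/* Shared-memory communication: threads read and write a shared variable;
 * the hardware (and its cache coherence) makes the stores visible. */
static int shared_sum = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int contribution = *(int *)arg;
    pthread_mutex_lock(&lock);
    shared_sum += contribution;          /* plain store into shared memory */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int vals[2] = { 10, 32 };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &vals[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("shared_sum = %d\n", shared_sum);   /* 42 */
    return 0;
}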

DAP Spr.‘98 ©UCB 11 Symmetric Multiprocessor (SMP)
Memory: centralized, with uniform memory access time (UMA) and a bus interconnect
Examples: Sun Enterprise 5000, SGI Challenge, Intel SystemPro

DAP Spr.‘98 ©UCB 12 Potential HW Coherency Solutions
Snooping solution (snoopy bus):
–Send all requests for data to all processors
–Processors snoop to see if they have a copy and respond accordingly
–Requires broadcast, since caching information is at the processors
–Works well with a bus (natural broadcast medium)
–Dominates for small scale machines (most of the market)
Directory-based schemes (discussed later):
–Keep track of what is being shared in one centralized place (logically)
–Distributed memory => distributed directory for scalability (avoids bottlenecks)
–Send point-to-point requests to processors via the network
–Scales better than snooping
–Actually existed BEFORE snooping-based schemes

DAP Spr.‘98 ©UCB 13 A Basic Snoopy Protocol
Invalidation protocol, write-back cache
Each block of memory is in one state:
–Clean in all caches and up-to-date in memory (Shared)
–OR Dirty in exactly one cache (Exclusive)
–OR Not in any caches
Each cache block is in one state (track these):
–Shared: block can be read
–OR Exclusive: cache has the only copy, it is writable, and dirty
–OR Invalid: block contains no data
Read misses: cause all caches to snoop the bus
Writes to a clean line are treated as misses

DAP Spr.‘98 ©UCB 14 Snoopy-Cache State Machine-III
State machine for CPU requests and for bus requests, for each cache block
(Figure: three cache block states: Invalid, Shared (read only), and Exclusive (read/write). A CPU read from Invalid places a read miss on the bus and moves to Shared; a CPU write from Invalid or Shared places a write miss on the bus and moves to Exclusive; a CPU read or write miss that evicts an Exclusive block writes the block back before placing the new miss on the bus; CPU read/write hits cause no transition. Snooped bus traffic: a write miss for this block invalidates it, writing it back first and aborting the memory access if it was Exclusive; a read miss for this block, if it is Exclusive, writes the block back (aborting the memory access) and downgrades it to Shared.)
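The C sketch below encodes the Invalid/Shared/Exclusive transitions just described. The event names and the simplifications (bus actions printed as text, conflict misses that evict a different block ignored) are mine; it is a sketch of the protocol on the slide, not a full cache controller.

#include <stdio.h>

/* Per-block states of the basic invalidation, write-back protocol. */
typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

/* Events seen by the controller: from the local CPU or snooped on the bus. */
typedef enum {
    CPU_READ, CPU_WRITE,          /* local processor requests          */
    BUS_READ_MISS, BUS_WRITE_MISS /* requests snooped from other caches */
} Event;

BlockState next_state(BlockState s, Event e) {
    switch (e) {
    case CPU_READ:
        if (s == INVALID) { puts("place read miss on bus"); return SHARED; }
        return s;                                /* read hit */
    case CPU_WRITE:
        if (s != EXCLUSIVE) puts("place write miss on bus");
        return EXCLUSIVE;                        /* write hit or upgrade */
    case BUS_READ_MISS:
        if (s == EXCLUSIVE) { puts("write back block"); return SHARED; }
        return s;
    case BUS_WRITE_MISS:
        if (s == EXCLUSIVE) puts("write back block");
        return INVALID;                          /* another cache will write */
    }
    return s;
}

int main(void) {
    BlockState s = INVALID;
    s = next_state(s, CPU_READ);       /* Invalid   -> Shared    */
    s = next_state(s, CPU_WRITE);      /* Shared    -> Exclusive */
    s = next_state(s, BUS_READ_MISS);  /* Exclusive -> Shared    */
    printf("final state: %d\n", s);    /* 1 == SHARED */
    return 0;
}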

DAP Spr.‘98 ©UCB 15 Larger MPs
Separate memory per processor, but sharing the same address space – Distributed Shared Memory (DSM)
Provides the shared-memory paradigm with scalability
Local or remote access via the memory management unit (TLB) – all TLBs map into the same address space
Access to remote memory goes through the network, called the Interconnection Network (IN)
Access to local memory takes less time than access to remote memory
–Keep frequently used programs and data in local memory? A good memory-allocation problem
Access to different remote memories takes different times depending on where they are located – Non-Uniform Memory Access (NUMA) machines

DAP Spr.‘98 ©UCB 16 Distributed Directory MPs

DAP Spr.‘98 ©UCB 17 CC-NUMA Directory Protocol
No bus and don't want to broadcast:
–interconnect no longer a single arbitration point
–all messages have explicit responses
Terms: typically 3 processors involved
–Local node: where a request originates
–Home node: where the memory location of an address resides
–Remote node: has a copy of the cache block, whether exclusive or shared => cache-to-cache transfer
Q: How is a read/write done when the block is in the Invalid, Shared, or Exclusive state? How much time is taken for a read operation?
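To make the home-node bookkeeping concrete, here is one possible C representation of a directory entry and a sketch of read-miss handling. The bit-vector sharer list, field names, and printed "messages" are assumptions for illustration; a real protocol exchanges explicit request/reply/ack messages between local, home, and remote nodes.

#include <stdint.h>
#include <stdio.h>

/* One possible directory entry kept at the home node, assuming at most
 * 64 processors (full bit-vector of sharers). */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } DirState;

typedef struct {
    DirState state;
    uint64_t sharers;   /* bit i set => node i holds a copy */
} DirEntry;

/* Home-node handling of a read miss from node `requester` (sketch only). */
void dir_read_miss(DirEntry *d, int requester) {
    if (d->state == DIR_EXCLUSIVE) {
        /* Ask the single owner to write back / forward the block
         * (cache-to-cache transfer).  __builtin_ctzll is a GCC/Clang
         * builtin that finds the owner's bit position. */
        printf("fetch block from owner node %d\n",
               __builtin_ctzll(d->sharers));
        d->state = DIR_SHARED;
    }
    if (d->state == DIR_UNCACHED) d->state = DIR_SHARED;
    d->sharers |= 1ull << requester;   /* record the new sharer */
    printf("send data to node %d\n", requester);
}

int main(void) {
    DirEntry d = { DIR_EXCLUSIVE, 1ull << 3 };  /* node 3 owns the block */
    dir_read_miss(&d, 7);                       /* node 7 issues a read miss */
    printf("state=%d sharers=0x%llx\n", d.state,
           (unsigned long long)d.sharers);
    return 0;
}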

DAP Spr.‘98 ©UCB 18 Interprocessor Communication Time
Total Latency = Sender Overhead + Time of Flight + Message Size ÷ Bandwidth + Receiver Overhead
(Figure: timeline from sender to receiver showing sender overhead (processor busy), transmission time (size ÷ bandwidth) at each end, time of flight, transport latency, and receiver overhead (processor busy).)
Does the bandwidth calculation include the header/trailer?
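Here is a small C sketch that plugs made-up numbers into the total-latency formula above; all parameter values are assumptions chosen only to show the arithmetic.

#include <stdio.h>

/* Total latency = sender overhead + time of flight
 *               + message size / bandwidth + receiver overhead */
int main(void) {
    double sender_overhead_us   = 1.0;    /* assumed, microseconds */
    double receiver_overhead_us = 1.5;    /* assumed, microseconds */
    double time_of_flight_us    = 0.5;    /* assumed, microseconds */
    double bw_bytes_per_us      = 1000.0; /* assumed: 1 GB/s = 1000 bytes/us */
    double msg_bytes            = 4096.0; /* assumed message size */

    double transmission_us = msg_bytes / bw_bytes_per_us;
    double total_us = sender_overhead_us + time_of_flight_us
                    + transmission_us + receiver_overhead_us;

    printf("transmission time: %.3f us\n", transmission_us); /* 4.096 us */
    printf("total latency:     %.3f us\n", total_us);        /* 7.096 us */
    return 0;
}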

DAP Spr.‘98 ©UCB 19 Static Interconnection Networks
Examples: Intel Paragon (2D mesh), SGI Origin (hypercube), Cray T3E (3D mesh)
Properties of the hypercube:
Source S = (s_{n-1} s_{n-2} … s_i … s_2 s_1 s_0), destination D = (d_{n-1} d_{n-2} … d_i … d_2 d_1 d_0)
E-cube routing: for i = 0 to n-1, compare s_i and d_i; route along dimension i if they differ
Distance = Hamming distance between S and D = the number of dimensions in which S and D differ
Diameter = maximum distance = n = log2 N = dimension of the hypercube
Number of alternate paths = n; fault tolerance = n-1 = O(log2 N)
(Figure: 2D grid, 2D torus, and 3D cube topologies.)
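The C sketch below traces E-cube routing on a hypercube, treating node labels as n-bit integers; the function name and the example endpoints are mine, chosen only for illustration.

#include <stdio.h>

/* E-cube routing on an n-dimensional hypercube: correct the lowest-numbered
 * dimension in which the current node and the destination still differ,
 * repeating for i = 0..n-1.  The number of hops equals the Hamming
 * distance between source and destination. */
void ecube_route(unsigned src, unsigned dst, int n) {
    unsigned cur = src;
    printf("route %u -> %u:", src, dst);
    for (int i = 0; i < n; i++) {
        if (((cur ^ dst) >> i) & 1u) {   /* s_i and d_i differ */
            cur ^= 1u << i;              /* traverse dimension i */
            printf(" %u", cur);
        }
    }
    printf("\n");
}

int main(void) {
    /* 3-cube example: source 1 (001), destination 6 (110);
     * Hamming distance 3, so the route takes 3 hops: 0, 2, 6. */
    ecube_route(1, 6, 3);
    return 0;
}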

DAP Spr.‘98 ©UCB 20 Dynamic Network - Crossbar Switch
Design complexity is O(N^2) for an N x N crossbar – Why? (There is a crosspoint switch at each of the N^2 input/output intersections.)

DAP Spr.‘98 ©UCB 21 Multistage Interconnection Networks
Omega network and self-routing
Note: complexity is O(N log2 N); subject to conflicts and lower bandwidth than a crossbar, but cost effective
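Self-routing in an Omega network uses the destination tag: at each stage the 2x2 switch looks at one bit of the destination, from most to least significant, and routes to its upper output (bit 0) or lower output (bit 1). The C sketch below traces one message under the usual perfect-shuffle wiring between stages; the function name and example endpoints are mine.

#include <stdio.h>

/* Destination-tag self-routing through an Omega network with N = 2^n inputs.
 * This sketch only traces the intermediate line numbers of one message. */
void omega_route(unsigned src, unsigned dst, int n) {
    unsigned N = 1u << n;
    unsigned pos = src;
    printf("source %u -> destination %u:", src, dst);
    for (int i = 0; i < n; i++) {
        unsigned bit = (dst >> (n - 1 - i)) & 1u;   /* destination tag bit */
        /* Perfect shuffle (rotate the label left), then the 2x2 switch
         * sets the low bit to the destination bit (straight or exchange). */
        pos = ((pos << 1) | bit) & (N - 1);
        printf(" stage %d -> line %u;", i, pos);
    }
    printf("\n");
}

int main(void) {
    omega_route(2, 5, 3);   /* 8x8 Omega network: 3 stages of 2x2 switches */
    return 0;
}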

DAP Spr.‘98 ©UCB 22 Switching Techniques
Circuit switching: a control message is sent from source to destination and a path is reserved; communication then starts, and the path is released when communication is complete
Store-and-forward policy (packet switching): each switch waits for the full packet to arrive before sending it to the next switch (good for WANs)
Cut-through routing or wormhole routing: the switch examines the header, decides where to send the message, and then starts forwarding it immediately
–In wormhole routing, when the head of the message is blocked, the message stays strung out over the network, potentially blocking other messages (each switch needs to buffer only the piece of the packet that is sent between switches). The CM-5 uses it, with each switch buffer being 4 bits per port.
–Cut-through routing lets the tail continue when the head is blocked, storing the whole message in an intermediate switch (requires a buffer large enough to hold the largest packet).
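To see why cut-through and wormhole routing pay off, here is a deliberately simplified C latency model (no contention, header delay folded into a per-hop switch delay); all parameter values are made up for illustration.

#include <stdio.h>

/* Store-and-forward: the whole packet is received before being forwarded,
 * so the serialization delay is paid on every hop.
 * Cut-through / wormhole (unblocked): only a per-hop routing delay is paid
 * at each switch; the packet is serialized onto a link once. */
int main(void) {
    double packet_bytes    = 1024.0;  /* assumed packet size */
    double bw_bytes_per_us = 200.0;   /* assumed link bandwidth */
    double per_hop_us      = 0.2;     /* assumed routing/switch delay per hop */
    int    hops            = 4;       /* assumed path length */

    double serialize_us = packet_bytes / bw_bytes_per_us;      /* 5.12 us */

    double store_and_forward = hops * (serialize_us + per_hop_us);
    double cut_through       = hops * per_hop_us + serialize_us;

    printf("store-and-forward: %.2f us\n", store_and_forward); /* 21.28 us */
    printf("cut-through:       %.2f us\n", cut_through);       /*  5.92 us */
    return 0;
}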