Lecture 18: Review
Cache Organization

(1) How do you know if something is in the cache?
(2) If it is in the cache, how do you find it?
The answers to (1) and (2) depend on the type, or organization, of the cache:
Direct mapped cache: each memory address is associated with one possible block within the cache
–Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
Fully associative cache: a block can be placed anywhere, but the design is complex
N-way set associative cache: N cache blocks for each cache index
–Like having N direct mapped caches operating in parallel
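To make the direct-mapped lookup concrete, here is a minimal C sketch; the 64-byte block size and 256 sets are illustrative assumptions, not parameters from the lecture:

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed parameters: 64-byte blocks, 256 sets (direct mapped). */
#define BLOCK_BITS 6            /* log2(64)  -> byte offset within block */
#define INDEX_BITS 8            /* log2(256) -> which cache line         */
#define NUM_SETS   (1u << INDEX_BITS)

typedef struct {
    bool     valid;
    uint32_t tag;
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Answers question (2): the index selects the single line to check;
 * and question (1): the stored tag plus the valid bit tell us whether
 * the block is actually present. */
bool cache_hit(uint32_t addr)
{
    uint32_t index = (addr >> BLOCK_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);
    return cache[index].valid && cache[index].tag == tag;
}
```

A fully associative cache would instead compare the tag against every line, and an N-way set associative cache against the N lines of one set.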
Review: Four Questions for Memory Hierarchy Designers

Q1: Where can a block be placed in the upper level? (Block placement)
–Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level? (Block identification)
–Tag/Block
Q3: Which block should be replaced on a miss? (Block replacement)
–Random, LRU
Q4: What happens on a write? (Write strategy)
–Write Back or Write Through (with Write Buffer)
Review: Cache Performance

CPU time = Instruction Count x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time

Misses per instruction = Memory accesses per instruction x Miss rate

CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time

To improve cache performance:
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache
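Plugging assumed numbers into the first formula shows how much misses inflate CPI; all values below are illustrative, not from the lecture:

```c
#include <stdio.h>

/* Worked example of the CPU-time formula above; every number is an
 * illustrative assumption. */
int main(void)
{
    double ic             = 1e9;    /* instruction count            */
    double cpi_execution  = 1.0;    /* base CPI with no misses      */
    double mem_per_instr  = 1.3;    /* memory accesses/instruction  */
    double miss_rate      = 0.02;   /* 2% miss rate                 */
    double miss_penalty   = 50.0;   /* cycles per miss              */
    double clock_cycle    = 1e-9;   /* seconds (1 GHz clock)        */

    double cpi = cpi_execution + mem_per_instr * miss_rate * miss_penalty;
    double cpu_time = ic * cpi * clock_cycle;

    /* CPI = 1.0 + 1.3 * 0.02 * 50 = 2.3: misses more than double CPI. */
    printf("effective CPI = %.2f, CPU time = %.2f s\n", cpi, cpu_time);
    return 0;
}
```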
Cache Optimization Summary

Technique                          MR  MP  HT  Complexity
Larger Block Size                  +   –       0
Higher Associativity               +       –   1
Victim Caches                      +           2
Pseudo-Associative Caches          +           2
HW Prefetching of Instr/Data       +           2
Compiler Controlled Prefetching    +           3
Compiler Reduce Misses             +           0
Priority to Read Misses                +       1
Subblock Placement                     +   +   1
Early Restart & Critical Word 1st      +       2
Non-Blocking Caches                    +       3
Second Level Caches                    +       2

(MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)
Virtual Memory

Idea 1: Many programs share DRAM memory so that context switches can occur
Idea 2: Allow a program to be written without memory constraints – the program can exceed the size of main memory
Idea 3: Relocation – parts of the program can be placed at different locations in memory instead of in one big chunk
Virtual Memory:
(1) DRAM memory holds many programs running at the same time (processes)
(2) DRAM memory is used as a kind of “cache” for disk
Translation Look-Aside Buffers

The TLB is usually small, typically 32–4,096 entries
Like any other cache, the TLB can be fully associative, set associative, or direct mapped

[Figure: the processor sends a virtual address to the TLB; on a hit, the physical address goes to the cache and, on a cache miss, to main memory; on a TLB miss, the page table supplies the translation; a page fault or protection violation traps to the OS fault handler, which accesses disk.]
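A minimal C sketch of the TLB hit path, assuming 4 KB pages and a 64-entry direct-mapped TLB (both sizes are assumptions for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed: 4 KB pages -> 12 offset bits; 64-entry direct-mapped TLB. */
#define PAGE_BITS   12
#define TLB_ENTRIES 64

typedef struct {
    bool     valid;
    uint32_t vpn;   /* virtual page number  */
    uint32_t ppn;   /* physical page number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and fills *paddr; on a miss the page
 * table would be walked and the TLB refilled (not shown). */
bool tlb_translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_BITS;
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *paddr = (e->ppn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
        return true;   /* hit: physical address goes on to the cache */
    }
    return false;      /* miss: consult the page table */
}
```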
Classification of Computer Systems

Flynn’s Classification:
SISD (Single Instruction Single Data)
–Uniprocessors
MISD (Multiple Instruction Single Data)
–???; multiple processors on a single data stream
SIMD (Single Instruction Multiple Data)
–Examples: Illiac-IV, CM-2
»Simple programming model
»Low overhead
»Flexibility
»All custom integrated circuits
–(Phrase reused by Intel marketing for media instructions ~ vector)
MIMD (Multiple Instruction Multiple Data)
–Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
»Flexible
»Use off-the-shelf micros
MIMD is the current winner: the major design emphasis is on MIMD machines with <= 128 processors
Communication Models

Shared Memory
–Processors communicate through a shared address space
–Easy on small-scale machines
–Advantages:
»Model of choice for uniprocessors and small-scale MPs
»Ease of programming
»Lower latency
»Easier to use hardware-controlled caching
Message Passing (see the sketch below)
–Processors have private memories and communicate via explicit messages and protocol software
–Advantages:
»Less hardware, easier to design and scale
»Focuses attention on costly non-local operations
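To make the message-passing model concrete, here is a minimal sketch in C using MPI; MPI is not mentioned in the lecture and is used here only as a familiar message-passing API:

```c
#include <mpi.h>
#include <stdio.h>

/* Two processes exchange a value via explicit messages: rank 0 sends,
 * rank 1 receives. There is no shared address space. */
int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```

The explicit send/receive pair is exactly the "costly non-local operation" the model forces the programmer to see.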
Symmetric Multiprocessor (SMP)

Memory: centralized, with uniform memory access time (“UMA”) and a bus interconnect
Examples: Sun Enterprise 5000, SGI Challenge, Intel SystemPro
Potential HW Coherency Solutions

Snooping Solution (Snoopy Bus):
–Send all requests for data to all processors
–Processors snoop to see if they have a copy and respond accordingly
–Requires broadcast, since caching information is at the processors
–Works well with a bus (natural broadcast medium)
–Dominates for small-scale machines (most of the market)
Directory-Based Schemes (discussed later):
–Keep track of what is being shared in one (logically) centralized place
–Distributed memory => distributed directory for scalability (avoids bottlenecks)
–Send point-to-point requests to processors via the network
–Scales better than snooping
–Actually existed BEFORE snooping-based schemes
A Basic Snoopy Protocol

Invalidation protocol, write-back cache
Each block of memory is in one state:
–Clean in all caches and up-to-date in memory (Shared)
–OR Dirty in exactly one cache (Exclusive)
–OR Not in any caches
Each cache block is in one state (track these):
–Shared: block can be read
–OR Exclusive: cache has the only copy, it is writeable, and dirty
–OR Invalid: block contains no data
Read misses: cause all caches to snoop the bus
Writes to a clean line are treated as misses
Snoopy-Cache State Machine

One state machine per cache block, driven by CPU requests and by bus requests. States: Invalid, Shared (read only), Exclusive (read/write).

CPU-side transitions:
–Invalid -> Shared: CPU read; place read miss on bus
–Invalid -> Exclusive: CPU write; place write miss on bus
–Shared -> Shared: CPU read hit; or CPU read miss (place read miss on bus)
–Shared -> Exclusive: CPU write; place write miss on bus
–Exclusive -> Exclusive: CPU read hit or CPU write hit; or CPU write miss (write back cache block, place write miss on bus)
–Exclusive -> Shared: CPU read miss; write back block, place read miss on bus
Bus-side transitions:
–Shared -> Invalid: write miss for this block
–Exclusive -> Invalid: write miss for this block; write back block (abort memory access)
–Exclusive -> Shared: read miss for this block; write back block (abort memory access)
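A compact C sketch of this write-invalidate state machine; bus actions (place miss on bus, write back) appear only as comments, so this mirrors the diagram rather than implementing a full protocol:

```c
typedef enum { INVALID, SHARED, EXCLUSIVE } block_state_t;
typedef enum { CPU_READ, CPU_WRITE,
               BUS_READ_MISS, BUS_WRITE_MISS } event_t;

/* tag_match distinguishes a CPU hit from a miss to the same line. */
block_state_t next_state(block_state_t s, event_t e, int tag_match)
{
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;    /* place read miss on bus  */
        if (e == CPU_WRITE) return EXCLUSIVE; /* place write miss on bus */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE)      return EXCLUSIVE; /* place write miss on bus */
        if (e == BUS_WRITE_MISS) return INVALID;   /* another cache writes    */
        return SHARED;   /* CPU read hit, or read miss (read miss on bus) */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  return SHARED;  /* write back, abort memory */
        if (e == BUS_WRITE_MISS) return INVALID; /* write back, abort memory */
        if (e == CPU_READ && !tag_match)
            return SHARED;  /* CPU read miss: write back, read miss on bus */
        return EXCLUSIVE;   /* hits; CPU write miss: write back + write miss */
    }
    return s;
}
```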
Larger MPs

Separate memory per processor, but sharing the same address space – Distributed Shared Memory (DSM)
Provides the shared-memory paradigm with scalability
Local or remote access via the memory management unit (TLB) – all TLBs map to the same address space
Access to remote memory goes through the network, called the Interconnection Network (IN)
Access to local memory takes less time than access to remote memory
–Keep frequently used programs and data in local memory? A good memory-allocation problem
Access to different remote memories takes different times depending on where they are located – Non-Uniform Memory Access (NUMA) machines
Distributed Directory MPs
CC-NUMA Directory Protocol

No bus, and we don’t want to broadcast:
–the interconnect is no longer a single arbitration point
–all messages have explicit responses
Terms (typically 3 processors involved):
–Local node: where a request originates
–Home node: where the memory location of an address resides
–Remote node: has a copy of the cache block, whether exclusive or shared => cache-to-cache transfer
Q: How is a read/write done when the block is in the invalid, shared, or exclusive state? How much time does a read operation take?
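A hedged sketch of what the home node's directory state might look like in C; the field names, the 64-node bit vector, and the handler are illustrative assumptions, not the lecture's protocol:

```c
#include <stdint.h>

typedef enum { UNCACHED, SHARED_DIR, EXCLUSIVE_DIR } dir_state_t;

typedef struct {
    dir_state_t state;    /* who may hold the block           */
    uint64_t    sharers;  /* bit i set => node i holds a copy */
} dir_entry_t;

/* Home node handling a read miss from node 'local' via a point-to-point
 * message (no broadcast): if the block is EXCLUSIVE elsewhere, the home
 * must first get the data from the remote owner (cache-to-cache
 * transfer), which is one reason remote reads take extra time. */
void handle_read_miss(dir_entry_t *d, int local)
{
    if (d->state == EXCLUSIVE_DIR) {
        /* fetch data from the remote owner, then downgrade to shared */
        d->state = SHARED_DIR;
    } else if (d->state == UNCACHED) {
        d->state = SHARED_DIR;
    }
    d->sharers |= (1ull << local);   /* record the new sharer */
    /* reply to 'local' with the data (message send not shown) */
}
```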
Interprocessor Communication Time

[Figure: timeline of a message from sender to receiver, showing sender overhead (processor busy), transmission time (size ÷ bandwidth) at each end, time of flight, transport latency, and receiver overhead (processor busy).]

Total Latency = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead

Does the BW calculation include the header/trailer?
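A worked example of the total-latency formula; all numbers are illustrative assumptions:

```c
#include <stdio.h>

int main(void)
{
    double sender_ovhd    = 2e-6;   /* 2 us of processor time   */
    double time_of_flight = 1e-6;   /* 1 us through the network */
    double msg_size       = 1024;   /* bytes                    */
    double bandwidth      = 100e6;  /* 100 MB/s                 */
    double recv_ovhd      = 3e-6;   /* 3 us                     */

    double total = sender_ovhd + time_of_flight
                 + msg_size / bandwidth + recv_ovhd;

    /* 2 + 1 + 10.24 + 3 us = 16.24 us: transmission dominates here,
     * but for small messages the fixed overheads dominate instead. */
    printf("total latency = %.2f us\n", total * 1e6);
    return 0;
}
```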
Static Interconnection Networks

Examples: Intel Paragon (2D mesh), SGI Origin (hypercube), Cray T3E (3D mesh)

Properties of a hypercube:
Source S = (s_{n-1} s_{n-2} ... s_i ... s_2 s_1 s_0), destination D = (d_{n-1} d_{n-2} ... d_i ... d_2 d_1 d_0)
E-cube routing: for i = 0 to n-1, compare s_i and d_i, and route along dimension i if they differ
Distance = Hamming distance between S and D = the number of dimensions in which S and D differ
Diameter = maximum distance = n = log2(N) = dimension of the hypercube
Number of alternate paths = n
Fault tolerance = (n-1) = O(log2 N)

[Figure: 2D grid, 3D cube, and 2D torus topologies.]
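E-cube routing translates directly into code: XOR the source and destination labels, then fix the differing bits in ascending dimension order. A minimal sketch in C (node labels are plain integers):

```c
#include <stdio.h>

/* Route from src to dst on an n-dimensional hypercube using E-cube
 * routing; the set bits of src ^ dst are the dimensions to traverse. */
void ecube_route(unsigned src, unsigned dst, int n)
{
    unsigned node = src;
    unsigned diff = src ^ dst;

    for (int i = 0; i < n; i++) {
        if (diff & (1u << i)) {            /* s_i != d_i            */
            node ^= (1u << i);             /* hop along dimension i */
            printf("hop on dim %d -> node %u\n", i, node);
        }
    }
    /* The hop count equals the Hamming distance between src and dst. */
}

int main(void)
{
    ecube_route(0u /* 000 */, 5u /* 101 */, 3);  /* 2 hops: dims 0 and 2 */
    return 0;
}
```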
Dynamic Network – Crossbar Switch

Design complexity is O(N^2) for an N x N crossbar. Why? Each of the N inputs must be connectable to each of the N outputs, which requires N^2 crosspoint switches.
Multistage Interconnection Networks

Example: the Omega network with self-routing
Complexity: O(N log2 N)
Conflicts are possible and bandwidth is lower than a crossbar’s, but the network is cost-effective
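A hedged C sketch of Omega-network self-routing: before each of the log2(N) stages the label undergoes a perfect shuffle (rotate left), and each switch uses one destination bit, MSB first, to select its upper (0) or lower (1) output. This follows the standard construction; port-numbering details vary by presentation:

```c
#include <stdio.h>

/* Self-route through an Omega network with N = 2^n ports. */
void omega_route(unsigned src, unsigned dst, int n)
{
    unsigned node = src;
    unsigned mask = (1u << n) - 1;

    for (int stage = 0; stage < n; stage++) {
        /* perfect shuffle: rotate the n-bit label left by one */
        node = ((node << 1) | (node >> (n - 1))) & mask;
        /* switch output chosen by the destination bit, MSB first */
        unsigned bit = (dst >> (n - 1 - stage)) & 1u;
        node = (node & ~1u) | bit;
        printf("after stage %d: at port %u\n", stage, node);
    }
    /* node now equals dst */
}

int main(void)
{
    omega_route(2u, 5u, 3);   /* 8-port Omega network, 3 stages */
    return 0;
}
```

Because every message self-routes on the same destination bits, two messages can demand the same switch output in some stage; that is the conflict that gives the Omega network less bandwidth than a crossbar.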
Switching Techniques

Circuit switching: a control message is sent from source to destination and a path is reserved; communication starts; the path is released when communication is complete
Store-and-forward policy (packet switching): each switch waits for the full packet to arrive before sending it to the next switch (good for WANs)
Cut-through routing and wormhole routing: the switch examines the header, decides where to send the message, and starts forwarding it immediately
–In wormhole routing, when the head of the message is blocked, the message stays strung out over the network, potentially blocking other messages (needs to buffer only the piece of the packet that is sent between switches); the CM-5 uses it, with each switch buffer being 4 bits per port
–Cut-through routing lets the tail continue when the head is blocked, storing the whole message in an intermediate switch (requires a buffer large enough to hold the largest packet)
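The latency difference between the two policies is easy to see numerically: store-and-forward pays the full packet transmission time at every hop, while cut-through and wormhole routing pay only the header time per hop and pipeline the body behind it. A small comparison in C with assumed numbers (contention and overheads ignored):

```c
#include <stdio.h>

int main(void)
{
    double packet = 1024;   /* bytes            */
    double header = 8;      /* bytes            */
    double bw     = 100e6;  /* bytes/second     */
    int    hops   = 4;      /* switches crossed */

    /* store-and-forward: the whole packet crosses every hop serially */
    double saf = hops * (packet / bw);

    /* cut-through/wormhole: only the header pays the per-hop cost */
    double ct = hops * (header / bw) + packet / bw;

    printf("store-and-forward: %.2f us\n", saf * 1e6);  /* 40.96 us */
    printf("cut-through:       %.2f us\n", ct  * 1e6);  /* 10.56 us */
    return 0;
}
```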