Interconnection Network and Prefetching

Interconnection Network and Prefetching (Lecture 12, CS 213; slides adapted from CS258 S99)

Origin2000 System Overview
- Single 16-by-11-inch PCB per node
- Directory state in the same or separate DRAMs, accessed in parallel
- Up to 512 nodes (1024 processors)
- With the 195 MHz R10K processor, peak 390 MFLOPS or 780 MIPS per processor
- Peak SysAD bus bandwidth is 780 MB/s, as is Hub-to-memory bandwidth
- Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)

Origin Network
- Each router has six pairs of 1.56 GB/s unidirectional links: two to nodes, four to other routers
- Latency: 41 ns pin to pin across a router
- Flexible cables up to 3 ft long
- Four "virtual channels": request, reply, and two for priority or I/O

Origin I/O
- The Xbow is an 8-port crossbar connecting two Hubs (nodes) to six I/O cards
- Similar to the router, but simpler, so it can hold 8 ports
- Except for graphics, most devices connect through a bridge and bus; bandwidth can be reserved for things like video or real-time traffic
- Global I/O space: any processor can access any I/O device, through uncached memory operations to I/O space or coherent DMA, and any I/O device can write to or read from any memory (communication goes through the routers)

Case Study: Cray T3D
- Request information is built up in a "shell" of support circuitry around the processor
- Remote memory operations are encoded in the address

Case Study: NOW
- A general-purpose processor embedded in the NIC implements VIA, discussed earlier

Reducing Communication Cost
- Reducing effective latency
- Avoiding latency
- Tolerating latency
- Communication latency vs. synchronization latency vs. instruction latency
- Sender-initiated vs. receiver-initiated communication

Approaches to Latency Tolerance
- Block data transfer: make individual transfers larger
- Precommunication or prefetching: generate the communication before the point where it is actually needed
- Proceeding past an outstanding communication event: continue with independent work in the same thread while the event is outstanding (speculative execution)
- Multithreading: find independent work by switching the processor to another thread

How Much Can You Gain?
- Overlapping computation with all communication, or with one communication event at a time? Let c be the fraction of time spent in computation...
- Overlapping communication with communication: let L be the latency and r the run length of computation between messages; how many messages must be outstanding to hide L? (See the worked bound below.)
- What limits outstanding messages? Overhead, occupancy, bandwidth, network capacity
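A rough answer to the outstanding-messages question: each message in flight covers r cycles of computation, so fully hiding a latency of L cycles requires about L/r messages in flight. A worked version of the bound (the example numbers are illustrative, not from the slides):

    n \;\ge\; \left\lceil \frac{L}{r} \right\rceil,
    \qquad \text{e.g., } L = 200 \text{ cycles},\ r = 50 \text{ cycles}
    \;\Rightarrow\; n \ge \lceil 200/50 \rceil = 4.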

Block Data Transfer
- Message passing vs. shared address space
- Issues: fragmentation, the local coherence problem, the global coherence problem

Benefits Under CPS Scaling

Precommunication: Prefetching
- See Figs. 2 and 3 of Vanderwiel and Lilja's survey paper
- Instruction vs. data prefetch: instruction prefetch is easy, but data prefetch is difficult because applications use data unpredictably
- Software prefetch: explicit prefetch instructions are inserted in the program (prefetch scheduling). Deciding when and where to insert them is difficult, and is done either manually or through a complex compiler optimization process (a minimal sketch follows this list)
- Hardware prefetch: hardware built into the CPU prefetches data from memory automatically. When to prefetch and how much to prefetch are difficult to determine
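In practice, software prefetch instructions are usually reached through a compiler intrinsic rather than inserted by hand. A minimal sketch, assuming GCC or Clang, whose __builtin_prefetch intrinsic compiles to the target's prefetch instruction (or to nothing if the target has none); the prefetch distance is an illustrative tuning assumption, not a value from the slides:

    #include <stddef.h>

    #define PREFETCH_DIST 16   /* illustrative tuning assumption */

    /* Sum an array, issuing a software prefetch a fixed distance ahead. */
    double sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&a[i + PREFETCH_DIST]);  /* read prefetch, default locality */
            s += a[i];
        }
        return s;
    }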

Problems in Prefetching
- Unnecessary prefetches increase bus and memory traffic and degrade performance, both for data that is never used and for data that arrives too late
- Prefetched data may displace data in the processor's working set: the cache pollution problem. What is a stream buffer? (See the sketch after this list.)
- Prefetched data may be invalidated by other processors or by DMA
- Summary: prefetching is necessary, but how to prefetch, which data to prefetch, and when to prefetch are difficult questions that must be answered
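One classic answer to the pollution problem is Jouppi's stream buffer: sequentially prefetched blocks wait in a small FIFO beside the cache and enter the cache only when the processor actually references them. A minimal simulation sketch (the data-structure layout and names are illustrative, not from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    #define SB_DEPTH 4   /* stream buffers typically hold a few blocks */

    /* A stream buffer: a FIFO of sequentially prefetched block addresses. */
    typedef struct {
        uint64_t blocks[SB_DEPTH];
        int head;     /* index of the oldest (next expected) block */
        bool valid;
    } StreamBuffer;

    /* On a cache miss, (re)start the buffer at the next sequential blocks. */
    static void sb_refill(StreamBuffer *sb, uint64_t miss_block) {
        for (int i = 0; i < SB_DEPTH; i++)
            sb->blocks[i] = miss_block + 1 + i;   /* prefetches issued here */
        sb->head = 0;
        sb->valid = true;
    }

    /* Probe the buffer on a cache miss. On a head hit, the block moves into
       the cache and one more block is prefetched to keep the FIFO full. */
    static bool sb_probe(StreamBuffer *sb, uint64_t miss_block) {
        if (sb->valid && sb->blocks[sb->head] == miss_block) {
            uint64_t last = sb->blocks[(sb->head + SB_DEPTH - 1) % SB_DEPTH];
            sb->blocks[sb->head] = last + 1;      /* prefetch next sequential block */
            sb->head = (sb->head + 1) % SB_DEPTH;
            return true;                          /* supplied by the stream buffer */
        }
        sb_refill(sb, miss_block);                /* mismatch: restart the stream */
        return false;
    }

Unlike a victim buffer, which holds blocks recently evicted from the cache, a stream buffer holds blocks the cache has never referenced, so useless prefetches are discarded without polluting the cache.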

Software Prefetching
Ref: Vanderwiel and Lilja, ACM Computing Surveys, June 2000.
Consider the following example: a loop that calculates the inner product of two vectors a and b.

(a) No prefetching:

    for (i = 0; i < N; i++)
        ip = ip + a[i] * b[i];

Assuming a 4-word cache block, this code segment causes a cache miss every fourth iteration.

(b) Simple prefetching:

    for (i = 0; i < N; i++) {
        fetch(&a[i+1]);
        fetch(&b[i+1]);
        ip = ip + a[i] * b[i];
    }

Problem: why prefetch in every iteration? One fetch brings in four words, so each prefetched block covers four iterations' worth of data. The unnecessary prefetches degrade performance; prefetching should be done only every fourth iteration.

Software Prefetching, contd.
(c) Prefetching with loop unrolling: unroll the loop by a factor r, where r is the number of words per cache block.

    for (i = 0; i < N; i += 4) {
        fetch(&a[i+4]);
        fetch(&b[i+4]);
        ip = ip + a[i]   * b[i];
        ip = ip + a[i+1] * b[i+1];
        ip = ip + a[i+2] * b[i+2];
        ip = ip + a[i+3] * b[i+3];
    }

Problems: when i = 0, the prefetch targets the second block, so the first iteration still takes cache misses. Also, why prefetch during the last iteration? That data is never used.

Software Prefetching, contd.
(d) Software pipelining: peel the first prefetches into a prolog and the last computations into an epilog.

    /* Prolog: prefetch only */
    fetch(&ip);
    fetch(&a[0]);
    fetch(&b[0]);

    /* Main loop: prefetch and compute */
    for (i = 0; i < N-4; i += 4) {
        fetch(&a[i+4]);
        fetch(&b[i+4]);
        ip = ip + a[i]   * b[i];
        ip = ip + a[i+1] * b[i+1];
        ip = ip + a[i+2] * b[i+2];
        ip = ip + a[i+3] * b[i+3];
    }

    /* Epilog: computation only */
    for ( ; i < N; i++)
        ip = ip + a[i] * b[i];

Problem: the implicit assumption in the techniques above is that prefetching one iteration ahead hides the latency. What if a memory fetch takes more time? The prefetches should really be initiated X iterations ahead, where X = ceil(L/S), with L the average memory latency in cycles and S the time to compute one iteration.

Software Prefetching, contd.
Assuming L = 100 cycles and S = 45 cycles, X = ceil(100/45) = 3 unrolled iterations, i.e., 12 array elements ahead. Rewriting the final code (a compilable version follows this slide):

    fetch(&ip);

    /* Prolog: prefetching only */
    for (i = 0; i < 12; i += 4) {
        fetch(&a[i]);
        fetch(&b[i]);
    }

    /* Main loop: prefetching and computation */
    for (i = 0; i < N-12; i += 4) {
        fetch(&a[i+12]);
        fetch(&b[i+12]);
        ip = ip + a[i]   * b[i];
        ip = ip + a[i+1] * b[i+1];
        ip = ip + a[i+2] * b[i+2];
        ip = ip + a[i+3] * b[i+3];
    }

    /* Epilog: computation only */
    for ( ; i < N; i++)
        ip = ip + a[i] * b[i];
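Putting the final schedule together as self-contained C, with the slides' fetch() mapped onto GCC/Clang's __builtin_prefetch (that mapping, the element type, and the function name are assumptions for illustration):

    #include <stddef.h>

    #define fetch(p) __builtin_prefetch((p))  /* assumed mapping of the slides' fetch() */
    #define DIST 12   /* prefetch distance: ceil(100/45) = 3 unrolled iterations of 4 words */

    double inner_product(const double *a, const double *b, size_t N) {
        double ip = 0.0;
        size_t i;

        for (i = 0; i < DIST && i < N; i += 4) {   /* prolog: prefetch only */
            fetch(&a[i]);
            fetch(&b[i]);
        }
        for (i = 0; i + DIST < N; i += 4) {        /* main loop: prefetch and compute */
            fetch(&a[i + DIST]);
            fetch(&b[i + DIST]);
            ip += a[i]   * b[i];
            ip += a[i+1] * b[i+1];
            ip += a[i+2] * b[i+2];
            ip += a[i+3] * b[i+3];
        }
        for ( ; i < N; i++)                        /* epilog: compute only */
            ip += a[i] * b[i];
        return ip;
    }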

Hardware Prefetching
Sequential prefetching:
(1) One-block lookahead (OBL): prefetch block b+1 (or b+x under high memory latency or strided computation) when block b is accessed. This differs from simply doubling the block size, because the small block remains the unit of replacement and coherence actions.
(a) Fetch-on-miss: also fetch the next block whenever a block misses => a sequential stream still misses on every other block!
(b) Tagged prefetch: a prefetched block is tagged with a "1". Whenever that block is touched by the CPU, its tag is changed to "0" and the next block is prefetched and tagged "1". No misses unless there is a break in the sequentiality of data use. (A sketch follows this slide.)
(2) Degree of prefetching: fetch K blocks at a time, for K > 1.
(3) Adaptive prefetching: learn from the data-use pattern and adjust the value of K.
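A minimal simulation sketch of tagged prefetching over a toy direct-mapped cache (the cache model, sizes, and names are illustrative assumptions, not from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    #define SETS 256   /* toy direct-mapped cache, one block per set */

    typedef struct {
        uint64_t block;   /* block address currently cached */
        bool valid;
        bool tag;         /* tagged-prefetch bit: set while prefetched but untouched */
    } Line;

    static Line cache[SETS];

    static void install(uint64_t block, bool prefetched) {
        Line *l = &cache[block % SETS];
        l->block = block;
        l->valid = true;
        l->tag = prefetched;
    }

    /* Access block b; returns true on a hit. A prefetch of b+1 is triggered
       by a demand miss and by the first touch of a prefetched block. */
    static bool access_block(uint64_t b) {
        Line *l = &cache[b % SETS];
        bool hit = l->valid && l->block == b;
        if (hit && l->tag) {
            l->tag = false;        /* first demand reference to a prefetched block */
            install(b + 1, true);  /* keep the sequential stream one block ahead */
        } else if (!hit) {
            install(b, false);     /* demand fetch */
            install(b + 1, true);  /* one-block-lookahead prefetch, tagged */
        }
        return hit;
    }

On a long sequential scan this scheme misses only on the first block and then stays one block ahead, while fetch-on-miss still misses on every other block.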

Prefetching Problems
Not all data is accessed sequentially. How do we avoid prefetching unnecessary data?
(1) Strided accesses in some scientific computations
(2) Linked-list data: how can it be detected and prefetched?
(3) Predicting accesses from program behavior, e.g., Mowry's software data prefetching through compiler analysis and prediction, the Reference Prediction Table (RPT) of Chen and Baer, and Markov-model prefetchers (an RPT-style sketch follows this slide)
How do we limit cache pollution? Jouppi's stream buffer technique is extremely helpful. What is a stream buffer compared to a victim buffer? (See the stream buffer sketch earlier.)
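A minimal sketch of stride detection in the spirit of Chen and Baer's Reference Prediction Table: each load PC gets an entry recording its last address and stride, and a stride observed twice in a row triggers a prefetch. The two-state confirmation here simplifies their full state machine, and all names and sizes are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define RPT_ENTRIES 64

    typedef struct {
        uint64_t pc;         /* load instruction address */
        uint64_t last_addr;  /* data address of its previous execution */
        int64_t  stride;     /* last observed stride */
        bool     confirmed;  /* same nonzero stride seen twice in a row */
    } RptEntry;

    static RptEntry rpt[RPT_ENTRIES];

    /* Called on every load; returns a prefetch address, or 0 for none. */
    static uint64_t rpt_access(uint64_t pc, uint64_t addr) {
        RptEntry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
        if (e->pc != pc) {                  /* new load: (re)allocate the entry */
            e->pc = pc;
            e->last_addr = addr;
            e->stride = 0;
            e->confirmed = false;
            return 0;
        }
        int64_t s = (int64_t)(addr - e->last_addr);
        e->confirmed = (s != 0 && s == e->stride);
        e->stride = s;
        e->last_addr = addr;
        return e->confirmed ? addr + (uint64_t)s : 0;  /* one stride ahead */
    }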

Prefetching in Multiprocessors
- Memory access latency is large, particularly in CC-NUMA machines, so prefetching is even more useful
- Prefetches increase memory and interconnection-network traffic
- Prefetching shared data causes additional coherence traffic
- Invalidation misses are not predictable at compile time
- Dynamic task scheduling and migration may create further problems for prefetching