1
Interconnection Network and Prefetching
Lecture 12 CS 213 CS258 S99
2
Origin2000 System Overview
Single 16"-by-11" PCB
Directory state in the same or separate DRAMs, accessed in parallel
Up to 512 nodes (1024 processors)
With a 195 MHz R10K processor, peak 390 MFLOPS or 780 MIPS per processor
Peak SysAD bus bandwidth is 780 MB/s, and so is Hub-to-memory bandwidth
Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)
3
Origin Network
Each router has six pairs of 1.56 GB/s unidirectional links
Two to nodes, four to other routers
Latency: 41 ns pin-to-pin across a router
Flexible cables up to 3 ft long
Four "virtual channels": request, reply, and two others for priority or I/O
4
Origin I/O
Xbow is an 8-port crossbar, connecting two Hubs (nodes) to six I/O cards
Similar to the router, but simpler, so it can hold 8 ports
Except for graphics, most devices connect through a bridge and bus
Can reserve bandwidth for things like video or real-time
Global I/O space: any processor can access any I/O device, through uncached memory ops to I/O space or coherent DMA
Any I/O device can write to or read from any memory (communication through the routers)
5
Case Study: Cray T3D
Build up info in the 'shell'
Remote memory operations encoded in the address
6
Case Study: NOW
General-purpose processor embedded in the NIC to implement VIA, discussed earlier
7
Reducing Communication Cost
Reducing effective latency
Avoiding latency
Tolerating latency
Communication latency vs. synchronization latency vs. instruction latency
Sender-initiated vs. receiver-initiated communication
8
Approaches to Latency Tolerance
Large block transfer – make individual transfers larger
Precommunication or prefetching – generate communication before the point where it is actually needed
Proceeding past an outstanding communication event – continue with independent work in the same thread while the event is outstanding; speculative execution
Multithreading – find independent work by switching the processor to another thread
9
How much can you gain?
Overlapping computation with all communication
With one communication event at a time?
Overlapping communication with communication
Let C be the fraction of time spent in computation...
Let L be the latency and r the run length of computation between messages; how many messages must be outstanding to hide L?
What limits outstanding messages?
Overhead, occupancy, bandwidth, network capacity?
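A rough worked answer (my addition, using the same ceiling[·] rule the later software-prefetching slides use): to hide a latency of L cycles with r cycles of computation between messages, the messages must be pipelined so that roughly
    n = ceiling[L / r]
of them are outstanding at once; for example, L = 100 cycles and r = 45 cycles gives n = 3. This only works if overhead, occupancy, bandwidth, and network capacity – the limits listed above – do not saturate first.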
10
Block Data Transfer
Message passing: fragmentation
Shared address space: local coherence problem, global coherence problem
11
Benefits Under CPS Scaling
12
Precommunication: Prefetching – explain Figs. 2 and 3 of Lilja's paper
Instruction vs. data prefetch: instruction prefetch is easy, but data prefetch is difficult because how data is used varies unpredictably across applications.
Software prefetch: explicit prefetch instructions are inserted into the program (prefetch scheduling). Determining when and where to put these instructions is difficult – it is done either manually or through a difficult compiler optimization process.
Hardware prefetch: hardware built into the CPU prefetches data from memory automatically. When to prefetch and how much to prefetch are difficult to determine.
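As a concrete sketch of what explicit prefetch insertion looks like in practice (assuming a GCC/Clang-style compiler, whose __builtin_prefetch intrinsic maps to the target's data-prefetch instruction; the function name and prefetch distance of 16 elements are illustrative, not from the slides), here is a per-iteration version of the dot-product loop used on the following slides:

    /* Sketch: explicit software prefetching via a compiler intrinsic (GCC/Clang).
       Second argument 0 = prefetch for read, third 3 = high temporal locality.
       On common targets a data prefetch of an out-of-range address does not fault. */
    double dot(const double *a, const double *b, int n) {
        double ip = 0.0;
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 16], 0, 3);
            __builtin_prefetch(&b[i + 16], 0, 3);
            ip = ip + a[i] * b[i];
        }
        return ip;
    }

As the next slides point out, issuing a prefetch on every iteration is wasteful once a cache block holds several elements; the scheduling refinements that follow (loop unrolling, software pipelining, prefetch distance) apply unchanged to this intrinsic.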
13
Problems in Prefetching
Unnecessary prefetched data increases bus and memory traffic and degrades performance – both for data that is never used and for data that arrives too late.
Prefetched data may replace data in the processor's working set – the cache pollution problem. (What is a stream buffer?)
Prefetched data may be invalidated by other processors or by DMA before it is used.
Summary: prefetching is necessary, but how to prefetch, which data to prefetch, and when to prefetch are difficult questions that must be answered.
14
Software Prefetching (Ref: VanderWiel and Lilja, ACM Computing Surveys, June 2000)
Consider the following example: this loop calculates the inner product of two vectors a and b.

(a) No prefetching:
for (i = 0; i < N; i++)
    ip = ip + a[i] * b[i];
Assuming a 4-word cache block, this code segment causes a cache miss (per array) every fourth iteration.

(b) Simple prefetching:
for (i = 0; i < N; i++) {
    fetch(&a[i+1]);
    fetch(&b[i+1]);
    ip = ip + a[i] * b[i];
}
Problem: why prefetch in every iteration? One fetch brings in four words, so a[i+1] and b[i+1] are already in the cache for three of every four iterations. These unnecessary prefetches degrade performance; prefetching should be done only every fourth iteration.
15
Software Prefetching Contd.
(c) Prefetching with loop unrolling: unroll the loop by a factor r, where r is the number of words fetched per cache block.
for (i = 0; i < N; i += 4) {
    fetch(&a[i+4]);
    fetch(&b[i+4]);
    ip = ip + a[i]   * b[i];
    ip = ip + a[i+1] * b[i+1];
    ip = ip + a[i+2] * b[i+2];
    ip = ip + a[i+3] * b[i+3];
}
Problems: when i = 0, the prefetch is for the second block, so the first iteration still takes cache misses. Also, why prefetch during the last iteration? That data is never needed.
16
Software Prefetching Contd.
(d) Software pipelining:

fetch(&ip);      |
fetch(&a[0]);    |  => Prolog
fetch(&b[0]);    |

for (i = 0; i < N-4; i += 4) {
    fetch(&a[i+4]);
    fetch(&b[i+4]);
    ip = ip + a[i]   * b[i];
    ip = ip + a[i+1] * b[i+1];     => Main loop
    ip = ip + a[i+2] * b[i+2];
    ip = ip + a[i+3] * b[i+3];
}

for ( ; i < N; i++)          |
    ip = ip + a[i] * b[i];   |  => Epilog

Problem: the implicit assumption in the techniques above is that prefetching one iteration ahead hides the latency. What if a memory fetch takes longer than that? The prefetches should really be initiated X iterations ahead, where X = ceiling[L/S], with L = average memory latency in cycles and S = time to compute one (unrolled) iteration.
17
Software Prefetching Contd
Assuming L = 100 cycles and S = 45 cycles, X = ceiling[100/45] = 3 unrolled iterations, i.e., prefetch 12 array elements ahead. Rewriting the final code:

fetch(&ip);
for (i = 0; i < 12; i += 4) {   |
    fetch(&a[i]);               |  => Prolog – prefetching only
    fetch(&b[i]);               |
}

for (i = 0; i < N-12; i += 4) {
    fetch(&a[i+12]);
    fetch(&b[i+12]);
    ip = ip + a[i]   * b[i];
    ip = ip + a[i+1] * b[i+1];     => Main loop – prefetching and computation
    ip = ip + a[i+2] * b[i+2];
    ip = ip + a[i+3] * b[i+3];
}

for ( ; i < N; i++)          |
    ip = ip + a[i] * b[i];   |  => Epilog – computation only
18
Hardware Prefetching
Sequential prefetching:
(1) One-block lookahead (OBL) – prefetch block b+1 (or b+x for high memory latency or strided computations) when block b is accessed. This differs from doubling the block size, because the small block remains the unit of replacement and coherence actions.
(a) Fetch-on-miss – also fetch the next block whenever there is a cache miss on a block => a sequential stream still misses on every other block!
(b) Tagged prefetch – a prefetched block is tagged with a "1". When that block is first touched by the CPU, its tag is changed to "0" and the next block is prefetched and tagged "1". No misses unless there is a break in the sequentiality of data use. (See the sketch below.)
(2) Degree of prefetching – fetch K blocks at a time, for K > 1.
(3) Adaptive prefetching – learn from the observed data use and adjust the value of K.
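A minimal sketch of the tagged-prefetch bookkeeping, assuming a toy direct-mapped cache model; every name and structure here is illustrative, not taken from any real controller:

    /* Toy model of tagged one-block-lookahead prefetch.
       prefetch_tag = 1 means the block was brought in by a prefetch
       and has not yet been referenced by the CPU. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 1024

    struct line { uint64_t blk; bool valid; bool prefetch_tag; };
    static struct line cache[NUM_SETS];

    static void fill(uint64_t blk, bool tag) {      /* fetch a block into the cache */
        struct line *e = &cache[blk % NUM_SETS];
        e->blk = blk; e->valid = true; e->prefetch_tag = tag;
    }

    void access_block(uint64_t blk) {               /* called on every CPU reference */
        struct line *e = &cache[blk % NUM_SETS];
        if (!e->valid || e->blk != blk) {
            fill(blk, false);        /* demand miss: fetch b ...            */
            fill(blk + 1, true);     /* ... and prefetch b+1, tagged "1"    */
        } else if (e->prefetch_tag) {
            e->prefetch_tag = false; /* first touch of a prefetched block   */
            fill(blk + 1, true);     /* keep the sequential stream running  */
        }
    }

On a purely sequential scan this issues exactly one prefetch per block touched and, after the first block, never misses – the behavior the slide describes.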
19
Prefetching Problems
Not all data are accessed sequentially. How can we avoid prefetching unnecessary data?
(1) Strided accesses in some scientific computations
(2) Linked-list data – how to detect and prefetch it?
(3) Predicting accesses from program behavior – e.g., Mowry's software data prefetching through compiler analysis and prediction, the hardware Reference Prediction Table (RPT) of Chen and Baer, Markov-model prefetching
How can we limit cache pollution? Jouppi's stream buffer technique is extremely helpful. What is a stream buffer compared to a victim buffer? (See the sketch below.)
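For the last question, a rough sketch under simplifying assumptions (one stream buffer, FIFO depth 4; all names illustrative): a stream buffer holds sequentially prefetched blocks that have not yet entered the cache, so the prefetches cannot pollute the cache, whereas a victim buffer holds blocks recently evicted from the cache. On a cache miss the head of the stream buffer is checked; a head hit supplies the block and a new sequential prefetch is issued into the tail, while a head miss flushes the buffer and starts a new stream.

    /* Toy model of a single 4-entry stream buffer.
       Entries are block addresses prefetched but not yet moved into the cache. */
    #include <stdbool.h>
    #include <stdint.h>

    #define DEPTH 4

    struct stream_buffer {
        uint64_t blk[DEPTH];   /* circular FIFO of prefetched block addresses */
        int      head, count;
    };

    static void refill(struct stream_buffer *sb, uint64_t start) {
        for (int i = 0; i < DEPTH; i++)
            sb->blk[i] = start + i;   /* issue prefetches for start .. start+3 */
        sb->head = 0;
        sb->count = DEPTH;
    }

    /* Called on a cache miss for block blk. Returns true if the stream
       buffer supplies the block (which is then moved into the cache). */
    bool stream_buffer_lookup(struct stream_buffer *sb, uint64_t blk) {
        if (sb->count > 0 && sb->blk[sb->head] == blk) {
            /* Head hit: consume the entry, prefetch the next sequential
               block into the slot just freed (the new tail). */
            uint64_t tail = sb->blk[(sb->head + sb->count - 1) % DEPTH];
            sb->blk[sb->head] = tail + 1;
            sb->head = (sb->head + 1) % DEPTH;
            return true;
        }
        refill(sb, blk + 1);   /* head miss: flush and start a new stream */
        return false;
    }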
20
Prefetching in Multiprocessors
Memory access latency is large, particularly in CC-NUMA machines, so prefetching is even more useful.
Prefetches increase memory and interconnection-network traffic.
Prefetching shared data causes additional coherence traffic.
Invalidation misses are not predictable at compile time.
Dynamic task scheduling and migration may create further problems for prefetching.