Published by Deon Ivy. Modified over 9 years ago.
1
Optimizing compilers
Managing Cache
Bercovici Sivan
2
Overview
Motivation
Cache structure
Important observations
Techniques covered in the book
  – Loop interchange
  – Blocking
  – Unaligned data
  – Prefetching
3
Overview (cont.)
Issues and techniques not covered in the book
  – Instruction cache
  – Dynamic profiling-driven cache optimization
4
Motivation
Shorten fetch time
[Figure: Processor-DRAM performance gap (grows 50% / year). Performance on a log scale from 1 to 1000 versus time, 1980 to 2000, with one curve for CPU and one for DRAM.]
5
Motivation (cont.)
Solution: cache
  – Faster memory
Software problem
  – Maximize cache performance
6
Memory structure
Hierarchical
    Registers: 100s of bytes, <10s ns
    Cache: KBytes, 10-100 ns
    Memory: MBytes, 100 ns - 1 us
    Disk
Units moved between levels: instructions, blocks, pages
Levels further from the processor are larger; levels closer to it are faster
7
Cache structure
Specialized
  – Instruction
  – Data
  – What about a stack cache?
8
Cache structure (cont.)
Organized into blocks
  – Multiple machine words
Maps entire memory
Most use LRU replacement strategy
[Figure: address fields (tag, line, offset), tag array, and data array; a tag comparison produces the hit signal and the data.]
9
Observations
Temporal locality
  – If a variable is referenced, it tends to be referenced again soon
Spatial locality
  – If a variable is referenced, nearby variables tend to be referenced soon
10
Observations (cont.)
Temporal locality example
  – Variables used inside a loop
Spatial locality example
  – Iterating over array items
11
How the cache exploits these observations
Temporal locality
  – The cache attempts to keep recently accessed data
Spatial locality
  – The cache brings blocks of data from memory
12
Loop interchange
13
Example
    DO I = 1, M
      DO J = 1, N
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO
I iterates over rows, J over columns
Fortran arrays are column-major
  – The first column is stored first, then the second column, and so on
  – C is the other way around (row-major)
No spatial reuse
14
Example - visually
    DO I = 1, M
      DO J = 1, N
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO
[Figure: cache mapping of arrays A and B, with cache misses marked.]
15
Example analysis
    DO I = 1, M
      DO J = 1, N
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO
2*N*M misses
  – Because the innermost loop iterates over the non-contiguous dimension
16
Example fixed (loop interchange)
    DO J = 1, N
      DO I = 1, M
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO
We process column by column: spatial reuse
17
Example - visually
    DO J = 1, N
      DO I = 1, M
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO
[Figure: cache mapping of arrays A and B.]
18
Analyzing the fixed example
    DO J = 1, N
      DO I = 1, M
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO
2*N*M/b misses
  – b is the cache-block size, in array elements
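The two miss counts can be checked with a small simulation. This is only a sketch under stated assumptions: a fully associative LRU cache, arrays A and B laid out column-major and aligned on block boundaries, and one read of A and one of B per iteration. The names `count_misses` and `trace` are hypothetical helpers, not part of the slides.

```python
from collections import OrderedDict

def count_misses(addresses, block_size, num_lines):
    """Count misses of a fully associative LRU cache over a trace of element addresses."""
    cache = OrderedDict()                    # block number -> True, least recently used first
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)         # hit: mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)    # evict the least recently used block
    return misses

def trace(M, N, i_is_outer):
    """Element addresses touched by A(I,J) = A(I,J) + B(I,J) in column-major layout."""
    def addr(base, i, j):
        return base + (i - 1) + (j - 1) * M  # elements of one column are contiguous
    a_base, b_base = 0, M * N                # assumption: B is stored right after A
    if i_is_outer:
        order = [(i, j) for i in range(1, M + 1) for j in range(1, N + 1)]
    else:
        order = [(i, j) for j in range(1, N + 1) for i in range(1, M + 1)]
    out = []
    for i, j in order:
        out += [addr(a_base, i, j), addr(b_base, i, j)]
    return out

M, N, b, lines = 8, 8, 4, 4
misses_original = count_misses(trace(M, N, True), b, lines)      # I outer, J inner
misses_interchanged = count_misses(trace(M, N, False), b, lines)  # J outer, I inner
print(misses_original, misses_interchanged)
```

With these small sizes the original order misses on every access (2*N*M), while the interchanged order misses once per cache block (2*N*M/b).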
19
A harder example
    DO I = 1, N
      DO J = 1, M
        D(I) = D(I) + B(I,J)
      ENDDO
    ENDDO
Misses: NM for B, N/b for D
After interchange: NM/b for B, NM/b for D
When should we interchange?
  – When N/b + NM - 2NM/b > 0
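The interchange condition can be evaluated directly. The sketch below simply encodes the miss counts quoted on the slide; the function names are hypothetical. Multiplying the inequality by b/N gives M(b-2) + 1 > 0, so under this model interchange pays off whenever a block holds at least two elements.

```python
from fractions import Fraction

def misses_i_outer(N, M, b):
    # J innermost: B(I,J) strides across columns (N*M misses), D(I) is reused (N/b misses)
    return N * M + Fraction(N, b)

def misses_j_outer(N, M, b):
    # I innermost: B and D are both walked contiguously, and D is re-walked for every J
    return Fraction(2 * N * M, b)

def interchange_pays_off(N, M, b):
    # The slide's condition: N/b + NM - 2NM/b > 0
    return misses_i_outer(N, M, b) - misses_j_outer(N, M, b) > 0

print(interchange_pays_off(100, 50, 4))
```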
20
Loop interchange
Determine which loop should be innermost
  – Strive to increase locality
Heuristic approach
  – Compute a cost function for each loop
  – Order the loops: cheapest innermost, most expensive outermost
21
Cost assignment (N is the loop trip count, b the cache-block size)
Cost is 1 for references that do not depend on loop induction variables
Cost is N for references based on induction variables over a non-contiguous space
Cost is N/b for references based on induction variables over a contiguous space
Multiply the cost by the loop trip count if the reference varies with the loop index
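One possible reading of these rules, sketched in Python for the loop nest D(I) = D(I) + B(I,J). The function `loop_cost`, the representation of references as tuples of subscript names, and the trip counts are all assumptions made for illustration; D(I) is listed once so that the read and the write are not counted twice.

```python
def loop_cost(loop, trips, refs, b):
    """Estimated misses if `loop` is placed innermost.

    trips: dict mapping loop variable -> trip count
    refs:  list of subscript tuples, e.g. ("I", "J") for B(I,J); arrays are
           column-major, so the first subscript is the contiguous one.
    """
    total = 0.0
    for subs in refs:
        if loop not in subs:
            cost = 1.0                     # reference is invariant in this loop
        elif subs[0] == loop:
            cost = trips[loop] / b         # contiguous: one miss per cache block
        else:
            cost = float(trips[loop])      # non-contiguous: one miss per iteration
        for other, n in trips.items():
            if other != loop and other in subs:
                cost *= n                  # reference varies with the other loop
        total += cost
    return total

trips = {"I": 64, "J": 32}
refs = [("I",), ("I", "J")]                # D(I) counted once, then B(I,J)
costs = {v: loop_cost(v, trips, refs, b=4) for v in trips}
order = sorted(trips, key=costs.get, reverse=True)   # most expensive loop outermost
print(costs, order)
```

Under this reading the I loop is the cheapest and ends up innermost, which matches the interchange chosen in the slides.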
22
Loop interchange (cont.)
Special notes
  – Avoid overcounting references
      References that are available due to temporal reuse (from earlier iterations)
      References that fall in the same cache block as other references (in the same iteration)
  – Not all loop orders are possible, due to data dependences
      Find the legal permutation with the lowest cost
23
Blocking
24
Back to the example
    DO J = 1, N
      DO I = 1, M
        D(I) = D(I) + B(I,J)
      ENDDO
    ENDDO
2NM/b misses
25
Example - visually
    DO J = 1, N
      DO I = 1, M
        D(I) = D(I) + B(I,J)
      ENDDO
    ENDDO
[Figure: arrays D and B, showing the cache block size and the cache misses.]
26
Back to the example (cont.)
    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO
Work on smaller strips (of size S)
27
Example - visually
    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO
[Figure: arrays D and B, showing the cache block size and the second strip.]
28
Analysis
    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO
Cost of B does not change: NM/b
Cost of D drops due to reuse: N/b
  – No misses during iterations on J
Conclusion: (1 + 1/M) NM/b
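The effect of blocking can be reproduced with a small LRU simulation. It is a sketch under assumptions that are not in the slides: a fully associative cache with 8 lines, D and B aligned on block boundaries, B stored right after D, and a strip that fits in the cache together with the streaming blocks of B. `count_misses` and `trace` are hypothetical helper names.

```python
from collections import OrderedDict

def count_misses(addresses, block_size, num_lines):
    """Count misses of a fully associative LRU cache over a trace of element addresses."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)         # hit: mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)    # evict the least recently used block
    return misses

def trace(N, M, S=None):
    """Addresses touched by D(I) = D(I) + B(I,J); S=None means no strip mining."""
    d_base, b_base = 0, N                    # assumption: B follows D in memory
    out = []
    def touch(i, j):
        out.append(d_base + (i - 1))
        out.append(b_base + (i - 1) + (j - 1) * N)   # column-major B with N rows
    if S is None:
        for j in range(1, M + 1):            # J outer, I inner
            for i in range(1, N + 1):
                touch(i, j)
    else:
        for I in range(1, N + 1, S):         # by-strip loop
            for j in range(1, M + 1):
                for i2 in range(I, min(I + S - 1, N) + 1):
                    touch(i2, j)
    return out

N, M, b, lines, S = 64, 8, 4, 8, 8
unblocked = count_misses(trace(N, M), b, lines)
blocked = count_misses(trace(N, M, S), b, lines)
print(unblocked, blocked)
```

For these sizes the unblocked nest incurs 2NM/b misses and the blocked nest incurs NM/b + N/b, that is (1 + 1/M) NM/b.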
29
Unaligned data
    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO
What if B is not aligned on a cache-block boundary?
At most one additional miss for each sub-column iteration
  – Additional NM/S
Conclusion: (1 + 1/M + b/S) NM/b
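A small simulation illustrates the extra NM/S term. As a sketch, it assumes a fully associative LRU cache with 8 lines and shifts the start of B by a hypothetical `b_offset` elements so that each strip of a column straddles one extra cache block; none of these names or sizes come from the slides.

```python
from collections import OrderedDict

def count_misses(addresses, block_size, num_lines):
    """Count misses of a fully associative LRU cache over a trace of element addresses."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)         # hit: mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)    # evict the least recently used block
    return misses

def blocked_trace(N, M, S, b_offset):
    """Addresses touched by the strip-mined nest; B starts b_offset elements past D."""
    d_base, b_base = 0, N + b_offset
    out = []
    for I in range(1, N + 1, S):             # by-strip loop
        for j in range(1, M + 1):
            for i2 in range(I, min(I + S - 1, N) + 1):
                out.append(d_base + (i2 - 1))
                out.append(b_base + (i2 - 1) + (j - 1) * N)
    return out

N, M, b, lines, S = 64, 8, 4, 8, 8
aligned = count_misses(blocked_trace(N, M, S, b_offset=0), b, lines)
unaligned = count_misses(blocked_trace(N, M, S, b_offset=1), b, lines)
print(aligned, unaligned, unaligned - aligned)
```

Here the misaligned run pays exactly one extra miss per strip and per column, which is the NM/S upper bound stated above.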
30
Unaligned data - visually
    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO
[Figure: arrays D and B, showing the strip size, the cache alignment, and the additional miss.]
31
Unaligned data (cont.)
    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO
What can be done?
  – Enforce data alignment
  – Refine the loop-interchange score to include these misses as well
32
Blocking - legality
Splitting into strips: legal
Interchange: not always legal
    procedure StripMineAndInterchange (L, m, k, o, S)
      // L = {L1, L2, ..., Lm} is the loop nest to be transformed
      // Lk is the loop to be strip mined
      // Lo is the outer loop which is to be just inside the by-strip loop
      //   after interchange
      // S is the variable to use as strip size; its value must be positive
      let the header of Lk be DO I = L, N, D;
      split the loop into two loops, a by-strip loop:
          DO I = L, N, S*D
      and a within-strip loop:
          DO i = I, MIN(I+S*D-D, N), D
      around the loop body;
      interchange the by-strip loop to the position just outside of Lo;
    end StripMineAndInterchange
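The strip-mining step on its own can be checked by comparing the indices visited before and after the split. A minimal sketch, with hypothetical function names, that mirrors the by-strip and within-strip loop headers of the procedure (the upper bound of the within-strip loop is clamped with a minimum so the last strip does not run past N):

```python
def original_indices(lo, hi, step):
    """Indices visited by DO I = lo, hi, step (inclusive upper bound)."""
    return list(range(lo, hi + 1, step))

def strip_mined_indices(lo, hi, step, strip):
    """Indices visited after splitting into a by-strip and a within-strip loop."""
    visited = []
    for I in range(lo, hi + 1, strip * step):         # DO I = L, N, S*D
        upper = min(I + strip * step - step, hi)      # MIN(I+S*D-D, N)
        for i in range(I, upper + 1, step):           # DO i = I, upper, D
            visited.append(i)
    return visited

print(strip_mined_indices(1, 20, 3, 4))
```

Because strip mining visits the same indices in the same order, it is always legal; only the subsequent interchange needs a dependence check.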
33
Blocking – a harder example
    DO I = 1, N
      DO J = 1, M
        A(J+1) = (A(J) + A(J+1))/2
      ENDDO
    ENDDO
Due to dependences, the loops cannot be interchanged
[Figure: iteration space showing the statements and the dependences between them.]
34
Blocking – a closer look
    DO I = 1, N
      DO J = 1, M
        A(J+1) = (A(J) + A(J+1))/2
      ENDDO
    ENDDO
[Figure: two traversal orders of the iteration space. One is not computable due to dependences; the other performs badly due to low cache reuse.]
35
Harder example – skew it
    DO I = 1, N
      DO j = I, M+I-1
        A(j-I+2) = (A(j-I+1) + A(j-I+2))/2
      ENDDO
    ENDDO
Range of j for I = 1, 2, 3: 1..M, 2..M+1, 3..M+2
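Skewing only renames the inner index (j = J + I - 1), so the skewed nest must compute the same values as the original. A small check, using hypothetical function names and a Python list whose element 0 is unused so that indices match the Fortran code:

```python
def original(A, N, M):
    A = list(A)                        # copy; A[0] is unused padding
    for I in range(1, N + 1):
        for J in range(1, M + 1):
            A[J + 1] = (A[J] + A[J + 1]) / 2
    return A

def skewed(A, N, M):
    A = list(A)
    for I in range(1, N + 1):
        for j in range(I, M + I):      # j = I .. M+I-1, i.e. J = j - I + 1
            A[j - I + 2] = (A[j - I + 1] + A[j - I + 2]) / 2
    return A

N, M = 3, 5
start = [0.0] + [float(k) for k in range(1, M + 2)]   # A(1..M+1)
print(original(start, N, M) == skewed(start, N, M))
```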
36
Harder example – strip it
    DO I = 1, N
      DO j = I, M+I-1, S
        DO jj = j, MIN(j+S-1, M+I-1)
          A(jj-I+2) = (A(jj-I+1) + A(jj-I+2))/2
        ENDDO
      ENDDO
    ENDDO
37
Harder example – interchange loops
    DO j = 1, M+N-1, S
      DO I = MAX(1, j-M+1), MIN(j, N)
        DO jj = j, MIN(j+S-1, M+I-1)
          A(jj-I+2) = (A(jj-I+1) + A(jj-I+2))/2
        ENDDO
      ENDDO
    ENDDO
38
Harder example – comparison
Original: NM/b misses
Blocked: (M+N)*(1/b+1/S) misses
39
Triangular blocking
    DO I = 2, N
      DO J = 1, I-1
        A(I, J) = A(I, I) + A(J, J)
      ENDDO
    ENDDO
Explicitly:
    1..2
    1...3
    1....4
    1.....5
40
Triangular – strip it
    DO I = 2, N, K
      DO ii = I, I+K-1
        DO J = 1, ii-1
          A(ii, J) = A(ii, ii) + A(J, J)
        ENDDO
      ENDDO
    ENDDO
K-size strips
Nothing important changed yet
41
Triangular – transform!
    DO I = 2, N, K
      DO J = 1, I+K-1
        DO ii = MAX(J+1, I), I+K-1
          A(ii, J) = A(ii, ii) + A(J, J)
        ENDDO
      ENDDO
    ENDDO
Triangular loop interchange
Working on the K-strips
Preserving the correct triangular loop limits
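The triangular limits can be sanity-checked by enumerating the (I, J) pairs visited before and after the transformation. In this sketch the function names are hypothetical, and the clamp of the strip end to N is an added assumption (the slide's bounds implicitly assume the strips divide the range evenly).

```python
def original_pairs(N):
    """(I, J) pairs visited by DO I = 2, N / DO J = 1, I-1."""
    return [(I, J) for I in range(2, N + 1) for J in range(1, I)]

def blocked_pairs(N, K):
    """(ii, J) pairs visited after strip mining I and moving J outside ii."""
    pairs = []
    for I in range(2, N + 1, K):
        last = min(I + K - 1, N)                   # assumption: clamp the last strip to N
        for J in range(1, last + 1):               # DO J = 1, I+K-1
            for ii in range(max(J + 1, I), last + 1):   # DO ii = MAX(J+1, I), I+K-1
                pairs.append((ii, J))
    return pairs

N, K = 11, 3
print(sorted(blocked_pairs(N, K)) == sorted(original_pairs(N)))
```

The MAX(J+1, I) lower bound is what keeps each strip inside the triangle: without it, pairs with J >= ii would be visited.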
42
Blocking with parallelization
Problem: the dimension of parallelism is that of the sequential access
  – Solution: if multiple parallelization dimensions are available, avoid the stride-one dimension
False sharing
  – Data used by different processors is on the same cache line, but is not the exact same data
  – Solution: a language extension for expressing how data is divided among processors, with memory laid out accordingly
43
Prefetch
44
“…And I don’t want to miss a thing…” - Aerosmith, ’98 Optimization Seminar
Problematic misses:
  – Data used for the first time
  – Data reused in ways that cannot be predicted at compile time
    DO I = 1, N
      A(I) = B(LOC(I))
    ENDDO
45
Prefetch
Brings a line into the cache
Typically does not cause a stall
  – Loads the line in parallel with continued execution
Introduced by the programmer or the compiler
46
Prefetch
Advantages
  – Miss latencies can be avoided
      Assuming we can issue the prefetch far enough ahead
      Assuming the cache is large enough
Disadvantages
  – The number of instructions to execute increases
  – May cause useful data in the cache to be evicted prematurely
  – Data brought in by a prefetch might itself be evicted before it is used
47
Minimizing the impact of the disadvantages
The number of added prefetches must be close to what is needed
Prefetches should not arrive “too early”
48
Identify prefetch opportunities
    DO J = 1, M
      DO I = 1, 32
        A(I+1,J) = A(I,J) + C(J)
      ENDDO
    ENDDO
Group generator is not contained in a dependence cycle
  – A miss is expected on each iteration, unless references to the generator on subsequent iterations display temporal locality
  – Generates a miss on every new cache line
  – Insert a prefetch before references to generators
[Figure: the loop annotated with a RAW dependence.]
49
Acyclic name partitioning
Two cases:
  – References to the generator do not iterate sequentially within the loop
        DO I = 1, 32
          DO J = 1, M
            A(I+1,J) = A(I,J) + C(J)
          ENDDO
        ENDDO
  – References have spatial locality within the loop
        DO J = 1, M
          DO I = 1, 32
            A(I+1,J) = A(I,J) + C(J)
          ENDDO
        ENDDO
50
Acyclic name partitioning
Case I: references to the generator do not iterate sequentially within the loop
  – Insert a prefetch before each reference to the generator
  – Final positioning of the prefetches will be determined by the instruction scheduler
    DO I = 1, 32
      DO J = 1, M
        A(I+1,J) = A(I,J) + C(J)
      ENDDO
    ENDDO
[Figure: the loop body marked with the prefetch position.]
51
Acyclic name partitioning
Case II: references have spatial locality within the loop
  – Determine i0, the first iteration after the initial iteration that causes a miss on the access to the generator
  – Determine delta, the number of iterations between misses in the cache
    DO J = 1, M
      DO I = 1, 32
        A(I+1,J) = A(I,J) + C(J)
      ENDDO
    ENDDO
[Figure: the inner loop divided into a pre-loop and a main loop.]
52
Acyclic with spatial reuse
Partition the loop into two parts
  – Initial subloop running from 1 to i0-1
  – Remainder running from i0 to the end
In the example, i0 = 4
Before:
    DO I = 1, M
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
After:
    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
53
Acyclic with spatial reuse
Strip mine the second loop to have subloops of length delta
In the example, delta = 4
Before:
    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
After:
    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M, 4
      IU = MIN(M, I+3)
      DO ii = I, IU
        A(ii, J) = A(ii, J) + A(ii-1, J)
      ENDDO
    ENDDO
54
Acyclic with spatial reuse
Insert a prefetch before the initial loop
Insert prefetches before the inner loop
Before:
    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M, 4
      IU = MIN(M, I+3)
      DO ii = I, IU
        A(ii, J) = A(ii, J) + A(ii-1, J)
      ENDDO
    ENDDO
After:
    prefetch(A(0,J))
    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M, 4
      IU = MIN(M, I+3)
      prefetch(A(I, J))
      DO ii = I, IU
        A(ii, J) = A(ii, J) + A(ii-1, J)
      ENDDO
    ENDDO
55
Acyclic with spatial reuse
Original loop:
    DO I = 1, M
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
After splitting and strip mining:
    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M, 4
      IU = MIN(M, I+3)
      DO ii = I, IU
        A(ii, J) = A(ii, J) + A(ii-1, J)
      ENDDO
    ENDDO
After inserting prefetches:
    prefetch(A(0,J))
    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M, 4
      IU = MIN(M, I+3)
      prefetch(A(I, J))
      DO ii = I, IU
        A(ii, J) = A(ii, J) + A(ii-1, J)
      ENDDO
    ENDDO
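The placement of the prefetches can be checked with a toy model of one column. This is an idealized sketch, not a timing model: it assumes cache lines of 4 elements, A(0,J) aligned on a line boundary, M of at least 3, an unbounded cache, and prefetches that complete instantly. `demand_misses` is a hypothetical name.

```python
def demand_misses(M, line=4, use_prefetch=True):
    """Count demand misses on A(0..M, J) for one column of the transformed loop."""
    cached = set()                       # cache lines currently present
    misses = 0

    def prefetch(index):
        if use_prefetch:
            cached.add(index // line)    # idealized: the line arrives immediately

    def access(index):
        nonlocal misses
        if index // line not in cached:
            misses += 1                  # demand miss: the line was not prefetched
            cached.add(index // line)

    prefetch(0)                          # prefetch(A(0,J))
    for I in range(1, 4):                # pre-loop: I = 1 .. i0-1
        access(I)
        access(I - 1)
    for I in range(4, M + 1, 4):         # main loop: one strip per cache line
        IU = min(M, I + 3)
        prefetch(I)                      # prefetch(A(I,J))
        for ii in range(I, IU + 1):
            access(ii)
            access(ii - 1)
    return misses

print(demand_misses(18), demand_misses(18, use_prefetch=False))
```

With i0 = 4 and delta = 4, each prefetch covers exactly one new cache line, so the model issues one prefetch per line and no more.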
56
Identify prefetch opportunities
    DO J = 1, M
      DO I = 1, 32
        A(I+1,J) = A(I,J) + C(J)
      ENDDO
    ENDDO
Group generator is contained in a dependence cycle
  – A miss is expected only on the first few iterations of the carrying loop
  – A prefetch to the reference can be placed before the loop carrying the dependence
[Figure: the loop annotated with an input dependence.]
57
Put it all together
Rearrange the loop nest so that the loop iterating sequentially over cache lines is innermost
Split the innermost loop into two
  – A pre-loop, running up to the first iteration of the innermost loop that contains a generator reference beginning on a new cache line
  – A main loop, beginning with the iteration that contains the new cache reference
Insert the prefetches, as previously explained
58
Example
Original loop nest:
    DO J = 1, M
      DO I = 2, 33
        A(I, J) = A(I, J) * B(I)
      ENDDO
    ENDDO
With prefetches inserted:
    prefetch(B(2))
    DO I = 5, 33, 4
      prefetch(B(I))
    ENDDO
    DO J = 1, M
      prefetch(A(2,J))
      DO I = 2, 4
        A(I, J) = A(I, J) * B(I)
      ENDDO
      DO I = 5, 29, 4
        prefetch(A(I, J))
        A(I, J) = A(I, J) * B(I)
        A(I+1, J) = A(I+1, J) * B(I+1)
        A(I+2, J) = A(I+2, J) * B(I+2)
        A(I+3, J) = A(I+3, J) * B(I+3)
      ENDDO
      prefetch(A(33, J))
      A(33, J) = A(33, J) * B(33)
    ENDDO
59
Effectiveness of prefetching
60
What did we miss?
61
“..Sometimes, is never quite enough..” - Alanis Morissette, ’95 Optimization Seminar
Static analysis is often ineffective because of missing information:
  – Run-time cache misses
  – Miss address information
62
Profiling-based optimization
Dynamic optimization systems
  – Collect information dynamically
  – Optimize according to the profile
Use the collected information for recompilation, optimizing accordingly
Use the collected information to perform run-time optimization (modifying code at run time to insert prefetches)
  – Example: ADORE