Published by Della Griselda McBride. Modified over 9 years ago.
1
Increasing and Detecting Memory Address Congruence
Sam Larsen, Emmett Witchel, Saman Amarasinghe
Laboratory for Computer Science, Massachusetts Institute of Technology
2
The Congruence Property
int a[M], b[n];
for (i=0; i<n; i++) {
  a[b[i]*8] = 0;
}
[Figure: cache line byte offsets 0, 4, 8, …; access at i=0]
6
The Congruence Property
int a[M], b[n];
for (i=0; i<n; i++) {
  a[b[i]*8] = 0;
}
[Figure: cache line byte offsets 0, 4, 8, 12, 16, 20, 24, 28]
Congruent with offset of 0
7
The Congruence Property
int a[M];
for (i=0; i<n; i++) {
  a[16*i+2] = 0;
}
[Figure: cache line byte offsets 0, 4, 8, 12, 16, 20, 24, 28]
Congruent with offset of 8
8
The Congruence Property
int a[M];
for (i=0; i<n; i++) {
  a[15*i+3] = 0;
}
[Figure: cache line byte offsets 0, 4, 8, 12, 16, 20, 24, 28]
Not Congruent (32-byte line)
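The three loops above can be checked mechanically. The following is a minimal sketch, not code from the talk: it assumes the array base sits on a line boundary, and the helper name `common_line_offset` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the common line offset (in bytes) of the accesses
 * a[stride*i + first] for i = 0..iters-1, or -1 if the offsets differ.
 * Assumption of this sketch: the array base is aligned to `line` bytes. */
int common_line_offset(size_t stride, size_t first, size_t elem_size,
                       size_t line, size_t iters)
{
    assert(line > 0 && iters > 0);
    size_t expected = (first * elem_size) % line;
    for (size_t i = 1; i < iters; i++) {
        size_t off = ((stride * i + first) * elem_size) % line;
        if (off != expected)
            return -1;              /* not congruent */
    }
    return (int)expected;
}
```

With 4-byte elements and a 32-byte line, `common_line_offset(16, 2, 4, 32, n)` models `a[16*i+2]` and `common_line_offset(15, 3, 4, 32, n)` models `a[15*i+3]`. The `a[b[i]*8]` loop can be modeled with a stride of 8 and a first index of 0, since every index is a multiple of 8.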
9
Outline
Uses of congruence information
Congruence detection algorithm
Congruence-increasing transformations
Results
Related work
10
SIMD Compilation [PLDI ’00]
Multimedia extensions offer wide mem ops
–Motorola’s AltiVec
–Intel’s MMX/SSE
Automatic SIMD parallelization
–Multiple mem ops → single wide mem op
128-bit lds/strs must be 128-bit aligned
–SSE: 6-9 cycle penalty for unaligned accesses
–AltiVec: All wide mem ops have to be aligned
11
Energy Savings [Micro ’01]
Skip tag checks in a set-associative cache
Add special loads/stores to ISA
–First mem op memoizes the cache way
–Second mem op uses this to skip the check
Compiler analysis determines when data occupy the same line
–Need congruence information
12
Banked Memory Architectures
Offset specifies the memory bank
–Place data close to computation
–Access banks in parallel
[Figure: four regfile/memory pairs, with memory banks labeled 0, 4, 8, 12]
13
Congruence Recognition
Iterative dataflow analysis
–Low-level IR
Lattice elements of the form an+b
–For pointers, memory locations accessed
If a = cache line size then b = offset
–32n+8 accesses offset 8 in a 32-byte line
[Figure: cache line byte offsets 0, 4, 8, 12, 16, 20, 24, 28]
14
Dataflow Lattice (8-byte cache line)
[Figure: lattice with one level per stride]
8n+0 8n+4 8n+2 8n+6 8n+1 8n+5 8n+3 8n+7
4n+0 4n+2 4n+1 4n+3
2n+0 2n+1
n+0
15
Dataflow Lattice
[Figure: the same lattice, with 8n+0, 4n+2, 2n+0 and n+0 highlighted]
8n+0 8n+4 8n+2 8n+6 8n+1 8n+5 8n+3 8n+7
4n+0 4n+2 4n+1 4n+3
2n+0 2n+1
n+0
16
Transfer Functions
Meet: $a = \gcd(a_1, a_2, |b_1 - b_2|)$, $b = b_1 \bmod a$
17
Transfer Functions
Meet: $a = \gcd(a_1, a_2, |b_1 - b_2|)$, $b = b_1 \bmod a$
Add: $a = \gcd(a_1, a_2)$, $b = (b_1 + b_2) \bmod a$
Subtract: $a = \gcd(a_1, a_2)$, $b = (b_1 - b_2) \bmod a$
Multiply: $a = \gcd(a_1 a_2, a_1 b_2, a_2 b_1, C)$, $b = (b_1 b_2) \bmod a$
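The four rules translate almost directly into code. The sketch below is illustrative rather than the authors' implementation: the type `Cong` and the function names are hypothetical, and the constant C in the Multiply rule is assumed to be the cache line size. The remainder is normalized into [0, a) because C's `%` can return a negative value.

```c
#include <assert.h>

/* Lattice element an+b: every described value is congruent to b mod a. */
typedef struct { long a, b; } Cong;

/* Greatest common divisor; gcd(x, 0) == x. */
long gcd(long x, long y)
{
    while (y != 0) { long t = x % y; x = y; y = t; }
    return x < 0 ? -x : x;
}

/* Build an+b with b reduced into [0, a), including negative b. */
Cong cong(long a, long b)
{
    assert(a > 0);
    Cong r = { a, ((b % a) + a) % a };
    return r;
}

/* Meet: a = gcd(a1, a2, |b1 - b2|), b = b1 mod a */
Cong cong_meet(Cong x, Cong y)
{
    long d = x.b > y.b ? x.b - y.b : y.b - x.b;
    return cong(gcd(gcd(x.a, y.a), d), x.b);
}

/* Add: a = gcd(a1, a2), b = (b1 + b2) mod a */
Cong cong_add(Cong x, Cong y) { return cong(gcd(x.a, y.a), x.b + y.b); }

/* Subtract: a = gcd(a1, a2), b = (b1 - b2) mod a */
Cong cong_sub(Cong x, Cong y) { return cong(gcd(x.a, y.a), x.b - y.b); }

/* Multiply: a = gcd(a1*a2, a1*b2, a2*b1, C), b = (b1*b2) mod a.
 * C is assumed to be the line size (an assumption of this sketch). */
Cong cong_mul(Cong x, Cong y, long C)
{
    long a = gcd(gcd(x.a * y.a, x.a * y.b), gcd(x.b * y.a, C));
    return cong(a, x.b * y.b);
}
```

For example, `cong_meet(cong(32, 0), cong(32, 8))` yields 8n+0, and `cong_mul(cong(8, 7), cong(32, 4), 32)` yields 32n+28, which are the values that appear in the worked example near the end of the deck.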
18
The Bad News
Most mem ops are not congruent
–32 byte cache line
19
Congruence Conventions (Padding)
Allocate arrays/structs on a line boundary
–Congruent accesses to arrays for a given index
–Congruent accesses to struct fields
Requires that we:
–Allocate stack frames on cache line boundary
–Modify malloc to return aligned data
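The slide calls for a modified malloc. As a stand-in, the sketch below uses C11 `aligned_alloc` to get the same effect for a single allocation; the 32-byte line size and the helper names `alloc_on_line` and `line_offset` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LINE_SIZE 32   /* assumed cache line size, as on the slides */

/* Allocate at least nbytes starting on a cache line boundary.
 * C11 aligned_alloc expects the size to be a multiple of the
 * alignment, so the request is rounded up. Release with free(). */
void *alloc_on_line(size_t nbytes)
{
    size_t rounded = (nbytes + LINE_SIZE - 1) / LINE_SIZE * LINE_SIZE;
    if (rounded == 0)
        rounded = LINE_SIZE;
    return aligned_alloc(LINE_SIZE, rounded);
}

/* Byte offset of p within its cache line. */
size_t line_offset(const void *p)
{
    return (size_t)((uintptr_t)p % LINE_SIZE);
}
```

With the base on a line boundary, the offset of any element depends only on its index, which is what makes accesses "for a given index" congruent.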
20
Unrolling
Unrolling creates congruent references
int a[100];
for (i=0; i<n; i+=8) {
  a[i+0] = 0;
  a[i+1] = 0;
  …
  a[i+7] = 0;
}
[Figure: a[0], a[8], a[16], … marked on a cache line with byte offsets 0, 4, …, 28]
21
Unrolling
Unrolling creates congruent references
int a[100];
for (i=0; i<n; i+=8) {
  a[i+0] = 0;
  a[i+1] = 0;
  …
  a[i+7] = 0;
}
[Figure: a[1], a[9], a[17], … marked on a cache line with byte offsets 0, 4, …, 28]
22
Congruence with Parameters
void init(int* a) {
  for (i=0; i<n; i+=8) {
    a[i+0] = 0;
    a[i+1] = 0;
    …
    a[i+7] = 0;
  }
}
void main() {
  int a[100];
  init(&a[2]);
  init(&a[3]);
}
[Figure: cache line byte offsets 0, 4, …, 28]
24
Pre-loop
Add a pre-loop to enforce congruence
for (i=0; i<n; i++) {
  if ((int)&a[i] % 32 == 0) break;
  a[i] = 0;
}
for (; i<n; i+=8) {
  a[i+0] = 0;
  a[i+1] = 0;
  …
  a[i+7] = 0;
}
[Figure: cache line byte offsets 0, 4, …, 28]
25
Pre-loop
Add a pre-loop to enforce congruence
Mem ops congruent in the unrolled body
Pre-loop has few iterations
–Most dynamic mem ops are congruent
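A compilable version of the pre-loop pattern might look like the sketch below. It assumes a 32-byte line and 4-byte `int`, uses `uintptr_t` instead of the slide's `(int)` cast so the address test is well defined on 64-bit targets, and adds a remainder loop that the slide omits. The function name `zero_with_preloop` and its return value are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE 32   /* assumed cache line size */

/* Zero a[0..n-1]. Returns the number of pre-loop iterations executed. */
int zero_with_preloop(int *a, int n)
{
    int i;

    /* Pre-loop: peel iterations until &a[i] sits on a line boundary. */
    for (i = 0; i < n; i++) {
        if ((uintptr_t)&a[i] % LINE_SIZE == 0)
            break;
        a[i] = 0;
    }
    int peeled = i;

    /* Unrolled body: with 4-byte int, a[i+k] is at line offset 4*k
     * on every iteration, so each store has a known offset. */
    for (; i + 8 <= n; i += 8) {
        a[i+0] = 0; a[i+1] = 0; a[i+2] = 0; a[i+3] = 0;
        a[i+4] = 0; a[i+5] = 0; a[i+6] = 0; a[i+7] = 0;
    }

    /* Remainder loop (not on the slide) for leftover iterations. */
    for (; i < n; i++)
        a[i] = 0;

    return peeled;
}
```

Called on a pointer that is 12 bytes past a line boundary, the pre-loop runs five times before the unrolled body takes over.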
26
Finding the Break Condition
Can we choose arbitrarily?
void init(int *x) {
  int i;
  for (i=0; i<100; i+=2) {
    if ((int)&x[i] % 32 == 0) break;
    x[i] = 0;
  }
  ...
}
int main() {
  int x[200];
  init(&x[1]);
}
i   &x[i]%32
0   4
2   12
4   20
6   28
8   4
NO!
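The table shows why the target offset cannot be picked freely: starting at offset 4 and stepping 8 bytes, offset 0 is never reached. A small sketch of that check follows; the function name `iterations_to_offset` is hypothetical, and all arguments are byte quantities.

```c
#include <assert.h>

/* Returns the first k >= 0 with (start + k*step) % line == target,
 * or -1 if that offset is never reached. The offsets repeat with a
 * period of at most `line` iterations, so searching k < line suffices. */
int iterations_to_offset(int start, int step, int line, int target)
{
    assert(line > 0 && start >= 0 && step >= 0);
    for (int k = 0; k < line; k++) {
        if ((start + k * step) % line == target)
            return k;
    }
    return -1;
}
```

Equivalently, a target is reachable exactly when gcd(step, line) divides the difference between the target and the starting offset. Here gcd(8, 32) = 8 does not divide 4, so a break condition of 0 would leave every iteration in the pre-loop.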
27
Finding the Break Condition
void copy(int *x, int *y) {
  int i;
  for (i=0; i<100; i++) {
    if ((int)&x[i] % 32 == 0 && (int)&y[i] % 32 == 0) break;
    x[i] = y[i];
  }
  ...
}
int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}
First call:
i   &x[i]%32   &y[i]%32
0   0          0
1   4          4
…   …          …
8   0          0
Second call:
i   &x[i]%32   &y[i]%32
0   0          4
1   4          8
…   …          …
8   0          4
29
Finding the Break Condition
void copy(int *x, int *y) {
  int i;
  for (i=0; i<100; i++) {
    if ((int)&x[i] % 32 == 0 && (int)&y[i] % 32 == 4) break;
    x[i] = y[i];
  }
  ...
}
int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}
First call:
i   &x[i]%32   &y[i]%32
0   0          0
1   4          4
…   …          …
8   0          0
Second call:
i   &x[i]%32   &y[i]%32
0   0          4
1   4          8
…   …          …
8   0          4
30
Finding the Break Condition
void copy(int *x, int *y) {
  int i;
  for (i=0; i<100; i++) {
    if ((int)&x[i] % 32 == 0) break;
    x[i] = y[i];
  }
  ...
}
int main() {
  int x[200], y[200];
  copy(&x[0], &y[0]);
  copy(&x[0], &y[1]);
}
First call:
i   &x[i]%32   &y[i]%32
0   0          0
1   4          4
…   …          …
8   0          0
Second call:
i   &x[i]%32   &y[i]%32
0   0          4
1   4          8
…   …          …
8   0          4
31
Finding the Break Condition
Use profiling to observe runtime addresses
Find best break condition for the profile
Exhaustive search:
–Consider all possible break conditions
–Compute iterations in unrolled loop
–Multiply by # of mem ops with known offset
–Break condition with highest value is the best
Results vary little with profile data set
–Insignificant on all but one benchmark
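The search can be sketched for the two-pointer `copy` loop. This is a simplified illustration under several assumptions, not the tool from the talk: a 32-byte line, 4-byte elements, a profile that records only entry offsets and trip counts, and candidate conditions that always constrain `x` and optionally constrain `y`. The types `Call` and `BreakCond` and both function names are hypothetical.

```c
#include <assert.h>

#define LINE 32   /* assumed line size in bytes */
#define ELEM 4    /* assumed element size in bytes */

/* One profiled call: line offsets of x and y on entry, and trip count. */
typedef struct { int x_off, y_off, trips; } Call;

/* Break condition: required offset of &x[i]; y_target < 0 leaves y free. */
typedef struct { int x_target, y_target; long score; } BreakCond;

/* Iterations left for the unrolled loop, 0 if the condition never fires. */
long unrolled_iters(Call c, int xt, int yt)
{
    for (int i = 0; i < c.trips; i++) {
        int xo = (c.x_off + i * ELEM) % LINE;
        int yo = (c.y_off + i * ELEM) % LINE;
        if (xo == xt && (yt < 0 || yo == yt))
            return c.trips - i;
    }
    return 0;
}

/* Try every candidate; score = unrolled iterations times the number of
 * memory operations whose offset the condition pins down. */
BreakCond best_break(const Call *prof, int ncalls)
{
    BreakCond best = { 0, -1, -1 };
    for (int xt = 0; xt < LINE; xt += ELEM) {
        for (int yt = -1; yt < LINE; yt = (yt < 0 ? 0 : yt + ELEM)) {
            long known_ops = (yt < 0) ? 1 : 2;
            long score = 0;
            for (int k = 0; k < ncalls; k++)
                score += known_ops * unrolled_iters(prof[k], xt, yt);
            if (score > best.score) {   /* ties keep the earlier candidate */
                best.x_target = xt;
                best.y_target = yt;
                best.score = score;
            }
        }
    }
    return best;
}
```

On the two calls from the preceding slides, every top-scoring candidate reaches 200, and the tie goes to the condition on `x` alone because it is tried first. When `y` sits 4 bytes past `x` in every profiled call, a condition on both pointers wins, since both memory operations then have known offsets.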
32
Congruence Results (SPECfp95)
[Chart; legend: Original, Congruent]
33
Congruence Results (SPECfp95)
[Chart; legend: Original, Congruent, Detected]
34
Congruence Results (MediaBench)
[Chart; legend: Original, Congruent, Detected]
35
Execution Time Overhead
benchmark   unrolling   + pre-loop
applu       -6.27%      -5.28%
apsi        0.93%       1.13%
fpppp       0.00%
hydro2d     0.99%       0.39%
mgrid       0.72%
su2cor      -0.32%      0.11%
swim        -0.96%      -0.17%
tomcatv     -0.18%      0.65%
turb3d      -0.80%      1.72%
wave5       3.75%       4.58%
36
DCache Energy Savings [Micro ’01]
37
Related Work
Fisher and Ellis – Bulldog Compiler
–Memory bank disambiguation
–Loop unrolling
Barua et al. – Raw Compiler
–Modulo unrolling
Davidson et al. – Mem Access Coalescing
–Loop unrolling
–Alignment checks at runtime
38
Conclusions
Increased number of congruent refs by 5x
Analysis detected 95%
Results are good
–MediaBench – 65% congruent, 60% detected
–SpecFP95 – 84% congruent, 82% detected
Many uses of congruence information
–Wide accesses in multimedia extensions
–Energy savings by tag check elimination
–Bank disambiguation in clustered architectures
40
Example
int a[100];
for (i=0; i<n; i+=8) {
  a[i+0] = 0;
  a[i+1] = 0;
  …
  a[i+7] = 0;
}
[Figure: control-flow graph for the a[i+7] store]
i = 0
r0 = i+7
r1 = r0*4
r2 = r1+a
*r2 = 0
i = i+8
i < n
41
Example
[Figure: control-flow graph annotated with lattice values]
i = 0        i: 32n+0
r0 = i+7     r0: 32n+0 + 32n+7 = 32n+7
r1 = r0*4    r1: 32n+7 * 32n+4 = 32n+28
r2 = r1+a    r2: 32n+28 + 32n+0 = 32n+28
*r2 = 0
i = i+8      i: 32n+0 + 32n+8 = 32n+8
i < n
42
Example
[Figure: control-flow graph annotated with lattice values; earlier values are replaced after the meet at the loop head]
Earlier values:
i: 32n+0
r0: 32n+0 + 32n+7 = 32n+7
r1: 32n+7 * 32n+4 = 32n+28
r2: 32n+28 + 32n+0 = 32n+28
i: 32n+0 + 32n+8 = 32n+8
Updated values:
i = 0        i: 32n+0
loop head    i: 32n+0 ⊓ 32n+8 = 8n+0
r0 = i+7     r0: 8n+0 + 32n+7 = 8n+7
r1 = r0*4    r1: 8n+7 * 32n+4 = 32n+28
r2 = r1+a    r2: 32n+28 + 32n+0 = 32n+28
*r2 = 0      *r2: offset is 28
i = i+8      i: 8n+0 + 32n+8 = 8n+0
i < n
43
Multimedia Compilation
PowerMAC G4 with AltiVec
Commercial vectorizing compiler
–Alignment pragmas
datatype   Vector length   Speedup (unaligned)   Speedup (aligned)   Improvement
float      4               3.25                  4.75                46%
int        4               2.15                  2.93                36%
short      8               2.98                  5.87                97%
char       16              5.21                  11.53               121%