CS 812 High Level Design & Modeling of Digital Systems MEMORY SYNTHESIS Bhuvan Middha (csu98133) Arun Kejariwal (eeu98172)
Presentation Plan Motivation Impact of Memory Architecture Decisions Optimizations in Memory Synthesis Memory Assignment of array variables Scratch-Pad Memory Conclusion References
Motivation The rate of performance improvement is different for the CPU and for memory. [Figure: CPU speed vs. memory speed plotted against year]
Impact on Processor Pipeline [Figure: three overlapping instructions flowing through the IF, Dec, ALU, MEM, WB pipeline stages] Clock cycle is determined by the slowest pipeline stage
Impact of Memory Architecture Decisions Area: 50-70% of an ASIC/ASIP may be memory Performance: 10-90% of system performance may be memory related Power: 25-40% of system power may be memory related
Issues in Memory Synthesis Number of distributed registers Number of register files Number of register file ports On-chip or off-chip memory Cache parameters Cache vs. scratch pad Number of memory ports Memory bus bandwidth Data organization and partitioning
Optimizations in Memory Synthesis Code optimizations: Read-Modify-Write (R-M-W) mode, clustering of scalar variables, reordering, hoisting, loop transformations, memory assignment of array variables Hardware optimizations: scratch pad, banking
Storing Multi-dimensional Arrays: Row-major int X[4][4]; [Figure: the logical 4x4 array mapped to physical memory locations 0-15 in row-major order]
Storing Multi-dimensional Arrays: Column-major int X[4][4]; [Figure: the logical 4x4 array mapped to physical memory locations 0-15 in column-major order]
Storing Multi-dimensional Arrays: Tile-based int X[4][4]; [Figure: the logical 4x4 array mapped to physical memory locations 0-15 tile by tile]
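To make the three layouts concrete, here is a minimal C sketch (not from the slides) that computes the linear memory offset of X[i][j] for a 4x4 array under each layout; the 2x2 tile size is an assumption, since the slide does not fix one.

#include <stdio.h>

#define N 4
#define T 2   /* assumed tile edge; the slide does not specify one */

int row_major(int i, int j)    { return i * N + j; }   /* rows are contiguous    */
int column_major(int i, int j) { return j * N + i; }   /* columns are contiguous */

/* Tiles are laid out row-major, and elements inside a tile are row-major. */
int tile_based(int i, int j) {
    int tile_row = i / T, tile_col = j / T;          /* which tile              */
    int in_row   = i % T, in_col   = j % T;          /* position inside the tile */
    int tile_index = tile_row * (N / T) + tile_col;
    return tile_index * (T * T) + in_row * T + in_col;
}

int main(void) {
    printf("X[1][2]: row-major %d, column-major %d, tile-based %d\n",
           row_major(1, 2), column_major(1, 2), tile_based(1, 2));
    return 0;
}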
Array Layout and Data Cache int a[1024]; int b[1024]; int c[1024]; ... for (i = 0; i < N; i++) c[i] = a[i] + b[i]; [Figure: a, b and c placed back to back in memory, so a[i], b[i] and c[i] all map to the same line of the direct-mapped, 512-word data cache] Problem: every access leads to a cache miss
Data Alignment int a[1024]; int b[1024]; int c[1024]; ... for (i = 0; i < N; i++) c[i] = a[i] + b[i]; [Figure: a DUMMY pad inserted after a and after b shifts b and c so that a[i], b[i] and c[i] map to different lines of the direct-mapped, 512-word data cache] Data alignment avoids cache conflicts
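A minimal sketch of the padding idea, assuming one word per cache line, a 512-word direct-mapped cache indexed by word address mod 512, and a linker that places these globals contiguously in declaration order (none of this is guaranteed by C itself):

int a[1024];
int pad1[1];          /* one-word pad: shifts b by one cache set      */
int b[1024];
int pad2[1];          /* one-word pad: shifts c by one more cache set */
int c[1024];

void vadd(int n) {
    /* a[i] maps to set i % 512, b[i] to (i+1) % 512, c[i] to (i+2) % 512,
     * so the three accesses in each iteration no longer evict one another. */
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}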
Data Layout Transformation Splitting structs into individual arrays (must account for pointer arithmetic and dereferencing) Clustering of arrays
Motivating Example struct x { int a; int b; } p [1000]; int q [1000]; ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + p[i].a; avg = avg / 1000; for (i = 0; i < 1000; i++) { p[i].b = p[i].b + avg; q[i] = p[i].b + 1; } Loop 1 averages the p[i].a fields; Loop 2 updates p[i].b and q[i]
Cache Performance: Loop 1 struct x { int a; int b; } p [1000]; int q [1000]; ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + p[i].a; avg = avg / 1000; for (i = 0; i < 1000; i++) { p[i].b = p[i].b + avg; q[i] = p[i].b + 1; } [Figure: data cache (direct-mapped, 4 lines, 2 words/line) during Loop 1 - lines 0-3 hold p[0].a p[0].b through p[3].a p[3].b; the p[i].b words fetched alongside p[i].a are useless data]
Cache Performance: Loop 2 struct x { int a; int b; } p [1000]; int q [1000]; ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + p[i].a; avg = avg / 1000; for (i = 0; i < 1000; i++) { p[i].b = p[i].b + avg; q[i] = p[i].b + 1; } [Figure: data cache (direct-mapped, 4 lines, 2 words/line) during Loop 2 - lines 0-1 hold p[0].a p[0].b and p[1].a p[1].b, line 2 holds q[0] q[1]; the p[i].a words are useless data]
Cache Performance struct x { int a; int b; } p [1000]; int q [1000]; ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + p[i].a; avg = avg / 1000; for (i = 0; i < 1000; i++) { p[i].b = p[i].b + avg; q[i] = p[i].b + 1; } Loop 1: 1000 cache misses for p[i].a Loop 2: 1500 cache misses (1000 misses for p[i].b, 500 misses for q[i]) Cache miss rate: 62.5% (2500 misses out of 4000 accesses)
Transformed Data Layout Original: struct x { int a; int b; } p [1000]; int q [1000]; Transformed: struct y { int q; // originally q int b; // originally x.b } r [1000]; int a [1000]; // originally x.a ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + a[i]; avg = avg / 1000; for (i = 0; i < 1000; i++) { r[i].b = r[i].b + avg; r[i].q = r[i].b + 1; }
Cache Performance: Loop 1 struct y { int q; // originally q int b; // originally x.b } r [1000]; int a [1000]; // originally x.a ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + a[i]; avg = avg / 1000; for (i = 0; i < 1000; i++) { r[i].b = r[i].b + avg; r[i].q = r[i].b + 1; } [Figure: data cache (direct-mapped, 4 lines, 2 words/line) during Loop 1 - line 0 holds a[0] a[1], line 1 holds a[2] a[3]; no useless data in the cache]
Cache Performance: Loop 2 struct y { int q; // originally q int b; // originally x.b } r [1000]; int a [1000]; // originally x.a ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + a[i]; avg = avg / 1000; for (i = 0; i < 1000; i++) { r[i].b = r[i].b + avg; r[i].q = r[i].b + 1; } [Figure: data cache (direct-mapped, 4 lines, 2 words/line) during Loop 2 - line 0 holds r[0].q r[0].b, line 1 holds r[1].q r[1].b; no useless data in the cache]
Cache Performance struct y { int q; // originally q int b; // originally x.b } r [1000]; int a [1000]; // originally x.a ... avg = 0; for (i = 0; i < 1000; i++) avg = avg + a[i]; avg = avg / 1000; for (i = 0; i < 1000; i++) { r[i].b = r[i].b + avg; r[i].q = r[i].b + 1; } Loop 1: 500 cache misses for a[i] (each fetched line now holds two useful words) Loop 2: 1000 cache misses Cache miss rate: 37.5% (1500 misses out of 4000 accesses)
Clustering of Arrays int a[16], b[16], c[16] for i = 0 to 7 a[i] = b[i+3] + 3 for j = 0 to 15 a[j] = b[j] * c[j] [Figure: memory layout of a, b and c (16 words each), annotated 8 + 16 = 24]
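One common reading of array clustering is to interleave arrays that are always accessed at the same index into a single array of records, so that each fetched line carries useful data from both; the C sketch below illustrates that idea on the slide's loops, but it is only an illustration and not necessarily the exact transformation the slide's figure depicts.

struct bc { int b, c; };   /* b[i] and c[i] are used together in the second loop */

int a[16];
struct bc bc[16];          /* replaces int b[16], c[16] */

void kernel(void) {
    for (int i = 0; i < 8; i++)
        a[i] = bc[i + 3].b + 3;       /* was: a[i] = b[i+3] + 3 */
    for (int j = 0; j < 16; j++)
        a[j] = bc[j].b * bc[j].c;     /* was: a[j] = b[j] * c[j] */
}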
Scratch Pad Memory Data memory residing on chip Address space is disjoint from that of off-chip memory Shares the same address and data bus as off-chip memory Guaranteed small access time, since there are no read/write misses
Memory Address Space [Figure: the CPU sees one address space; addresses 0 to P-1 map to the on-chip memory (1-cycle access), while addresses P to N-1 map to off-chip memory reached through the on-chip data cache - 1 cycle on a hit, 10-20 cycles on a miss]
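As a concrete illustration of the disjoint address ranges, a small, heavily used array can be bound to the on-chip range at link time. The sketch below assumes a GCC-style toolchain and a linker script that maps a section named .scratchpad to addresses 0..P-1; the array names and sizes are placeholders.

/* coeffs is small and heavily reused, so it is placed in the on-chip SRAM
 * range (0 .. P-1, 1-cycle access); samples stays in off-chip memory
 * (P .. N-1) and is reached through the data cache (1 cycle on a hit,
 * 10-20 cycles on a miss). */
int coeffs[64] __attribute__((section(".scratchpad")));  /* on-chip scratch pad */
int samples[1 << 16];                                    /* off-chip, cached    */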
Scratch Pad Model Organization of the scratch pad memory No tag comparison is needed A priori knowledge of the memory objects is an added advantage The scratch pad memory consists of a data array unit, a decoder unit and a peripheral unit
Why Scratchpad? Unordered array variables and scalars lead to a large number of conflict misses in the cache Accesses are data dependent, so data layout techniques are ineffective Example (a compilable sketch follows): char BrightnessLevel[512][512] int Hist[256] for i = 0 to 511 for j = 0 to 511 level = BrightnessLevel[i][j] Hist[level] = Hist[level] + 1
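A C version of the histogram example: the index level is known only at run time, so no static data layout can keep the Hist accesses from conflicting in the cache, while the small, heavily accessed Hist array is an ideal scratch-pad candidate.

char BrightnessLevel[512][512];   /* large image: stays in off-chip memory       */
int  Hist[256];                   /* small, touched every iteration: scratch pad */

void histogram(void) {
    for (int i = 0; i < 512; i++)
        for (int j = 0; j < 512; j++) {
            /* The index depends on the image data, so it cannot be predicted
             * or realigned at compile time. */
            int level = (unsigned char)BrightnessLevel[i][j];
            Hist[level] = Hist[level] + 1;
        }
}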
Data Partitioning Minimize the interference between different variables in the data cache Partitioning of variables is governed by the following code characteristics: scalar variables and constants, size of arrays, lifetime of variables, access frequency of variables, loop conflicts
Access Frequency of Variables and Loop Conflicts Variable Access Count (VAC) Interference Access Count (IAC) Interference Factor (IF): IF(u) = VAC(u) + IAC(u) Map variables with high IF values into the scratch pad memory Loop Conflict Factor (LCF) Map variables with high LCF values to the scratch pad memory
Formulation of the Partitioning Problem Total Conflict Factor (TCF): TCF(u) = IF(u) + LCF(u) Given a set of n arrays with corresponding TCF values, find an optimal subset such that the total size <= the size of the SRAM and the total TCF value is maximized Similar to the knapsack problem, except that several arrays with non-intersecting lifetimes can share the same SRAM space (a sketch follows)
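A minimal C sketch of the basic 0/1-knapsack view of the problem (ignoring the lifetime-sharing refinement noted above); the capacity, array sizes and TCF values are placeholders.

#include <string.h>

#define SRAM_WORDS 1024   /* assumed scratch-pad capacity (placeholder) */

/* dp[c] = best total TCF achievable using at most c words of SRAM. */
static long dp[SRAM_WORDS + 1];

/* size[i]: words occupied by array i; tcf[i]: its Total Conflict Factor.
 * Returns the maximum total TCF of a subset of arrays that fits in SRAM. */
long choose_arrays(int n, const int size[], const long tcf[]) {
    memset(dp, 0, sizeof dp);
    for (int i = 0; i < n; i++)                      /* each array used 0 or 1 times */
        for (int c = SRAM_WORDS; c >= size[i]; c--)  /* classic 0/1 knapsack order   */
            if (dp[c - size[i]] + tcf[i] > dp[c])
                dp[c] = dp[c - size[i]] + tcf[i];
    return dp[SRAM_WORDS];
}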
Conclusion Z-buffering - graphics Stream buffers - data prefetching Stride prediction tables - predict memory references Inter-array windowing - multi-dimensional arrays
References Books: P. Panda, N. Dutt, A. Nicolau – Memory Issues in Embedded Systems-on-Chip: Optimization and Exploration, Kluwer Academic Publishers, 1999; F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, A. Vandecappelle – Custom Memory Management Methodology, Kluwer Academic Publishers, 1998 Survey Paper: P. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle – Data and Memory Optimization Techniques for Embedded Systems, ACM Transactions on Design Automation of Electronic Systems, April 2001