1
CS 812: High Level Design & Modeling of Digital Systems
MEMORY SYNTHESIS
Bhuvan Middha (csu98133), Arun Kejariwal (eeu98172)
2
Presentation Plan
- Motivation
- Impact of Memory Architecture Decisions
- Optimizations in Memory Synthesis
- Memory Assignment of array variables
- Scratch-Pad Memory
- Conclusion
- References
3
Motivation
Rate of performance improvement is different for CPU and memory
[Figure: speed vs. year, with separate curves for CPU and memory]
4
Impact on Processor Pipeline
[Figure: three overlapped instructions, each passing through the IF, Dec, ALU, MEM and WB pipeline stages]
Clock cycle determined by slowest pipeline stage
5
Impact of Memory Architecture Decisions
- Area: 50-70% of ASIC/ASIP may be memory
- Performance: 10-90% of system performance may be memory related
- Power: 25-40% of system power may be memory related
6
Issues in Memory Synthesis
- Number of distributed registers
- Number of register files
- Number of register file ports
- On-chip or off-chip memory
- Cache parameters
- Cache vs. scratch pad
- Number of memory ports
- Memory bus bandwidth
- Data organization and partitioning
7
Optimizations in Memory Synthesis
Code Optimizations
- R-M-W Mode
- Clustering of scalar variables
- Reordering
- Hoisting
- Loop transformations
- Memory assignment of array variables
Hardware Optimizations
- Scratch Pad
- Banking
8
Storing Multi-dimensional Arrays: Row-major
int X[4][4];
[Figure: row-major storage of X, mapping logical memory (the 4 x 4 array) onto a linear physical memory ending at location 15]
9
Storing Multi-dimensional Arrays: Column-major
int X[4][4];
[Figure: column-major storage of X, mapping logical memory (the 4 x 4 array) onto a linear physical memory ending at location 15]
10
Storing Multi-dimensional Arrays: Tile-based
int X[4][4];
[Figure: tile-based storage of X, mapping logical memory (the 4 x 4 array) onto a linear physical memory ending at location 15]
11
Array Layout and Data Cache
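int a[1024];
int b[1024];
int c[1024];
...
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
[Figure: arrays a, b and c laid out back to back in memory; a[i], b[i] and c[i] all map to the same location of the data cache (direct-mapped, 512 words), because each array is a multiple of the cache size]
Problem: Every access leads to cache miss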
12
Data Alignment
int a[1024];
int b[1024];
int c[1024];
...
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
[Figure: memory holds a, DUMMY, b, DUMMY, c; with the padding, a[i], b[i] and c[i] map to different locations of the data cache (direct-mapped, 512 words)]
Data alignment avoids cache conflicts
13
Data Layout Transformation
- Splitting structs into individual arrays
- Account for pointer arithmetic, dereferencing
- Clustering of arrays
14
Motivating Example
Arrays:
struct x {
    int a;
    int b;
} p[1000];
int q[1000];
...
Loop 1:
avg = 0;
for (i = 0; i < 1000; i++)
    avg = avg + p[i].a;
avg = avg / 1000;
Loop 2:
for (i = 0; i < 1000; i++) {
    p[i].b = p[i].b + avg;
    q[i] = p[i].b + 1;
}
15
Cache Performance: Loop 1
Data Cache [direct-mapped, 4 lines, 2 words/line] while Loop 1 of the motivating example runs:
Line 0: p[0].a p[0].b
Line 1: p[1].a p[1].b
Line 2: p[2].a p[2].b
Line 3: p[3].a p[3].b
Useless data: Loop 1 reads only p[i].a, so every p[i].b word brought into the cache is wasted.
16
Cache Performance: Loop 2
Data Cache [direct-mapped, 4 lines, 2 words/line] while Loop 2 of the motivating example runs:
Line 0: p[0].a p[0].b
Line 1: p[1].a p[1].b
Line 2: q[0] q[1]
Line 3: (empty)
Useless data: Loop 2 uses only p[i].b and q[i], so every p[i].a word brought into the cache is wasted.
17
Cache Performance
Loop 1: 1000 cache misses for p[i].a
Loop 2: 1500 cache misses (1000 misses for p[i].b, 500 misses for q[i])
Cache miss rate: 62.5%
18
Transformed Data Layout
Original layout:
struct x {
    int a;
    int b;
} p[1000];
int q[1000];
Transformed layout:
struct y {
    int q;    // originally q
    int b;    // originally x.b
} r[1000];
int a[1000];  // originally x.a
...
Loop 1:
avg = 0;
for (i = 0; i < 1000; i++)
    avg = avg + a[i];
avg = avg / 1000;
Loop 2:
for (i = 0; i < 1000; i++) {
    r[i].b = r[i].b + avg;
    r[i].q = r[i].b + 1;
}
19
Cache Performance: Loop 1
Data Cache [direct-mapped, 4 lines, 2 words/line] while Loop 1 runs on the transformed layout:
Line 0: a[0] a[1]
Line 1: a[2] a[3]
Line 2: (empty)
Line 3: (empty)
No useless data in cache
20
Cache Performance: Loop 2
Data Cache [direct-mapped, 4 lines, 2 words/line] while Loop 2 runs on the transformed layout:
Line 0: r[0].q r[0].b
Line 1: r[1].q r[1].b
Line 2: (empty)
Line 3: (empty)
No useless data in cache
21
Cache Performance
Loop 1: 500 cache misses
Loop 2: 1000 cache misses
Cache miss rate: 37.5%
22
Clustering of Arrays
int a[16], b[16], c[16]
For i = 0 to 7
    a[i] = b[i+3] + 3
For j = 0 to 15
    a[j] = b[j] * c[j]
[Figure: arrays a, b and c annotated with the values 8 + 16 = 24 and 16, 16, 16]
23
Scratch Pad Memory
- Data memory residing on chip
- Address space disjoint from off-chip memory
- Same address and data bus as that for off-chip memory
- Guaranteed small access time, as there are no read/write misses
24
Memory Address Space
[Figure: the CPU reaches the on-chip memory in 1 cycle and the on-chip data cache in 1 cycle; off-chip memory takes 10-20 cycles. The memory address space runs up to N-1: addresses up to P-1 lie in on-chip memory, addresses P to N-1 in off-chip memory.]
25
Scratch Pad Model
Organization of scratch pad memory
- No tag comparison is needed
- A priori knowledge of the memory objects is an added advantage
- Scratch pad memory consists of the data array unit, the decoder unit and the peripheral unit
26
Why Scratchpad?
- Unordered array variables and scalars lead to a large number of conflict misses in the cache
- Accesses are data dependent, so data layout techniques are ineffective
Example:
char BrightnessLevel[512][512]
int Hist[256]
for i = 0 to 511
    for j = 0 to 511
        level = BrightnessLevel[i][j]
        Hist[level] = Hist[level] + 1
27
Data Partitioning
Minimize the interference between different variables in the data cache
Partitioning of variables is governed by the following code characteristics:
- scalar variables and constants
- size of arrays
- life time of variables
- access frequency of variables
- loop conflicts
28
Access Frequency of Variables and Loop Conflicts
- Variable Access Count (VAC)
- Interference Access Count (IAC)
- Interference Factor (IF): IF(u) = VAC(u) + IAC(u)
- Map variables with high IF values into the scratch pad memory
- Loop Conflict Factor (LCF)
- Map variables with high LCF values into the scratch pad memory
29
Formulation of Partitioning problem
- Total Conflict Factor (TCF): TCF(u) = IF(u) + LCF(u)
- Given a set of n arrays with corresponding TCF values, find an optimal subset such that total size <= size of SRAM and total TCF value is maximized
- Similar to the knapsack problem, except that several arrays with non-intersecting lifetimes can share the same SRAM space
30
Conclusion
- Z-Buffering: graphics
- Stream buffers: data pre-fetching
- Stride prediction tables: predict memory references
- Inter-array windowing: multi-dimensional arrays
31
References
Books
- P. Panda, N. Dutt, A. Nicolau, Memory Issues in Embedded Systems-on-Chip: Optimization and Exploration, Kluwer Academic Publishers, 1999
- F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, A. Vandecappelle, Custom Memory Management Methodology, Kluwer Academic Publishers, 1998
Survey Paper
- P. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle, Data and Memory Optimization Techniques for Embedded Systems, ACM Transactions on Design Automation of Electronic Systems, April 2001