Embedded Computer Architecture 5KK73 TU/e Henk Corporaal Bart Mesman Data Memory Management Part b: Loop transformations & Data Reuse.

Presentation transcript:

Embedded Computer Architecture 5KK73 TU/e Henk Corporaal Bart Mesman Data Memory Management Part b: Loop transformations & Data Reuse

Thanks to the IMEC DTSE experts: Erik Brockmeyer (IMEC, Leuven, Belgium), and also Martin Palkovic, Sven Verdoolaege, Tanja van Achteren, Sven Wuytack, Arnout Vandecappelle, Miguel Miranda, Cedric Ghez, Tycho van Meeuwen, Eddy Degreef, Michel Eyckmans, Francky Catthoor, et al.

Slide 3: DM methodology (from C-in to C-out)
- Dataflow transformations
- Analysis/preprocessing
- Loop/control-flow transformations
- Data reuse
- Storage cycle budget distribution
- Memory allocation and assignment
- Memory layout organisation
- Address optimization

Slide 4: Locality of reference

  // Before: all A[i] are produced first, then consumed in a separate loop
  for (i=0; i < 8; i++) A[i] = ...;
  for (i=0; i < 8; i++) B[7-i] = f(A[i]);

  // After loop merging: each A[i] is consumed right after its production
  for (i=0; i < 8; i++) { A[i] = ...; B[7-i] = f(A[i]); }

[Figure: production and consumption of A (location versus time) for both versions]

Slide 5: Regularity

  // Irregular: A is consumed in reverse order, so the dependency vectors cross
  for (i=0; i < 8; i++) A[i] = ...;
  for (i=0; i < 8; i++) B[i] = f(A[7-i]);

  // Regular (second loop reversed): A is consumed in production order
  for (i=0; i < 8; i++) A[i] = ...;
  for (i=0; i < 8; i++) B[7-i] = f(A[i]);

[Figure: production and consumption of A (location versus time) for both versions]

Slide 6: Enabling reuse

  // Before: A[i] is read in two separate loops, far apart in time
  for (i=0; i < 8; i++) B[i] = f1(A[i]);
  for (i=0; i < 8; i++) C[i] = f2(A[i]);

  // After loop fusion: the second read of A[i] directly reuses the first
  for (i=0; i < 8; i++) { B[i] = f1(A[i]); C[i] = f2(A[i]); }

[Figure: consumption of A (location versus time) for both versions]

Slide 7: How do we perform these loop transformations automatically? This requires a cost function and a transformation technique. Let's first introduce some terminology:
- iteration spaces
- polytopes
- the ordering vector, which determines the execution order
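As a brief illustration of this terminology (a sketch, not taken from the slides): the iteration space of the B-loop on the next slide can be written as an integer polytope, and each read of A gives a dependency vector:

  $$ \mathcal{P}_B = \{\, (i,j) \in \mathbb{Z}^2 \mid 1 \le i \le 5,\ 2 \le j \le 5 \,\}, \qquad
     d = \begin{pmatrix} i \\ j \end{pmatrix} - \begin{pmatrix} i-1 \\ j-2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} $$

The ordering vector then fixes the direction in which this polytope is scanned when code is generated.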

Slide 8: Iteration space and polytopes

  // assume A[][] exists
  for (i=1; i<6; i++) {
    for (j=2; j<6; j++) {
      B[i][j] = g( A[i-1][j-2] );
    }
  }

[Figure: (i,j) iteration space showing polytope A (production space), polytope B (consumption space), and the dependency vectors between them]

Slide 9: Example with 3 polytopes

Algorithm having three loop nests:

  A: for (i=1; i<=N; ++i)
       for (j=1; j<=N-i+1; ++j)
         a[i][j] = in[i][j] + a[i-1][j];

  B: for (p=1; p<=N; ++p)
       b[p][1] = f( a[N-p+1][p], a[N-p][p] );

  C: for (k=1; k<=N; ++k)
       for (l=1; l<=k; ++l)
         b[k][l+1] = g( b[k][l] );

[Figure: the three polytopes A, B, and C in their respective iteration spaces (i,j), (p), and (k,l)]

Slide 10: Common iteration space

Initial solution having a common iteration space (i from 1 to 2N+1, j from 1 to 2N):

  for (i=1; i<=(2*N+1); ++i)
    for (j=1; j<=2*N; ++j) {
      if (i>=1 && i<=N && j>=1 && j<=N-i+1)
        a[i][j] = in[i][j] + a[i-1][j];
      if (i==N+1 && j>=1 && j<=N)
        b[j][1] = f( a[N-j+1][j], a[N-j][j] );
      if (i>=N+2 && i<=2*N+1 && j>=N+1 && j<=i-1)
        b[i-N-1][j-N+1] = g( b[i-N-1][j-N] );
    }

Properties of this initial placement:
- bad locality
- bad regularity
- requires 2N memory locations
- many dummy iterations

[Figure: common (i,j) iteration space with the three polytopes and the ordering vector]

Slide 11: Cost function needed for automation
- Regularity
  - equal direction for all dependency vectors
  - avoid that dependency vectors cross each other
  - good for storage size
- Temporal locality
  - equal length of all dependency vectors
  - good for storage size
  - good for data reuse

Slide 12: Regularity
[Figure: example of a regular versus an irregular set of dependency vectors]

Slide 13: Bad regularity limits the ordering freedom
[Figure: the common (i,j) iteration space (i up to 2*N+1, j up to 2*N); with the irregular dependencies the remaining ordering freedom is only 90 degrees]

Slide 14: Locality estimates: a few options
The dependency vector length is a measure for locality (P = production, C = consumption). Q: which length measure is the best estimate?
- Sum{d_i}
- Max{d_i}
- spanning tree over the d_i
[Figure: one production P with several consumptions C, connected by dependency vectors d_i]
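A minimal sketch (not part of the slides) of how two of these locality estimates could be computed for a list of dependency vectors; the Manhattan length and the example vectors are assumptions, any norm would do:

  #include <stdio.h>
  #include <stdlib.h>

  /* A dependency vector in a 2-dimensional iteration space. */
  typedef struct { int di, dj; } DepVec;

  static int manhattan(DepVec d) { return abs(d.di) + abs(d.dj); }

  /* Two simple locality estimates over all dependency vectors:
   * the sum of their lengths and the maximum length.           */
  static void locality_estimates(const DepVec *d, int n, int *sum, int *max)
  {
      *sum = 0; *max = 0;
      for (int i = 0; i < n; i++) {
          int len = manhattan(d[i]);
          *sum += len;
          if (len > *max) *max = len;
      }
  }

  int main(void)
  {
      DepVec d[] = { {1, 2}, {1, 0}, {0, 3} };   /* hypothetical example vectors */
      int sum, max;
      locality_estimates(d, 3, &sum, &max);
      printf("Sum{d_i} = %d, Max{d_i} = %d\n", sum, max);
      return 0;
  }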

Slide 15: Three-step approach for a loop transformation tool
1. Affine loop transformations (rotation, skewing, interchange, reversal); only geometric information is needed
2. Polytope placement (translation); only geometric information is needed
3. Choose the ordering vector and generate the code
Combined transformation: see the formula below.
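The combined transformation normally takes the standard affine form; this is a reconstruction from the general theory, not a formula copied from the slide. For example, interchanging a 2-deep loop nest corresponds to a permutation matrix:

  $$ \mathbf{i}' = A\,\mathbf{i} + \mathbf{b}, \qquad
     \text{interchange: } A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},\ \mathbf{b} = \mathbf{0} $$

Here the unimodular matrix A captures step 1 (the affine loop transformation) and the translation b captures step 2 (the polytope placement).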

Slide 16: Three-step approach, step 1: affine loop transformations
After the affine transformations (here, polytope C is mirrored by reversing its k loop), the three polytopes are:

  A: (i: 1..N):: (j: 1..N-i+1):: a[i][j] = in[i][j] + a[i-1][j];
  B: (p: 1..N)::                 b[p][1] = f( a[N-p+1][p], a[N-p][p] );
  C: (k: 1..N):: (l: 1..k)::     b[N-k+1][l+1] = g( b[N-k+1][l] );

Slide 17: Three-step approach, step 2: polytope placement
[Figure: the transformed polytopes A, B, and C being placed in one common iteration space (steps: 1. affine loop transformations, 2. polytope placement, 3. choose ordering vector)]

Slide 18: Three-step approach, step 2 continued: polytope placement = merging the loops
[Figure: polytopes A, B, and C merged into one common iteration space]

Slide 19: Choose the optimal ordering vector
[Figure: two candidate ordering vectors (1 and 2) over the merged iteration space]

Slide 20: From the polyhedral model back to C
Optimized solution having a common iteration space:

  for (j=1; j<=N; ++j) {
    for (i=1; i<=N-j+1; ++i)
      a[i][j] = in[i][j] + a[i-1][j];
    b[j][1] = f( a[N-j+1][j], a[N-j][j] );
    for (l=1; l<=j; ++l)
      b[j][l+1] = g( b[j][l] );
  }

Properties (after 1. affine loop transformations, 2. polytope placement, 3. choosing the ordering vector):
- optimal locality
- optimal regularity
- requires only 2 memory locations

Slide 21: Loop trafo - cavity detection
[Figure: cavity-detection pipeline (scanner, Gauss blur x, Gauss blur y, ...) on an N x M image; an X-Y loop interchange in the Gauss-blur-y stage reduces the buffer between the stages from N x M to N x (2GB+1)]
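A minimal C sketch (not the course code) of what this X-Y loop interchange on the vertical blur stage could look like; the array names, the vblur helper, the unweighted kernel, and the sizes N, M, GB are assumptions:

  #include <stdio.h>
  #define N  8          /* image width              */
  #define M  8          /* image height             */
  #define GB 1          /* half size of blur kernel */

  static int image_gx[M][N], image_gxy[M][N];

  static int vblur(int x, int y)               /* unweighted vertical average */
  {
      int s = 0;
      for (int k = -GB; k <= GB; k++) s += image_gx[y + k][x];
      return s / (2 * GB + 1);
  }

  int main(void)
  {
      /* Before: x outer, y inner, so all M rows of image_gx must stay alive. */
      for (int x = 0; x < N; x++)
          for (int y = GB; y < M - GB; y++)
              image_gxy[y][x] = vblur(x, y);

      /* After interchange: y outer, x inner, so consumption follows the
       * row-wise production order of image_gx and only the last 2*GB+1
       * rows of it are needed at any time.                              */
      for (int y = GB; y < M - GB; y++)
          for (int x = 0; x < N; x++)
              image_gxy[y][x] = vblur(x, y);

      printf("%d\n", image_gxy[GB][0]);
      return 0;
  }

After the interchange the consumption order of the x-blurred image matches its production order, which is what makes the N x (2GB+1) line buffer of slide 53 possible.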

Slide 22: Loop trafo - cavity detection (1)
[Figure: the three steps applied to the cavity detector: 1. transform (interchange), 2. translate (merge), 3. order]

Slide 23: Loop trafo - cavity detection (2)
[Figure: the x-blur filter polytope after 1. transform (interchange), 2. translate (merge), 3. choose order]

Slide 24: Loop trafo - cavity detection (overview repeated)
[Figure: the same pipeline as on slide 21; the X-Y loop interchange reduces the buffer from N x M to N x (2GB+1)]

Slide 25: Loop trafo - cavity detection (3)
[Figure: comparing two different translations (polytope placements) before choosing the order]

Slide 26: Loop trafo - cavity detection (4)
[Figure: combining (merging) multiple polytopes and choosing the ordering]

Slide 27: Result on the Gauss filter

  for (y=0; y<M+GB; ++y) {
    for (x=0; x<N+GB; ++x) {
      if (x>=GB && x<=N-1-GB && y>=GB && y<=M-1-GB) {
        gauss_x_compute = 0;
        for (k=-GB; k<=GB; ++k)
          gauss_x_compute += image_in[x+k][y]*Gauss[abs(k)];
        gauss_x_image[x][y] = gauss_x_compute/tot;
      } else if (x<N && y<M)
        gauss_x_image[x][y] = 0;

      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
        gauss_xy_compute = 0;
        for (k=-GB; k<=GB; ++k)
          gauss_xy_compute += gauss_x_image[x][y-GB+k]*Gauss[abs(k)];
        gauss_xy_image[x][y-GB] = gauss_xy_compute/tot;
      } else if (x<N && (y-GB)>=0 && (y-GB)<M)
        gauss_xy_image[x][y-GB] = 0;
    }
  }

Slide 28: Intermezzo. Before we continue with data reuse, have a look at the other loop transformations; check the slides discussed earlier!

Slide 29: DM methodology, continuing with the data reuse step
- Dataflow transformations
- Analysis/preprocessing
- Loop/control-flow transformations
- Data reuse (this part)
- Storage cycle budget distribution
- Memory allocation and assignment
- Memory layout organisation
- Address optimization

Slide 30: Memory hierarchy and data reuse
[Figure: memory hierarchy with layer 1, layer 2, layer 3 and the data paths]
1. Determine the reuse candidates
2. Combine reuse candidates into reuse chains
3. If there are multiple access statements per array, combine the chains into reuse trees
4. Determine the number of layers (if the architecture is not fixed)
5. Select candidates and assign them to memory layers
6. Add the extra transfers between the different memory layers (for scratchpad RAM; not needed for caches)

Slide 31: TI example platform
[Figure: example TI memory hierarchy with three layers and the core]
- L0, processor partition: register file of 2x16 registers, bandwidth 4.8 Gwords/s
- L1, on-chip: a fixed-size RAM partition (32x 4Kx16 dual-ported blocks, 256 Kb total, 1 element per cycle), a variable-size RAM partition (8x 4Kx16 dual- and single-ported blocks, 64 Kb total, 2 elements per cycle, seemingly accessible in parallel with the 256 Kb memory), and a 16Kx16 ROM partition (data/program/DMA); sizes 32 kB and 320 kB, bandwidths 100 Mwords/s and 400 Mwords/s, dual-ported
- L2, off-chip: SRAM/EPROM/SDRAM/SBSRAM, max. 8Mx16 (16 MB), single port, bandwidth 50 Mwords/s, first access 3 cycles and subsequent accesses 2 cycles, Vdd = 1.5 V, P unknown

Slide 32: Exploiting the memory hierarchy for reduced power: principle
[Figure: processor (data paths + register file) connected to one big memory M]
Before: all accesses to array A go to the big memory M with relative power P = 1. With #A = 100% of the accesses, P_total(before) = 100%.

Slide 33: Exploiting the memory hierarchy for reduced power: principle (continued)
[Figure: the same processor, now with a two-level variant (M with one copy A', P = 0.1, and 5% of the accesses still going to M) and a three-level variant (M, copy A' with P = 0.1, copy A'' with P = 0.01, serving 1%, 10%, and 100% of the accesses respectively)]
P_total(before) = 100%. After introducing the copy hierarchy, only a small fraction of the accesses still reaches the big memory:
P_total(after) = 100% x 0.01 + 10% x 0.1 + 1% x 1 = 3%
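Written out as a formula (a reconstruction of the calculation implied by the numbers above, not a formula printed on the slide), with f_l the fraction of accesses served by layer l and P_l its relative power per access:

  $$ P_{\text{total}} = \sum_{l} f_l \, P_l, \qquad
     P_{\text{total}}(\text{after}) = 1.00 \cdot 0.01 + 0.10 \cdot 0.1 + 0.01 \cdot 1 = 0.03 = 3\% $$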

Slide 34: Data reuse decision and memory hierarchy: principle
[Figure: arrays A and B mapped onto the hierarchy; A gets copies A' and A'', while B bypasses the intermediate layers through customized connections]
Customized connections in the memory subsystem can bypass the memory hierarchy and avoid its overhead.

Slide 35: Step 1: identify arrays with data reuse potential

  for (i=0; i<4; i++)
    for (j=0; j<3; j++)
      for (k=0; k<6; k++)
        ... = A[i*4+k];

[Figure: array index versus time; in each of the four time frames (one per i iteration) a copy of 6 elements of A (copy1..copy4) is accessed, showing both intra-copy reuse within a time frame and inter-copy reuse between consecutive copies]

Slide 36: Importance of a high-level cost estimate

  for (i=0; i<4; i++)
    for (j=0; j<3; j++)
      for (k=0; k<6; k++)
        ... = A[i*4+k];

[Figure: the same access pattern; the 6-element copies are stored in-place, i.e. consecutive copies overwrite each other in the same buffer]

Slide 37: Step 1: determine the gains - intra-copy reuse factor

  for (i=0; i<4; i++)
    for (j=0; j<3; j++)
      for (k=0; k<6; k++)
        ... = A[i*4+k];

The j iterator does not appear in the index expression A[i*4+k], so each element of the 6-element copy is reread in every j iteration: intra-copy reuse factor = 3 (the j loop count).

Slide 38: Step 1: determine the gains - inter-copy reuse factor

  for (i=0; i<4; i++)
    for (j=0; j<3; j++)
      for (k=0; k<6; k++)
        ... = A[i*4+k];

The weight (stride) of the i iterator (4) is smaller than the k range (6), so consecutive copies overlap by 2 of the 6 elements (1/3 of the copy) and there is inter-copy reuse: inter-copy reuse factor = 1/(1 - 1/3) = 3/2.
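A minimal sketch (not on the slides) of how this reuse could be made explicit with a copy in a smaller memory layer; the buffer name A_copy and the initialization of A are assumptions:

  #include <stdio.h>

  int main(void)
  {
      int A[4*4 + 6];                 /* backing array in the big memory   */
      int A_copy[6];                  /* copy candidate in a smaller layer */
      for (int n = 0; n < 4*4 + 6; n++) A[n] = n;

      int sum = 0;
      for (int i = 0; i < 4; i++) {
          /* one copy per i iteration: 6 transfers from the big memory */
          for (int k = 0; k < 6; k++) A_copy[k] = A[i*4 + k];
          /* 3*6 = 18 reads now hit the small copy instead of A:
           * intra-copy reuse factor 18/6 = 3                       */
          for (int j = 0; j < 3; j++)
              for (int k = 0; k < 6; k++)
                  sum += A_copy[k];
      }
      printf("%d\n", sum);
      return 0;
  }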

Slide 39: Possibility for a multi-level hierarchy

  for (i=0; i<10; i++)
    for (j=0; j<2; j++)
      for (k=0; k<3; k++)
        for (l=0; l<3; l++)
          for (m=0; m<5; m++)
            ... = A[i*15+k*5+m];

[Figure: array index versus time at two granularities: per i iteration a copy of 15 elements (the k,m range) is reused across the time frames, and inside each of them a copy of 5 elements (the m range) is reused per k iteration, suggesting a two-level copy hierarchy]

Slide 40: Step 2: determine data reuse chains for each memory access
[Figure: alternative reuse chains for read access R1(A): A directly, A -> A', and A -> A' -> A'']
There are many reuse possibilities, so a cost estimate is needed to prune them down to the promising ones.

Slide 41: The cost function needs both the size of, and the number of accesses to, the intermediate array

  for (i=0; i<10; i++)
    for (j=0; j<2; j++)
      for (k=0; k<3; k++)
        for (l=0; l<3; l++)
          for (m=0; m<5; m++)
            ... = A[i*15+k*5+m];

Estimate the number of misses from the different levels for one iteration of i:
- R1(A): 2*3*3*5 = 90 accesses
- copy A' of 3*5 = 15 elements: 15 misses
- copy A' of 5 elements: 2*3*5 = 30 misses
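For the loop nest above, a two-level copy hierarchy could be made explicit as in the following sketch (an illustration, not the course code; the names A_L1 and A_L0 are assumptions); the comments show where the access counts per i iteration come from:

  #include <stdio.h>

  int main(void)
  {
      int A[10*15 + 15];                       /* array in off-chip memory    */
      int A_L1[15], A_L0[5];                   /* layer-1 and layer-0 copies  */
      for (int n = 0; n < 10*15 + 15; n++) A[n] = n;

      long sum = 0;
      for (int i = 0; i < 10; i++) {
          for (int n = 0; n < 15; n++)         /* 15 misses to A per i        */
              A_L1[n] = A[i*15 + n];
          for (int j = 0; j < 2; j++)
              for (int k = 0; k < 3; k++) {
                  for (int m = 0; m < 5; m++)  /* 2*3*5 = 30 misses to A_L1   */
                      A_L0[m] = A_L1[k*5 + m];
                  for (int l = 0; l < 3; l++)
                      for (int m = 0; m < 5; m++)
                          sum += A_L0[m];      /* 2*3*3*5 = 90 reads hit A_L0 */
              }
      }
      printf("%ld\n", sum);
      return 0;
  }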

Slide 42: Very simplistic power and area estimation for the different data-reuse versions
[Figure: for each candidate chain (A only; A -> A'; A -> A' -> A'') the number of accesses, the size, and the resulting energy are tabulated]

Slide 43: Step 3: determine data reuse trees for multiple accesses

Read access R1(A), with chain A -> A' -> A'':

  for (i=0; i<10; i++)
    for (j=0; j<2; j++)
      for (k=0; k<3; k++)
        for (l=0; l<3; l++)
          for (m=0; m<5; m++)
            ... = A[i*15+k*5+m];

Read access R2(A), with chain A -> A':

  for (x=0; x<8; x++)
    for (y=0; y<5; y++)
      ... = A[x*5+y];

Slide 44: Step 3: determine data reuse trees for multiple accesses (continued)
[Figure: the chains R1(A): A -> A' -> A'' and R2(A): A -> A' are combined into a single reuse tree rooted at A, sharing the copy A' where possible]

Slide 45: Step 4: determine the number of layers
[Figure: the reuse trees of arrays A and B mapped onto the hierarchy layers: layer 1, layer 2, layer 3, foreground memory, datapath]

Slide 46: Step 5: select and assign reuse candidates
[Figure: possible assignments of the copies in the reuse tree of A to the hierarchy layers, down to the foreground (FG) memory]

Slide 47: Step 5: all freedom in mapping arrays to the memory hierarchy
[Figure: the reuse trees of A and B with every copy-to-layer assignment still open]

Slide 48: Step 5: prune the reuse graph (platform independent)
[Figure: the reuse graph over the hierarchy layers, with full freedom versus pruned]
Quite a few solutions never make sense and can be pruned away.

Slide 49: Step 5: prune the reuse graph further (platform dependent)
[Figure: the pruned reuse graph and the final solution for a 4-layer platform, assigning A, A', B, and B' to layers down to the foreground (FG) memory]

Slide 50: Assign all data reuse trees (multiple arrays) to the memory hierarchy
[Figure: the reuse trees of A (reads R1(A) and R2(A), copies A' and A'') and B (read R1(B), copies B', B'', B''') assigned to layers 1, 2, and 3; in the selected assignment the copy B'' is dropped]

Slide 51: Data reuse on a 1D horizontal convolution
How do we make the copies explicit?
[Figure: an N x M image traversed in row order; the reuse buffer is initialized at the start of each row, and per output pixel the old data is reused while one new element is added]

Slide 52: Introducing a 1D reuse buffer (reuse factor = 7)

Original code (the coefficient array is renamed to coef to avoid clashing with the column index c):

  int in[H][W+8], out[H][W];
  const int coef[] = {1,0,1,2,2,1,0,1};
  for (r=0; r < H; r++)
    for (c=0; c < W; c++)
      for (dc=0; dc < 8; dc++)
        out[r][c] += in[r][c+dc]*coef[dc];

With an intermediate-level reuse buffer:

  int in[H][W+8], out[H][W], buf[8];          /* intermediate-level declaration   */
  const int coef[] = {1,0,1,2,2,1,0,1};
  for (r=0; r < H; r++) {
    for (i=0; i<7; i++)
      buf[i] = in[r][i];                      /* initial copy                     */
    for (c=0; c < W; c++) {
      buf[(c+7)%8] = in[r][c+7];              /* additional copy: one new element */
      for (dc=0; dc < 8; dc++)
        out[r][c] += buf[(c+dc)%8]*coef[dc];  /* reread from buffer               */
    }
  }

Slide 53: Introducing line buffers for vertical filtering
[Figure: instead of keeping the whole intermediate image of size [N][M], only a set of [2GB+1] lines of width [N] is kept]
Why keep the whole image in that case?
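A minimal sketch (not the course code) of such a circular line buffer for the vertical blur; the names, the unweighted kernel, and the sizes N, M, GB are assumptions:

  #include <stdio.h>
  #define N  8                                /* image width               */
  #define M  8                                /* image height              */
  #define GB 1                                /* half size of blur kernel  */

  static int image_in[M][N];
  static int out[M][N];
  static int lines[2*GB + 1][N];              /* line buffer replacing a   */
                                              /* full N x M intermediate   */
  int main(void)
  {
      for (int y = 0; y < M; y++)
          for (int x = 0; x < N; x++)
              image_in[y][x] = y + x;

      for (int y = 0; y < M; y++) {
          /* Produce one new horizontally filtered line into the circular
           * buffer (here simply copied; a real x-blur would go here).    */
          for (int x = 0; x < N; x++)
              lines[y % (2*GB + 1)][x] = image_in[y][x];

          /* Once 2*GB+1 lines are available, the vertical filter consumes
           * them; output row y-GB only needs the lines kept in the buffer. */
          if (y >= 2*GB) {
              for (int x = 0; x < N; x++) {
                  int s = 0;
                  for (int k = 0; k <= 2*GB; k++)
                      s += lines[k][x];       /* slot order is irrelevant   */
                                              /* for an unweighted kernel   */
                  out[y - GB][x] = s / (2*GB + 1);
              }
          }
      }
      printf("%d\n", out[GB][0]);
      return 0;
  }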

Slide 54: Simplified "reuse script"
1. Identify the arrays with sufficient reuse potential.
2. Determine the reuse chains and prune them (for every array read).
3. Determine the reuse trees and prune them (for every array).
4. Determine the reuse graph, including bypasses, and prune it (for the entire application).
5. Determine the memory-hierarchy layer assignment, incorporating the given background-memory restrictions (layers) and the real-time constraints.
6. Introduce the copies in the code: init, update, and use code.
This script applies to scratchpad memories only; for caches a different approach is needed.

Slide 55: Data reuse trees: cavity detector
[Figure: reuse trees for the cavity-detector arrays (image_in, gauss_x, gauss_xy/comp_edge, image_out) towards the CPU, with copy-candidate sizes such as N*M, N*3, N*1, 3*1, 1*3, 3*3, and 1*1, annotated with the number of array reads (e.g. N*M*3, N*M*8) and writes per level]

Slide 56: Memory hierarchy assignment: cavity detector
[Figure: the selected copy candidates of image_in, gauss_x, gauss_xy, comp_edge, and image_out assigned to a three-layer hierarchy: L2 = 1 MB SDRAM, L1 = 16 KB cache, L0 = 128 B register file, with the corresponding access counts (N*M, N*M*3, N*M*8) per layer]

Slide 57: Data reuse & memory hierarchy (to external memory)