Embedded Computer Architecture


Embedded Computer Architecture Data Memory Management Overview 5KK73 TU/e Henk Corporaal Bart Mesman

Data Memory Management Overview
Contents: motivation, example application, DMM steps, results.
Notes: we concentrate on static data structures such as arrays. The Data Transfer and Storage Exploration (DTSE) methodology, on which these slides are based, was developed at IMEC, Leuven.

You want the big picture?
Take a well-written C program that is dominated by "well structured" loop nests and by accesses to "static", big (multi-dimensional) arrays. Transform your code to reduce the number of external memory accesses by a factor of 10, giving a drastic reduction of energy consumption and a significant improvement of speed.

The underlying idea
[Block diagram of a media SoC: VLIW CPU with I$ and D$, video-in/out, audio-in/out, PCI bridge, serial I/O, timers, I2C I/O, and off-chip SDRAM.]
The data storage and data transfer bottlenecks show up in loop nests like:
  for (i=0; i<n; i++)
    for (j=0; j<3; j++)
      for (k=1; k<7; k++)
        B[j] = A[i*4+k];

Platform example: TriMedia
[Diagram: two memory configurations for a TriMedia-based platform, both with a 256M 1-port SDRAM: a software-controlled one with a 16K 2-port SRAM used as SW cache, and a hardware-controlled one with the 8/16KB TriMedia cache plus cache bypass. The CPU has a 128 x 32-bit, 16-port register file and hardware accelerators.]
CPU: VLIW type, 27 functional units of which 5 can be active in parallel; supports 4-way sub-word operations ("dspuquadaddui"); such complex instructions are not supported by the compiler. System bus: 32 bits wide.
Cache: 16 Kbytes, lines of 64 bytes; 3-cycle cache hit, 11-cycle read miss, 3-cycle write miss; controllable: max 50% of the cache can be locked on a certain (single) address range; pseudo dual-ported: 8 banks with 1 port each (adjacent words in different banks).
Cache bypass: without a bypass, ALL accesses go through the cache. This means that arrays which are read only once get loaded into the cache (where they mess everything up), only to be flushed a short while later, never to be loaded again, since they are used only once in the algorithm. In general the bypass is used for signals with low locality, because they would get loaded into the cache (upon first read) but would not be read again (if at all) before being flushed. The cost of using the bypass is a higher latency.

Generic platform model
[Diagram of a generic platform with four memory levels: level 1 close to the CPUs and HW accelerators (I-cache, D-cache, local memories on the chip bus); level 2 (L2 cache and local memory on the on-chip busses, behind a bus interface); level 3 (main memory, behind a bridge); level 4 (disks on a SCSI bus).]

Data transfer and storage power
Power(memory) / Power(arithmetic) = 33 (and a cache miss probably costs both a read and a write). Bottom line: a factor of 33 between a data transfer to external memory and an arithmetic operation.

What about delay of memories? Global wiring delay becomes dominant over gate delay.

Positioning in the Y-chart
[Y-chart: Applications and an Architecture Instance are combined in a Mapping step, followed by Performance Analysis producing Performance Numbers. The data transfer and data storage specific rewrites are applied in the application code, before mapping.]

If you are in a hurry, do this:
Given: an architecture (e.g. TriMedia TM1000) and reference C code for an application (e.g. MPEG-4 motion estimation). Task: map the application on the architecture. But ... wait a moment:
  me@work> tmcc -o mpeg4_me mpeg4_me.c
  Thank you for running TriMedia compiler.
  Your program uses 257321886 bytes, 78 Watt, and 428798765291 clock cycles

Let's help the compiler ...
DTSE (data transfer and storage exploration) transforms the C code of the application, focusing on multi-dimensional arrays, to better exploit the platform capabilities. This overview covers the major steps to improve the power/area/performance trade-off in the context of platform-based design.

Application example
Application domain: computed tomography (CT) in medical imaging. Algorithm: cavity detection in CT scans. It detects dark regions in successive images, which indicate a cavity in the brain (bad news for the owner of the brain).

Application
Pipeline of functions: GaussBlur x, GaussBlur y, ComputeEdges, Reverse (with MaxValue), DetectRoots. Reference (conceptual) C code for the algorithm: all functions map image_in[N x M] -> image_out[N x M]; the new value of a pixel depends on its neighbors; neighbor pixels are read from background memory; approximately 110 lines of C code (ignoring file I/O etc.); experiments with e.g. N x M = 640 x 400 pixels; a straightforward implementation needs 6 image buffers.

Where does this fit in the whole design flow?
[Design-flow diagram: starting from a concurrent OO spec, with stages for removing OO overhead, dynamic memory management, task concurrency management, static memory management (the DTSE part covered here) and address optimization, feeding into the SW design flow and SW/HW co-design.]

DMM (data memory management) principles
DTSE is based on an analysis of the data transfers from/to the main memories. The principles:
- Reduce/eliminate redundant transfers: (1) don't fetch data that you do not need, and do not produce data that you will not use further; (2) eliminate parts of arrays that are never read/written (dead array access elimination). (A small sketch follows below.)
- Introduce data locality, so that production and consumption of data are close together in time and the data can be stored locally in foreground memory.
- Data reuse: exploit the memory hierarchy; store frequently accessed data in smaller memories closer to the data paths (less power per access in smaller memories, and less power in the transfers).
- Avoid N-port memories within the real-time constraints: they are bad for power and area. You can usually design a memory architecture with (many) 1-port memories that still meets the cycle-budget constraints for the memory accesses; only very rarely is an N-port solution superior to a number of smaller 1-port memories.
- Exploit the limited lifetime of array elements.
[Diagram: processor data paths with combined local latches and banks, an L1 memory/cache, an L2 layer, and off-chip SDRAM.]
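A minimal sketch of the first principle, with hypothetical arrays and sizes (not taken from the cavity detector): the border elements of tmp are produced but never consumed, so the corresponding stores are redundant transfers that a data-flow transformation can eliminate.
  #define LEN 640
  static int tmp[LEN], out[LEN];

  void produce_before(const int in[LEN]) {
      for (int i = 0; i < LEN; i++)        /* LEN stores to tmp */
          tmp[i] = 2 * in[i];
  }

  void produce_after(const int in[LEN]) {
      for (int i = 1; i < LEN - 1; i++)    /* LEN-2 stores: the dead accesses to */
          tmp[i] = 2 * in[i];              /* tmp[0] and tmp[LEN-1] are removed  */
  }

  void consume(void) {
      for (int i = 1; i < LEN - 1; i++)    /* only tmp[1..LEN-2] is ever read */
          out[i] = tmp[i] + 1;
  }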

DMM steps
C-in -> Preprocessing -> Data-flow transformations -> Loop transformations -> Data reuse -> Memory hierarchy layer assignment -> Cycle budget distribution -> Memory allocation and assignment -> Data layout -> Address optimization -> C-out

The DMM steps
- Preprocessing: rewrite the code in 3 layers (parts); selective inlining, single-assignment form, ...
- Data-flow transformations: eliminate redundant transfers and storage.
- Loop and control-flow transformations: improve regularity of accesses and data locality.
- Data reuse and memory hierarchy layer assignment: determine when to move which data between memories to meet the cycle budget of the application at low cost, and determine in which layer to put the arrays (and their copies).
Note on regularity of accesses: assume two loops that produce and consume each other's data but traverse it in opposite orders, so the dependencies cross in a star pattern. This always means that a large buffer is needed. It is much better to reverse one of the loops to get a regular access pattern, for which a small buffer, or maybe even a single memory location, suffices (see the sketch below).
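A minimal sketch of that regularity note, with hypothetical arrays (not part of the cavity detector): in the irregular version the producer fills tmp front to back while the consumer reads it back to front, so all of tmp must be buffered; reversing the consumer direction lets every value be consumed right after it is produced, shrinking the buffer to a single scalar.
  #define LEN 1024
  int tmp[LEN];                             /* with crossing dependencies, all of tmp is live */

  void irregular(const int in[LEN], int out[LEN]) {
      for (int i = 0; i < LEN; i++)         /* produce front to back */
          tmp[i] = 2 * in[i];
      for (int i = 0; i < LEN; i++)         /* consume back to front: the last value */
          out[i] = tmp[LEN - 1 - i];        /* produced is the first one needed      */
  }

  void regular(const int in[LEN], int out[LEN]) {
      for (int i = 0; i < LEN; i++) {       /* consumption reversed: production and   */
          int t = 2 * in[i];                /* consumption are adjacent, and the      */
          out[LEN - 1 - i] = t;             /* buffer shrinks to a single scalar      */
      }
  }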

The DMM steps
Per memory layer:
- Cycle budget distribution: determine the memory access constraints for the given cycle budget.
- Memory allocation and assignment: which memories to use, and where to put the arrays.
- Data layout: determine how to combine and place the arrays in the memories.
Finally: address optimization on the final C code.

Preprocessing: dividing an application into the 3 layers
Layer 1 (e.g. Module1a, Module1b, Module2, Module3): testbench call, dynamic event behaviour, synchronisation, mode selection.
Layer 2: the loop nests over the arrays, e.g.
  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      if (i == 0) B[i][j] = 1;
      else B[i][j] = func1(A[i][j], A[i-1][j]);
Layer 3: the arithmetic on scalars, e.g.
  int func1(int a, int b) { return a*b; }

Layered code structure: levels 1+2
main() {
  /* Layer 1 code */
  read_image(IN_NAME, image_in);
  cav_detect();
  write_image(image_out);
}

void cav_detect() {
  /* Layer 2 code */
  for (x=GB; x<=N-1-GB; ++x) {       /* filter size 2GB+1 */
    for (y=GB; y<=M-1-GB; ++y) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; ++k) {
        gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}

Data-flow trafo - cavity detection (before)
/* initialize the full N x M array ... */
for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    gauss_x_image[x][y] = 0;

/* ... then overwrite the (N-2) x (M-2) interior */
for (x=1; x<=N-2; ++x) {
  for (y=1; y<=M-2; ++y) {
    gauss_x_tmp = 0;
    for (k=-1; k<=1; ++k) {
      gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
    }
    gauss_x_image[x][y] = foo(gauss_x_tmp);
  }
}
#accesses: N * M + (N-2) * (M-2)

Data-flow trafo - cavity detection (after)
/* single combined loop: each element is written exactly once */
for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k) {
        gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      gauss_x_image[x][y] = 0;
    }
#accesses: N * M; the gain is almost 50%
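A quick check with the experiment size mentioned earlier (N x M = 640 x 400): the writes to gauss_x_image go from 640*400 + 638*398 = 256,000 + 253,924 = 509,924 down to 640*400 = 256,000, which is where the "almost 50%" gain comes from.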

Data-flow transformations
In total 5 types of data-flow transformations:
- advanced signal substitution and (copy) propagation (a small sketch follows below)
- algebraic transformations (associativity, distributive law, etc.)
- shifting "delay lines"
- re-computation (instead of keeping values alive too long)
- transformations that eliminate bottlenecks for subsequent loop transformations
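A minimal sketch of the first type, copy propagation, on hypothetical arrays: an array that only mirrors another array can be substituted away, which removes both its stores and its storage.
  #define LEN 640
  int A[LEN], B[LEN], out[LEN];

  void with_copy(void) {
      for (int i = 0; i < LEN; i++)
          B[i] = A[i];                   /* B is only a copy of A    */
      for (int i = 1; i < LEN; i++)
          out[i] = B[i] - B[i - 1];      /* all reads go to the copy */
  }

  void copy_propagated(void) {
      /* reads of B replaced by reads of A: the copy loop and B itself disappear */
      for (int i = 1; i < LEN; i++)
          out[i] = A[i] - A[i - 1];
  }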

Data-flow transformation - result

Loop transformations
Loop transformations improve the regularity of accesses and the temporal locality (production -> consumption). Expected influence: reduced temporary storage and (anticipated) background storage.
Before (storage size N):
  for (j=1; j<=M; j++)
    for (i=1; i<=N; i++)
      A[i] = foo(A[i]);
  for (i=1; i<=N; i++)
    out[i] = A[i];
After (storage size 1):
  for (i=1; i<=N; i++) {
    for (j=1; j<=M; j++) {
      A[i] = foo(A[i]);
    }
    out[i] = A[i];
  }

Global loop transformation steps applied to cavity detection
- Make all loop dimensions equal.
- Regularize the loop traversal: possibly a Y and X loop interchange, to follow the order of the input stream.
- Y-loop folding and global merging.
- X-loop folding and global merging.
Result: full, global-scope regularity and nearly complete locality for the main signals (arrays).

Data enters the cavity detector row-wise
[Diagram: a scan device delivers a serial, row-wise pixel stream; the buffer image_in feeds the GaussBlur loop inside the cavity detector.]

Loop trafo - cavity detection
[Diagram: scanner -> N x M buffer -> GaussBlur x. After the X-Y loop interchange the GaussBlur x loop follows the scan order, going from a double N x M buffer to a single buffer.]

Loop interchange (Y <-> X)
Before:
  for (x=0; x<N; x++)
    for (y=0; y<M; y++)
      /* filtering code */
After:
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* filtering code */
Not always possible: check the dependences (a small sketch follows below). Apply it to all loops, to maintain regularity.
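A minimal sketch (hypothetical array, not from the cavity code) of what such a dependence check looks like: in the first nest every element depends on an element from the previous x iteration and a later y iteration, so interchanging x and y would read values before they are written; the second nest carries no such dependence and can be interchanged freely.
  #define NX 64
  #define NY 64
  int A[NX][NY];

  void cannot_interchange(void) {
      for (int x = 1; x < NX; x++)
          for (int y = 0; y < NY - 1; y++)
              A[x][y] = A[x-1][y+1] + 1;   /* dependence distance (+1, -1): interchange illegal */
  }

  void can_interchange(void) {
      for (int y = 0; y < NY; y++)         /* no loop-carried dependence: */
          for (int x = 0; x < NX; x++)     /* x and y may be swapped      */
              A[x][y] = x + y;
  }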

Loop trafo - cavity detection
[Diagram: after repeated folding and loop merging, the buffer between GaussBlur x and GaussBlur y shrinks from N x M to N x (2GB+1), and the buffer between GaussBlur y and ComputeEdges from N x M to N x 3 (3 offset arrays).]

Improve regularity and locality: loop merging
Before:
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* 1st filtering code */
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* 2nd filtering code */
After:
  for (y=0; y<M; y++)
    for (x=0; x<N; x++) {
      /* 1st filtering code */
      /* 2nd filtering code */
    }
!! Often impossible due to dependencies!

Data dependencies between the 1st and 2nd loop
1st loop nest (producer):
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      ... gauss_x_image[x][y] = ...
2nd loop nest (consumer):
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      for (k=-GB; k<=GB; k++)
        ... = ... gauss_x_image[x][y+k] ...

Enable merging with loop folding (bumping)
1st loop nest (unchanged):
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      ... gauss_x_image[x][y] = ...
2nd loop nest, folded over GB iterations (every use of y is compensated by -GB):
  for (y=0+GB; y<M+GB; y++)
    ... y-GB ...
      for (k=-GB; k<=GB; k++)
        ... gauss_x_image[x][y+k-GB] ...
A self-contained sketch of the folding step follows below.
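A minimal sketch of folding (bumping) on a hypothetical 1-D loop, separate from the cavity-detection arrays: the iteration range is shifted by D iterations and every use of the index is compensated, leaving the results unchanged but allowing the loop to be merged with a producer that runs D iterations ahead.
  #define LEN 100
  #define D 2
  int a[LEN];

  void original_loop(void) {
      for (int y = 0; y < LEN; y++)
          a[y] = y;
  }

  void folded_loop(void) {
      for (int y = 0 + D; y < LEN + D; y++)   /* range bumped by D ...      */
          a[y - D] = y - D;                   /* ... every use of y gets -D */
  }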

Y-loop merging result of the 1st and 2nd loop nest
for (y=0; y<M+GB; y++) {
  if (y<M)
    for (x=0; x<N; x++)
      ... gauss_x_image[x][y] = ...
  if (y>=GB)
    for (x=0; x<N; x++)
      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
        ... for (k=-GB; k<=GB; k++) ... gauss_x_image[x][y-GB+k] ...
      else
        ...
}

Simplify the conditions in the merged loop
for (y=0; y<M+GB; y++)
  for (x=0; x<N; x++) {
    if (y<M)
      ... gauss_x_image[x][y] = ...
    if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
      ... for (k=-GB; k<=GB; k++) ... gauss_x_image[x][y-GB+k] ...
    else if (y>=GB)
      ...
  }

Global loop merging/folding steps
1. x <-> y loop interchange (done)
2. Global y-loop folding/merging: 1st and 2nd nest (done)
3. Global y-loop folding/merging: 1st/2nd and 3rd nest
4. Global y-loop folding/merging: 1st/2nd/3rd and 4th nest
5. Global x-loop folding/merging: 1st and 2nd nest
6. Global x-loop folding/merging: 1st/2nd and 3rd nest
7. Global x-loop folding/merging: 1st/2nd/3rd and 4th nest

End result of the global loop trafo
for (y=0; y<M+GB+2; ++y) {
  for (x=0; x<N+2; ++x) {
    ...
    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute[x][y-GB][0] = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_xy_compute[x][y-GB][GB+k+1] =
          gauss_xy_compute[x][y-GB][GB+k] +
          gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
      gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1]/tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
    ...
  }
}

Loop transformations - result

Data re-use & memory hierarchy
[Diagram: register file (power/access P = 0.01), local memory M' (P = 0.1) and main memory M (P = 1) next to the processor data paths; array A is accessed 100 times from the data paths, of which 100 hit the register-file copy, 10 go to M' and 1 to main memory.]
P(original) = #accesses x power/access = 100
P(after)    = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3
Introduce a memory hierarchy: reduce the number of reads from main memory by storing heavily accessed arrays (copies) in smaller memories.
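The slide's numbers written out as a tiny energy model (the power-per-access values are the relative units from the figure, not absolute numbers):
  double energy_original(void) {
      return 100 * 1.0;                 /* all 100 accesses go to main memory (P = 1) */
  }

  double energy_with_hierarchy(void) {
      const double p_rf = 0.01, p_local = 0.1, p_main = 1.0;
      const int    a_rf = 100,  a_local = 10,  a_main = 1;
      return a_rf*p_rf + a_local*p_local + a_main*p_main;   /* = 3.0 */
  }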

Data re-use
Data-flow transformations introduce extra copies of heavily accessed signals.
Step 1: figure out the data re-use possibilities.
Step 2: calculate the possible gain.
Step 3: decide on the data assignment to the memory hierarchy.
Example:
  int A[2][6];
  for (h=0; h<N; h++)
    for (i=0; i<2; i++)
      for (j=0; j<3; j++)
        for (k=0; k<6; k++)
          B[j] = A[i][k];
[Graph: iterations on the horizontal axis, array index (6*i + k) on the vertical axis; for i = 0 the index runs over the first row of A, for i = 1 over the second row.]
Notes: from the CPU each access to A goes to the lowest hierarchy level. That level has size 1*1, so it must always fetch the actual data from the next higher level. That level can store 6 values, so it makes sense as an intermediate level; occasionally (every 7th iteration) it does not hold the correct data either and has to fetch from the next higher level. A is a big array in the application, and in this step we focus on whether we can make a local copy of some part of it (a sketch of such a copy follows below).
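A minimal sketch of the copy that this reuse analysis suggests (the buffer name A_row and the trip count NH are illustrative, not from the slides): one row of A is reused across the three j iterations, so it is copied once per (h, i) step into a small foreground buffer, and the inner reads hit the copy instead of main memory.
  #define NH 100                           /* trip count of the outer h loop (N on the slide) */
  int A[2][6];
  int B[3];

  void with_reuse_copy(void) {
      for (int h = 0; h < NH; h++)
          for (int i = 0; i < 2; i++) {
              int A_row[6];                /* copy in a small, cheap memory       */
              for (int k = 0; k < 6; k++)
                  A_row[k] = A[i][k];      /* 6 background reads per (h, i) step  */
              for (int j = 0; j < 3; j++)
                  for (int k = 0; k < 6; k++)
                      B[j] = A_row[k];     /* the 18 inner reads now hit the copy */
          }
  }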

Data re-use (continued)
[Reuse chain for A: the CPU issues N*2*3*6 reads in total; with a copy of size 6*1 (one row) only N*2*1*6 reads go to the next level; with a copy of size 6*2 (the whole array) only 1*2*1*6 reads reach main memory. The iteration axis is grouped into frame1, frame2, frame3.]

Data re-use tree
[Reuse trees for the main arrays image_in, gauss_x, gauss_xy/comp_edge and image_out: each tree lists the candidate copy sizes (e.g. N*M, M*3, 3*3, 3*1, N*1, 1*1, 1*3) and the access counts between levels (e.g. N*M, N*M*3, N*M*8) down to the CPU.]

Memory hierarchy assignment
[Assignment diagram for image_in, image_out, gauss_x, gauss_xy and comp_edge across three layers: L2 = 1 MB SDRAM holding the full N*M arrays, L1 = 16 KB cache holding M*3 line buffers, L0 = 128 B register file holding 1*1, 3*1 and 3*3 copies; annotated with the access counts (N*M, N*M*3, N*M*8) between the layers.]

Data-reuse - cavity detection code
Code before the reuse transformation:
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      if (x<N && y<M) gauss_x_image[x][y] = 0;
    }
    /* Other merged code omitted ... */
  }
}

Data-reuse - cavity detection code
Code after the reuse transformation:
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* initialize the first in_pixels */
    if (x==0 && y>=1 && y<=M-2)
      for (k=0; k<1; ++k)
        in_pixels[(x+k)%3] = image_in[x+k][y];
    /* copy the rest of in_pixels in the row */
    if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3] = image_in[x+1][y];
    if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
  }
}

Data reuse & memory hierarchy - result

Data layout optimization
At this point the multi-dimensional arrays are to be assigned to physical memories. Data layout optimization determines exactly where in each memory an array should be placed:
- to reduce memory size, by "in-placing" arrays that do not overlap in time (disjoint lifetimes);
- to avoid cache misses due to conflicts (a small sketch follows below);
- to exploit spatial locality of the data in memory, improving the performance of e.g. page-mode memory access sequences.
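A minimal sketch of the cache-conflict aspect (the sizes, the direct-mapped cache assumption and the padding amount are illustrative assumptions, not taken from the slides): if the base addresses of two arrays that are accessed together map to the same cache sets, they evict each other on every iteration; a small amount of padding in the layout shifts one of them to different sets.
  /* Illustrative only: assumes a direct-mapped cache whose size divides
     LEN*sizeof(int) (16 KB here) and 64-byte lines, so a[i] and b[i] in the
     first layout map to the same cache set. */
  #define LEN 4096

  struct layout_conflict { int a[LEN];              int b[LEN]; };
  struct layout_padded   { int a[LEN]; int pad[16]; int b[LEN]; };  /* shift b by one line */

  long dot(const int *a, const int *b) {
      long s = 0;
      for (int i = 0; i < LEN; i++)
          s += (long)a[i] * b[i];   /* conflicting layout: a[i] and b[i] evict each other
                                       every iteration; padded layout: different sets    */
      return s;
  }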

In-place mapping
[Figure: addresses (vertical) versus time (horizontal) for arrays A..E. Intra in-place reuses locations within one array over time; inter in-place lets different arrays (e.g. B, C, D inside the space freed by A) share addresses when their lifetimes are disjoint; combining intra + inter in-place gives the smallest overall footprint.]

In-place mapping
Implements all the "anticipated" memory size savings obtained in the previous steps. It modifies the code to introduce one array per "real" memory and changes the indices into addresses in these memory arrays.
Before (b8 and b6 are presumably bit-width type annotations):
  b8 A[100][100];
  b6 B[20][20];
  for (i,j,k,l; ...)
    B[i][j] = f(B[j][i], A[i+k][j+l]);
After (A occupies mem1[0..9999], B occupies mem1[10000..10399]):
  b8 mem1[10400];
  for (i,j,k,l; ...)
    mem1[10000+i+20*j] = f(mem1[10000+j+20*i], mem1[i+k+100*(j+l)]);
[Memory map: mem1 runs from 0x0 to 0x28a0, with A at 0x0-0x2710 and B above it.]

In-place mapping
The input image is already partly consumed, and not reused again, by the time the first results of the output image are ready.
[Figure: index/address versus time for image_in and image_out, showing that image_out can be stored in-place in the freed part of image_in, leaving a single image buffer.]

In-place - cavity detection code
Before (separate input and output images):
  for (y=0; y<=M+3; ++y) {
    for (x=0; x<N+5; ++x) {
      image_out[x-5][y-3] = ...;   /* code removed */
      ... = image_in[x+1][y];
    }
  }
After (one shared image array):
  for (y=0; y<=M+3; ++y) {
    for (x=0; x<N+5; ++x) {
      image[x-5][y-3] = ...;       /* code removed */
      ... = image[x+1][y];
    }
  }

In-place mapping - results

The last step: ADOPT (Address OPTimization)
DTSE increases execution time: complicated address arithmetic (modulo!) and additional, complex control flow. Additional transformations are needed to:
- simplify the control flow;
- simplify the address arithmetic: common sub-expression elimination, modulo expansion, ... (a small sketch of modulo expansion follows below);
- match the remaining expressions on the target machine.
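A minimal sketch of one of these address optimizations, modulo expansion, applied to an index pattern like the y%3 line-buffer index from the reuse code (the surrounding loop is simplified and hypothetical): the modulo operation is replaced by an explicitly wrapped counter, which is much cheaper than a division-based % on most embedded targets.
  #define NROWS 400
  int gauss_x_lines[3];   /* simplified 1-D stand-in for the y%3 line buffer */

  void with_modulo(void) {
      for (int y = 0; y < NROWS; y++)
          gauss_x_lines[y % 3] = y;        /* modulo in the address computation  */
  }

  void with_modulo_expansion(void) {
      int y_mod = 0;                       /* explicit rotating counter          */
      for (int y = 0; y < NROWS; y++) {
          gauss_x_lines[y_mod] = y;
          if (++y_mod == 3) y_mod = 0;     /* wrap without a division/remainder  */
      }
  }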

ADOPT principles
Example: full-search motion estimation. Original code, with the complete address arithmetic in the indices:
  for (i=-8; i<=8; i++) {
    for (j=-4; j<=3; j++) {
      for (k=-4; k<=3; k++)
        A[((208+i)*257+8+j)*257+16+i+k] = B[(8+j)*257+16+i+k];
    }
    dist += A[3096] - B[((208+i)*257+4)*257+16+i-4];
  }
After algebraic transformations at the word level (common sub-expression elimination), with
  cse1 = (33025*i+6869616)*2;
  cse3 = 1040+i;
  cse4 = j*257+1032;
  cse5 = k+cse4;
the index of A reduces to cse5+cse1 and the index of B to cse5+cse3 (in the dist line: the constant 3096 and cse1). cse1 and cse3 depend only on i, and cse4 only on j, so they are hoisted out of the inner loops:
  for (i=-8; i<=8; i++) {
    for (j=-4; j<=3; j++) {
      for (k=-4; k<=3; k++) {
        B[ ] = A[ ];
      }
    }
    dist += A[ ] - B[ ];
  }

Address optimization - result

Fixing platform parameters
Assume a configurable on-chip memory hierarchy, and trade off power versus cycle budget.
[Plot: power [mW] on the vertical axis (5-25) versus storage cycle budget on the horizontal axis (50,000-150,000), showing the trade-off curve.]

Conclusion
Many applications use large (static) data structures. The access and layout of this data can be heavily optimized, and compilers don't do this: source-code (C-to-C) transformations are needed!
We showed a systematic approach: platform-independent high-level transformations, followed by platform-dependent transformations that exploit the platform characteristics (optimal use of the cache, ...). It yields substantial energy, memory size (cost) and performance improvements on applications such as MPEG-4, OFDM, H.263, ADSL, ...