Embedded Computer Architecture

Presentation on theme: "Embedded Computer Architecture"— Presentation transcript:

1 Embedded Computer Architecture
Data Memory Management Overview 5KK73 TU/e Henk Corporaal Bart Mesman

2 Data Memory Management Overview
Motivation Example application DMM steps Results Notes: We concentrate on Static Data structures like arrays The Data Transfer and Storage Exploration (DTSE) methodology, on which these slides are based, has been developed at IMEC, Leuven 11/14/2018 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

3 You want the big picture?
Take a well-written C program that is dominated by "well-structured" loop nests and by accesses to big "static" (multi-dimensional) arrays. Transform the code to reduce the number of external memory accesses by a factor of 10. This gives a drastic reduction of energy consumption and a significant improvement of speed.

4 The underlying idea
Figure: platform with a VLIW CPU, I$, D$, SDRAM, video-in, video-out, audio-in, audio-out, PCI bridge, serial I/O, timers, and I2C I/O. The figure marks a data storage bottleneck and a data transfer bottleneck for the loop nest: for (i=0; i<n; i++) for (j=0; j<3; j++) for (k=1; k<7; k++) B[j] = A[i*4+k];

5 Platform example: TriMedia
Figure labels: TriMedia (5-issue VLIW processor), 5 out of 27 processor FUs, 128*32b 16-port RegFile, hardware accelerators, 16K 2-port SRAM, 256M 1-port SDRAM, SW cache 8KB (SW controlled), HW 8/16KB CPU cache (HW controlled), cache bypass.
CPU: VLIW type. 27 functional units, of which 5 can be active in parallel. Supports 4-way sub-word operations such as "dspuquadaddui". Complex instructions are not supported by the compiler.
System bus: 32 bits wide.
Cache: 16 Kbytes, lines of 64 bytes. 3-cycle cache hit, 11-cycle read miss, 3-cycle write miss. Controllable: at most 50% of the cache can be locked on a certain (single) address range. Pseudo dual-ported: 8 banks with 1 port each (adjacent words in different banks).
Cache bypass: without a bypass, all accesses go through the cache. Arrays that are read only once are then loaded into the cache, where they displace useful data, only to be flushed a short while later and never loaded again. In general the bypass is used for signals with low locality: they would be loaded into the cache on first read but would not be read again (if at all) before being flushed. The cost of using the bypass is a higher latency.

6 Generic platform model
Figure: generic platform model with four levels (Level-1 to Level-4). Components shown: CPUs with ICache and DCache, HW accelerators with local memories, L2 cache, main memory, and disks, connected by on-chip busses, a bus interface, a chip bus, a bridge, and a SCSI bus.

7 Data transfer and storage power
Power(memory) / Power(arithmetic) = 33. A cache miss is probably a read plus a write. Bottom line: a data transfer to external memory costs a factor of 33 more than an arithmetic operation.

8 What about delay of memories?
Global wiring delay becomes dominant over gate delay.

9 Positioning in the Y-chart
Figure (Y-chart): an architecture instance and a set of applications feed the mapping step, followed by performance analysis, which yields performance numbers. Data transfer and data storage specific rewrites are applied in the application code.

10 If you are in a hurry, do this:
Given: an architecture (e.g. TriMedia TM1000) and reference C code for the application (e.g. MPEG-4 Motion Estimation). Task: map the application on the architecture. Command: tmcc -o mpeg4_me mpeg4_me.c Output: "Thank you for running TriMedia compiler. Your program uses bytes, 78 Watt, and clock cycles". But … wait a moment.

11 Let's help the compiler ...
DTSE (data transfer and storage exploration) transforms the C code of the application, focusing on multi-dimensional arrays, to better exploit platform capabilities. This overview covers the major steps to improve the power, area, and performance trade-off in the context of platform-based design.

12 Application example
Application domain: computer tomography in medical imaging. Algorithm: cavity detection in CT scans. Detect dark regions in successive images. Indicate cavity in brain. Bad news for owner of brain.

13 Application
Stages shown: Max Value, Gauss Blur x, Gauss Blur y, Compute Edges, Reverse, Detect Roots.
Reference (conceptual) C code for the algorithm: all functions map image_in[N x M] -> image_out[N x M]; the new value of a pixel depends on its neighbors; neighbor pixels are read from background memory; approximately 110 lines of C code (ignoring file I/O etc.); experiments with e.g. N x M = 640 x 400 pixels; the straightforward implementation uses 6 image buffers.

14 Where does this fit in the whole design flow?
Figure (design flow) labels: concurrent OO spec, remove OO overhead, dynamic memory mgmt, task concurrency mgmt, static memory mgmt, address optimization, SW/HW co-design, SW design flow, HW design flow.

15 DMM (data mem. mgt.) principles
Figure: processor data paths, L1 memory/cache, L2, and off-chip SDRAM, annotated with the principles: reduce redundant transfers, introduce locality, exploit memory hierarchy, avoid N-port memories within real-time constraints (cache banks: combine local latch 1 & bank 1 up to local latch N & bank N), exploit limited lifetime of array elements.
DTSE is based on analysis of data transfers from/to main memories.
Reduce/eliminate redundant transfers: (1) do not fetch data that you do not need, and do not produce data that you will not use further; (2) eliminate parts of arrays that are never read or written (dead array access elimination).
Introduce data locality, such that data consumption and production are close together in time and data can be stored locally in foreground (FG) memory.
Data reuse: exploit the memory hierarchy by storing frequently accessed data in smaller memories closer to the data paths. Smaller memories use less power per access, and the transfers use less power.
Avoid N-port memories: they are bad for power and area. You can usually design a memory architecture with (many) 1-port memories that still meets the cycle-budget constraints for the memory accesses. Only very rarely is a solution with an N-port memory superior to one that contains a number of smaller 1-port memories.
Exploit the limited lifetime of array elements.

16 DMM steps
C-in → Preprocessing → Dataflow transformations → Loop transformations → Data reuse → Memory hierarchy layer assignment → Cycle budget distribution → Memory allocation and assignment → Data layout → Address optimization → C-out

17 The DMM steps
Preprocessing: rewrite code in 3 layers (parts); selective inlining, single assignment form, ....
Data flow transformations: eliminate redundant transfers and storage.
Loop and control flow transformations: improve regularity of accesses and data locality.
Data re-use and memory hierarchy layer assignment: determine when to move which data between memories to meet the cycle budget of the application at low cost; determine in which layer to put the arrays (and copies).
NOTE on regularity of accesses: assume you have two loops, one producing the data that the other consumes, and that they do this in a star pattern (figure). This always means that a buffer is needed. It is much better to reverse one of the loops to get a regular access pattern (figure), for which a small buffer or maybe even a point solution may suffice.

18 The DMM steps
Per memory layer:
Cycle budget distribution: determine memory access constraints for the given cycle budget.
Memory allocation and assignment: which memories to use, and where to put the arrays.
Data layout: determine how to combine and put arrays into memories.
Address optimization on the final C code.

19 Preprocessing: Dividing an application in the 3 layers
LAYER1 (Module1a, Module1b, Module2, Module3): testbench call, dynamic event behaviour, synchronisation, mode selection.
LAYER2:
for (i=0; i<N; i++)
    for (j=0; j<M; j++)
        if (i == 0) B[i][j] = 1;
        else B[i][j] = func1(A[i][j], A[i-1][j]);
LAYER3:
int func1(int a, int b) { return a*b; }

20 Layered code structure: level 1+2
main() {
    /* Layer 1 code */
    read_image(IN_NAME, image_in);
    cav_detect();
    write_image(image_out);
}
void cav_detect() {
    /* Layer 2 code */
    for (x=GB; x<=N-1-GB; ++x) {            /* filter size 2GB+1 */
        for (y=GB; y<=M-1-GB; ++y) {
            gauss_x_tmp = 0;
            for (k=-GB; k<=GB; ++k) {
                gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
            }
            gauss_x_image[x][y] = foo(gauss_x_tmp);
        }
    }
    …
}

21 Data-flow trafo - cavity detection
Figure labels: M, M-2, N-2 (image and inner-region dimensions).
for (x=0; x<N; ++x)
    for (y=0; y<M; ++y)
        gauss_x_image[x][y] = 0;
for (x=1; x<=N-2; ++x) {
    for (y=1; y<=M-2; ++y) {
        gauss_x_tmp = 0;
        for (k=-1; k<=1; ++k) {
            gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
        }
        gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
}
#accesses: N * M + (N-2) * (M-2)

22 Data-flow trafo - cavity detection
Figure labels: M, M-2, N-2 (image and inner-region dimensions).
for (x=0; x<N; ++x)
    for (y=0; y<M; ++y)
        if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
            gauss_x_tmp = 0;
            for (k=-1; k<=1; ++k) {
                gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
            }
            gauss_x_image[x][y] = foo(gauss_x_tmp);
        } else {
            gauss_x_image[x][y] = 0;
        }
#accesses: N * M; the gain is almost 50%.
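The access counts on the two slides above can be checked with a small sketch. It only counts the writes to gauss_x_image and does no filtering; the function names writes_two_pass and writes_merged are hypothetical, chosen for this illustration.

```c
/* Counts writes to gauss_x_image when the array is first cleared
 * and the interior is then overwritten by the filter (slide 21). */
long writes_two_pass(long n, long m)
{
    long writes = 0;
    for (long x = 0; x < n; ++x)
        for (long y = 0; y < m; ++y)
            ++writes;                       /* gauss_x_image[x][y] = 0 */
    for (long x = 1; x <= n - 2; ++x)
        for (long y = 1; y <= m - 2; ++y)
            ++writes;                       /* gauss_x_image[x][y] = foo(...) */
    return writes;
}

/* Counts writes when border and interior are handled in a single
 * loop nest, so that every element is written exactly once (slide 22). */
long writes_merged(long n, long m)
{
    long writes = 0;
    for (long x = 0; x < n; ++x)
        for (long y = 0; y < m; ++y)
            ++writes;                       /* either branch writes once */
    return writes;
}
```

For the 640 x 400 image mentioned earlier, the first version performs 509924 writes and the merged version 256000, which matches the "almost 50%" gain.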

23 Data-flow transformation
In total 5 types of data-flow transformations: (1) advanced signal substitution and (copy) propagation; (2) algebraic transformations (associativity, distributive law, etc.); (3) shifting "delay lines"; (4) re-computation (instead of keeping values alive too long); (5) transformations to eliminate bottlenecks for subsequent loop transformations.

24 Data-flow transformation - result

25 Loop transformations
Loop transformations improve regularity of accesses and improve temporal locality (production ↔ consumption). Expected influence: reduce temporary storage and (anticipated) background storage.
Before (storage size N):
for (j=1; j<=M; j++)
    for (i=1; i<=N; i++)
        A[i] = foo(A[i]);
for (i=1; i<=N; i++)
    out[i] = A[i];
After (storage size 1):
for (i=1; i<=N; i++) {
    for (j=1; j<=M; j++) {
        A[i] = foo(A[i]);
    }
    out[i] = A[i];
}
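A minimal runnable sketch of the storage reduction from N to 1 follows. It uses 0-based indices, and foo here is a hypothetical stand-in for the per-element operation; the function names and the explicit input array in are assumptions made for the illustration.

```c
/* Hypothetical stand-in for the per-element operation on the slide. */
static int foo(int v) { return 2 * v + 1; }

/* Original order: j outer, i inner. All n elements of a[] stay live
 * across the whole j loop, so storage of size n is needed. */
void before_interchange(const int *in, int *a, int *out, int n, int m)
{
    for (int i = 0; i < n; i++)
        a[i] = in[i];
    for (int j = 0; j < m; j++)
        for (int i = 0; i < n; i++)
            a[i] = foo(a[i]);
    for (int i = 0; i < n; i++)
        out[i] = a[i];
}

/* After interchange and merging: each element is produced and consumed
 * within one i iteration, so a single scalar replaces the array. */
void after_interchange(const int *in, int *out, int n, int m)
{
    for (int i = 0; i < n; i++) {
        int a = in[i];
        for (int j = 0; j < m; j++)
            a = foo(a);
        out[i] = a;
    }
}
```

The interchange is legal here because iteration i never touches the data of any other iteration.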

26 Global loop transformation steps applied to cavity detection
Make all loop dimensions equal. Regularize loop traversal (possibly Y and X loop interchange) to follow the order of the input stream. Y loop folding and global merging. X loop folding and global merging. Result: full, global-scope regularity and nearly complete locality for the main signals (arrays).

27 Data enters Cavity Detector row-wise
Figure: a scan device performs a serial scan into a buffer (= image_in), which feeds the GaussBlur loop of the Cavity Detector.

28 Loop trafo - cavity detection
Figure: scanner, N x M buffer, and Gauss Blur x, with the X and Y traversal directions before and after X-Y loop interchange. From double buffer to single buffer.

29 Loop interchange (Y ↔ X)
Before:
for (x=0; x<N; x++)
    for (y=0; y<M; y++)
        /* filtering code */
After:
for (y=0; y<M; y++)
    for (x=0; x<N; x++)
        /* filtering code */
Not always possible; check dependences. Applied to all loops, to maintain regularity.

30 Loop trafo - cavity detection
Figure: repeated folding and loop merging of Gauss Blur x, Gauss Blur y, and Compute Edges. Buffer sizes shrink from N x M to N x (2GB+1), and from N x M to N x 3 (3 offset arrays).

31 Improve regularity and locality → Loop Merging
Before:
for (y=0; y<M; y++)
    for (x=0; x<N; x++)
        /* 1st filtering code */
for (y=0; y<M; y++)
    for (x=0; x<N; x++)
        /* 2nd filtering code */
After:
for (y=0; y<M; y++)
    for (x=0; x<N; x++) {
        /* 1st filtering code */
        /* 2nd filtering code */
    }
!! Often impossible due to dependencies!

32 Data dependencies between 1st and 2nd loop
for (y=0; y<M; y++)
    for (x=0; x<N; x++)
        gauss_x_image[x][y] = …
for (y=0; y<M; y++)
    for (x=0; x<N; x++)
        for (k=-GB; k<=GB; k++)
            … = … gauss_x_image[x][y+k] …

33 Enable merging with Loop Folding (bumping)
for (y=0; y<M; y++)
    for (x=0; x<N; x++)
        gauss_x_image[x][y] = …
for (y=0+GB; y<M+GB; y++)
    for (x=0; x<N; x++)
        … y-GB …
        for (k=-GB; k<=GB; k++)
            … gauss_x_image[x][y+k-GB] …

34 Y-loop merging result of 1st and 2nd loop nest
for (y=0; y<M+GB; y++) {
    if (y<M)
        for (x=0; x<N; x++)
            gauss_x_image[x][y] = …
    if (y>=GB)
        for (x=0; x<N; x++)
            if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
                … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y-GB+k] …
            else
                …
}

35 Simplify conditions in merged loop
for (y=0; y<M+GB; y++)
    for (x=0; x<N; x++) {
        if (y<M)
            gauss_x_image[x][y] = …
        if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
            … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y-GB+k] …
        else if (y>=GB)
            …
    }
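The folding and merging of slides 32 to 35 can be tried out on a one-dimensional sketch. It assumes GB = 1, an unweighted 3-tap sum as the second stage, and a hypothetical first stage stage1; it also already stores the intermediate values in a (2GB+1)-entry circular buffer, which the slides only introduce in later steps.

```c
enum { GB = 1, WIN = 2 * GB + 1 };

/* Hypothetical stand-in for the first filter stage. */
static int stage1(int v) { return 2 * v; }

/* Reference: two separate loops with a full intermediate array tmp[m]. */
void two_loops(const int *in, int *tmp, int *out, int m)
{
    for (int y = 0; y < m; y++)
        tmp[y] = stage1(in[y]);
    for (int y = 0; y < m; y++) {
        if (y >= GB && y <= m - 1 - GB) {
            int acc = 0;
            for (int k = -GB; k <= GB; k++)
                acc += tmp[y + k];
            out[y] = acc;
        } else {
            out[y] = 0;
        }
    }
}

/* Folded and merged: the consumer runs GB iterations behind the producer,
 * so only the last WIN produced values have to be kept. */
void folded_merged(const int *in, int *out, int m)
{
    int lines[WIN];
    for (int y = 0; y < m + GB; y++) {
        if (y < m)
            lines[y % WIN] = stage1(in[y]);
        if (y >= GB) {
            int yc = y - GB;                 /* position being consumed */
            if (yc >= GB && yc <= m - 1 - GB) {
                int acc = 0;
                for (int k = -GB; k <= GB; k++)
                    acc += lines[(yc + k) % WIN];
                out[yc] = acc;
            } else {
                out[yc] = 0;
            }
        }
    }
}
```

Shifting the consumer by GB iterations turns the forward dependence on tmp[y+k] into a dependence on values that have already been produced, which is what makes the merge legal.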

36 Global loop merging/folding steps
1. x ↔ y loop interchange (done)
2. Global y-loop folding/merging: 1st and 2nd nest (done)
3. Global y-loop folding/merging: 1st/2nd and 3rd nest
4. Global y-loop folding/merging: 1st/2nd/3rd and 4th nest
5. Global x-loop folding/merging: 1st and 2nd nest
6. Global x-loop folding/merging: 1st/2nd and 3rd nest
7. Global x-loop folding/merging: 1st/2nd/3rd and 4th nest

37 End result of global loop trafo
for (y=0; y<M+GB+2; ++y) {
    for (x=0; x<N+2; ++x) {
        if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
            gauss_xy_compute[x][y-GB][0] = 0;
            for (k=-GB; k<=GB; ++k)
                gauss_xy_compute[x][y-GB][GB+k+1] = gauss_xy_compute[x][y-GB][GB+k] + gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
            gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1]/tot;
        } else if (x<N && (y-GB)>=0 && (y-GB)<M)
            gauss_xy_image[x][y-GB] = 0;
        …
    }
}

38 Loop transformations - result

39 Data re-use & memory hierarchy
Figure: processor data paths connected to a hierarchy of a register file, an intermediate memory M', and the main memory M, with power per access P = 0.01, P = 0.1, and P = 1 (main memory), accessed 100, 10, and 1 times respectively. Originally all #A = 100 accesses go to main memory.
P (original) = # accesses x power/access = 100 x 1 = 100
P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3
Introduce a memory hierarchy: reduce the number of reads from main memory; heavily accessed arrays are stored in smaller memories.
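The power estimate on this slide is a sum over layers of access count times power per access. A minimal sketch, with a hypothetical function name and the relative (unitless) power figures from the slide:

```c
/* Total relative power of a memory hierarchy:
 * sum over layers of (number of accesses) * (power per access). */
double hierarchy_power(const long *accesses,
                       const double *power_per_access,
                       int layers)
{
    double total = 0.0;
    for (int i = 0; i < layers; i++)
        total += (double)accesses[i] * power_per_access[i];
    return total;
}
```

With access counts {100, 10, 1} and power per access {0.01, 0.1, 1.0} this gives approximately 3, against 100 when all 100 accesses go to main memory.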

40 Data re-use
Data flow transformations to introduce extra copies of heavily accessed signals. Step 1: figure out data re-use possibilities. Step 2: calculate possible gain. Step 3: decide on data assignment to memory hierarchy.
int A[2][6];
for (h=0; h<N; h++)
    for (i=0; i<2; i++)
        for (j=0; j<3; j++)
            for (k=0; k<6; k++)
                B[j] = A[i][k];
Figure: array index (6 * i + k) plotted against iterations.
The CPU issues one read of A per iteration. This read is served by the lowest hierarchy level. That level has size 1*1, so it always has to fetch the actual data from the next higher level. The next level can store 6 values, so it makes sense as an intermediate level. Occasionally (whenever i changes, i.e. after 3*6 = 18 reads) this layer will not hold the correct data either, so it needs to fetch it from the next higher level …. In the application A is actually a big array, and in this step we focus on figuring out whether we can make a local copy of some part of it. Please note the axes in the graph. The vertical one gives the array index. For i = 0 the index ranges from 0 to 5; for i = 1 it ranges from 6 to 11 (A[1][0] = A[1*6+0] = A[6] and A[1][5] = A[1*6+5] = A[11], etc.).

41 Data re-use
Figure: array index plotted against iterations (frame1, frame2, frame3), with the chain of copy candidates: a 6*2 copy filled with 1*2*1*6 transfers, a 6*1 copy filled with N*2*1*6 transfers, and N*2*3*6 reads by the CPU.
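The transfer counts for the 6*1 copy candidate can be reproduced with a sketch that adds an explicit row buffer to the example loop nest. The struct reuse_counts, the function name, and the constant names are hypothetical and serve only to count transfers.

```c
enum { ROWS = 2, COLS = 6, REPEAT = 3 };

typedef struct {
    long cpu_reads;   /* reads issued by the CPU, served by the row copy */
    long row_fills;   /* transfers from A into the 6-element row copy */
} reuse_counts;

/* The example loop nest with a 6*1 copy of the current row of A. */
reuse_counts run_with_row_copy(int A[ROWS][COLS], int B[REPEAT], int n)
{
    reuse_counts c = {0, 0};
    int row[COLS];
    for (int h = 0; h < n; h++)
        for (int i = 0; i < ROWS; i++) {
            for (int k = 0; k < COLS; k++) {   /* fill the copy once per row */
                row[k] = A[i][k];
                c.row_fills++;
            }
            for (int j = 0; j < REPEAT; j++)   /* the row is reused 3 times */
                for (int k = 0; k < COLS; k++) {
                    B[j] = row[k];
                    c.cpu_reads++;
                }
        }
    return c;
}
```

For n = 4 the CPU issues 4*2*3*6 = 144 reads, while only 4*2*1*6 = 48 values are fetched from the level above, a factor of 3 fewer.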

42 Data re-use tree
Figure: one chain of copy candidates per array (image_in, gauss_x, gauss_xy/comp_edge, image_out), from the N*M array down to the CPU. Copy-candidate sizes shown: N*M, M*3, N*1, 3*3, 1*3, 3*1, 1*1. Transfer counts shown on the edges: N*M, N*M*3, N*M*8.

43 Memory hierarchy assignment
Figure: copy candidates of image_in, image_out, gauss_x, gauss_xy, and comp_edge assigned to layers. L2: 1 MB SDRAM, N*M arrays. L1: 16 KB cache, M*3 buffers. L0: 128 B register file, 1*1, 3*1, and 3*3 buffers. Transfer counts shown between layers: N*M, N*M*3, N*M*8.

44 Data-reuse - cavity detection code
Code before reuse transformation:
for (y=0; y<M+3; ++y) {
    for (x=0; x<N+2; ++x) {
        if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
            gauss_x_tmp = 0;
            for (k=-1; k<=1; ++k)
                gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
            gauss_x_image[x][y] = foo(gauss_x_tmp);
        } else {
            if (x<N && y<M)
                gauss_x_image[x][y] = 0;
        }
        /* Other merged code omitted … */
    }
}

45 Data-reuse - cavity detection code
Code after reuse transformation:
for (y=0; y<M+3; ++y) {
    for (x=0; x<N+2; ++x) {
        /* first in_pixels initialized */
        if (x==0 && y>=1 && y<=M-2)
            for (k=0; k<1; ++k)
                in_pixels[(x+k)%3] = image_in[x+k][y];
        /* copy rest of in_pixels in row */
        if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
            in_pixels[(x+1)%3] = image_in[x+1][y];
        if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) {
            gauss_x_tmp = 0;
            for (k=-1; k<=1; ++k)
                gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[abs(k)];
            gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
        } else if (x<N && y<M)
            gauss_x_lines[x][y%3] = 0;
    }
}
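The in_pixels buffer is a 3-entry circular window over one image row. A one-dimensional sketch of the same idea, under the assumptions that the row has at least one pixel and that foo is the identity; the function name and the returned read count are hypothetical additions for checking the effect:

```c
#include <stdlib.h>

/* 3-tap filter over one row using a 3-entry circular window.
 * Returns the number of reads from row_in; each pixel is read once. */
int blur_row_with_window(const int *row_in, int *row_out, int n,
                         const int gauss[2])
{
    int win[3];
    int reads = 0;

    win[0] = row_in[0];                      /* first window entry */
    reads++;
    for (int x = 0; x < n; ++x) {
        if (x + 1 < n) {                     /* bring in the next pixel */
            win[(x + 1) % 3] = row_in[x + 1];
            reads++;
        }
        if (x >= 1 && x <= n - 2) {
            int tmp = 0;
            for (int k = -1; k <= 1; ++k)
                tmp += win[(x + k) % 3] * gauss[abs(k)];
            row_out[x] = tmp;
        } else {
            row_out[x] = 0;                  /* border */
        }
    }
    return reads;
}
```

Without the window every interior output reads 3 pixels from the row; with it the row is read exactly once, which is the N*M*3 to N*M reduction in the re-use tree.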

46 Data reuse & memory hierarchy

47 Data layout optimization
At this point multi-dimensional arrays are to be assigned to physical memories. Data layout optimization determines exactly where in each memory an array should be placed, in order to: reduce memory size by "in-placing" arrays that do not overlap in time (disjoint lifetimes); avoid cache misses due to conflicts; exploit spatial locality of the data in memory to improve the performance of, e.g., page-mode memory access sequences.

48 In-place mapping
Figure: arrays A, B, C, D, and E drawn as address ranges over time, shown for inter in-place mapping, intra in-place mapping, and both combined (intra + inter).

49 In-place mapping
Implements all the "anticipated" memory size savings obtained in previous steps. Modifies code to introduce one array per "real" memory. Changes indices to addresses in memory arrays.
Before:
b8 A[100][100]; b6 B[20][20];
for (i,j,k,l; …)
    B[i][j] = f(B[j][i], A[i+k][j+l]);
After:
b8 mem1[10400];
for (i,j,k,l; …)
    mem1[10000+i+20*j] = f(mem1[10000+j+20*i], b6(mem1[i+k+100*(j+l)]));
Figure: memory map of mem1 with A starting at address 0x0, B starting at 0x2710, and the end of the memory at 0x28a0.
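The index-to-address rewrite on this slide can be written as two small address functions. This is a sketch that follows the slide's layout (first index varies fastest, B placed directly after A); the names addr_a, addr_b, and the constants are hypothetical.

```c
enum {
    A_DIM     = 100,                       /* A[100][100] */
    B_DIM     = 20,                        /* B[20][20]   */
    B_BASE    = A_DIM * A_DIM,             /* B starts right after A */
    MEM1_SIZE = B_BASE + B_DIM * B_DIM     /* total size of mem1 */
};

/* Address of A[r][c] inside mem1, as in mem1[i+k+100*(j+l)]. */
static int addr_a(int r, int c) { return r + A_DIM * c; }

/* Address of B[r][c] inside mem1, as in mem1[10000+i+20*j]. */
static int addr_b(int r, int c) { return B_BASE + r + B_DIM * c; }
```

B_BASE evaluates to 10000 (0x2710) and MEM1_SIZE to 10400 (0x28a0), matching the addresses in the memory map.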

50 In-place mapping
The input image is partly consumed, and not reused again, by the time the first results for the output image are ready.
Figure: index versus time for Image_in and for Image_out, and address versus time for the combined array Image.

51 In-place - cavity detection code
Before:
for (y=0; y<=M+3; ++y) {
    for (x=0; x<N+5; ++x) {
        image_out[x-5][y-3] = …;
        /* code removed */
        … = image_in[x+1][y];
    }
}
After:
for (y=0; y<=M+3; ++y) {
    for (x=0; x<N+5; ++x) {
        image[x-5][y-3] = …;
        /* code removed */
        … = image[x+1][y];
    }
}
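The same idea in one dimension: when the output lags behind the input, results can be written into locations whose input values are already dead. The sketch below is a simplified, hypothetical example (a 3-tap sum, output shifted by one position), not the cavity detection code itself.

```c
/* In-place 3-tap sum: the result for centre c is stored at image[c-1].
 * When iteration c writes image[c-1], the old value there has just been
 * read for the last time, so input and output can share one array. */
void smooth_in_place(int *image, int n)
{
    for (int c = 1; c <= n - 2; ++c) {
        int sum = image[c - 1] + image[c] + image[c + 1];
        image[c - 1] = sum;
    }
}
```

Writing the result at image[c] instead would destroy a value that iteration c+1 still needs; the shift plays the role of the x-5 and y-3 offsets in the slide's code.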

52 In-place mapping - results

53 The last step: ADOPT (Address OPTimization)
DTSE introduces increased execution time: complicated address arithmetic (modulo!) and additional complex control flow. Additional transformations are needed to: simplify control flow; simplify address arithmetic (common sub-expression elimination, modulo expansion, …); match the remaining expressions on the target machine.
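One way to remove a modulo from an address expression such as y%3 is to carry a small counter that wraps around. The sketch below illustrates that general idea; it is an assumption that this corresponds to what ADOPT does under the name modulo expansion, and the function names are hypothetical.

```c
/* Line-buffer index computed with a modulo in every iteration. */
void index_with_modulo(int *idx, int n)
{
    for (int y = 0; y < n; ++y)
        idx[y] = y % 3;
}

/* Same index sequence with an incrementally updated counter:
 * one increment and one compare replace the modulo operation. */
void index_with_counter(int *idx, int n)
{
    int y_mod3 = 0;
    for (int y = 0; y < n; ++y) {
        idx[y] = y_mod3;
        if (++y_mod3 == 3)
            y_mod3 = 0;
    }
}
```

Both functions produce the sequence 0, 1, 2, 0, 1, 2, …; the second avoids a division on targets where the modulo is expensive.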

54 ADOPT principles
Example: full-search motion estimation. Algebraic transformations at word level.
Before:
for (i=-8; i<=8; i++) { for (j=-4; j<=3; j++) { for (k=-4; k<=3; k++) A[((208+i)*257+8+j)* i+k] = B[(8+j)* i+k]; } dist += A[3096] - B[((208+i)*257+4)* i-4]; }
After:
for (i=-8; i<=8; i++) { for (j=-4; j<=3; j++) { for (k=-4; k<=3; k++) { B[ ] = A[ ]; }} dist += A[ ] - B[ ]; }
Common sub-expressions: cse1 = (33025*i )*2; cse3 = 1040+i; cse4 = j* ; cse5 = k+cse4; cse5+cse1 = cse5+cse3 cse1

55 Address optimization - result

56 Fixing platform parameters
Assume a configurable on-chip memory hierarchy. Trade-off: power versus cycle budget.
Figure: storage power [mW] (axis from 5 to 25) plotted against the cycle budget (axis from 50,000 to 150,000).

57 Conclusion
Many applications use large (static) data structures. Access and layout of this data can be heavily optimized. Compilers don't do this, so source code (C-to-C) transformations are needed. We showed a systematic approach: platform-independent high-level transformations, and platform-dependent transformations that exploit platform characteristics (optimal use of cache, …). Substantial energy, memory size (cost), and performance improvements: MPEG-4, OFDM, H.263, ADSL, ...

