Embedded Systems in Silicon TD5102 Data Management (1) Overview Henk Corporaal Technical University.

Presentation transcript:

Embedded Systems in Silicon TD5102 Data Management (1) Overview Henk Corporaal Technical University Eindhoven DTI / NUS Singapore 2005/2006

H.C. TD5102, slide 2: Data Management Overview
Motivation
Example application
Data Management (DM) steps
Results
Important note:
– We consider here statically declared data structures only
– DM is also called
  – DTSE (Data Transfer and Storage Exploration), or
  – Physical Memory Management

[Figure: design flow. Labels: Concurrent OO spec; Remove OO overhead; Dynamic memory mgmt; Task concurrency mgmt; Physical memory mgmt; Address optimization; SW/HW co-design; SW design flow; HW design flow.]

The underlying idea
[Figure: VLIW CPU with I$ and D$, external SDRAM, and peripherals (video-in, video-out, audio-in, audio-out, PCI bridge, serial I/O, timers, I2C I/O).]
for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=1; k<7; k++)
      B[j] = A[i*4+k];
Figure labels on the access B[j] = A[i*4+k]: data storage bottleneck and data transfer bottleneck (SDRAM, D$).

Platform architecture model
[Figure: chip with CPUs and HW accelerators connected by on-chip busses, with memory levels 1 to 4. Labels: ICache, DCache, local memories, L2 cache, main memory, disk, bus-if, bridge, SCSI bus, disk bus.]

Platform example: TriMedia
[Figure labels: 5 out of 27 processor FU's; 128*32b 16-port RegFile; hardware accelerators; TriMedia TM M 1-port SDRAM; 16K 2-port SRAM; 256M 1-port SDRAM; SW cache 8KB; TriMedia TM1000 cache; HW cache 8/16KB; CPU; cache bypass; SW controlled; HW controlled.]

Data transfer and storage power
Power(memory) / Power(arithmetic) = 33

Positioning in the Y-chart
[Figure: Y-chart. Applications and Architecture Instance feed Mapping, followed by Performance Analysis and Performance Numbers. Label: data transfer and data storage specific rewrites in the application code.]

Current practice
Mapping, easy, but ...
Given
– reference C code for the application, e.g. MPEG-4 Motion Estimation
– platform: SUPERDUPER-LX50
Task
– map the application on the architecture
But … wait a moment
CC –o2 mpeg4_me mpeg4_me.c
Thank you for running SUPERDUPER-LX50 compiler. Your program uses … bytes memory, 78 Watt, … clock cycles
[Figure labels: a=b*5+d; for (...) {.. }; Idea]

Let's help the compiler...
DTSE: data transfer and storage exploration
DTSE is a methodology to explore data transfer and data storage in multi-media applications
– Transforms the C code of the application
– By focusing on multi-dimensional signals (arrays)
– To better exploit platform capabilities
This overview covers the major steps to improve the power, area, performance trade-off

Data Management principles
– Exploit memory hierarchy
– Exploit limited life-time
– Avoid N-port memories within real-time constraints
– Reduce redundant transfers
– Introduce locality
[Figure: processor data paths, L1 cache, L2 cache, off-chip SDRAM; cache banks with local latch 1 & bank 1 up to local latch N & bank N, combined.]

DM steps
C-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse
Memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
Address optimization
C-out

The DM steps
Preprocessing
– Rewrite code in 3 layers (parts)
– Selective inlining, single assignment form, ...
Data flow transformations
– Eliminate redundant transfers and storage
Loop and control flow transformations
– Improve regularity of accesses and data locality
Data re-use and memory hierarchy layer assignment
– Determine when to move which data between memories to meet the cycle budget of the application at low cost
– Determine in which layer to put the arrays (and copies)

The DM steps (continued)
Per memory layer:
Cycle budget distribution
– determine memory access constraints for the given cycle budget
Memory allocation and assignment
– which memories to use, and where to put the arrays
Data layout
– determine how to combine and put arrays into memories
Address optimization on the final C code

Application example
Application domain:
– Computer Tomography in medical imaging
Algorithm:
– Cavity detection in CT scans
– Detect dark regions in successive images
– Indicate cavity in brain -> bad news for owner of brain

Data enters Cavity Detector row-wise
[Figure: scan device, serial scan into a buffer (= image_in), then the Cavity Detector; the GaussBlur loop is the first stage.]

Application
Reference (conceptual) C code for the algorithm
– all functions: image_in[N x M] at t-1 -> image_out[N x M] at t
– new value of a pixel depends on its neighbors
– neighbor pixels are read from background memory
– approximately 110 lines of C code (ignoring file I/O etc.)
– experiments with N x M = 640 x 400 pixels
– straightforward implementation: 6 image buffers
[Figure: processing stages. Labels: Gauss Blur x, Gauss Blur y, Compute Edges, Reverse, Detect Roots, Max Value.]

Preprocessing: dividing an application into the 3 layers
LAYER 1: Module1a, Module1b, Module2, Module3; synchronisation
 - testbench call
 - dynamic event behaviour
 - mode selection
LAYER 2:
  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      if (i == 0) B[i][j] = 1;
      else B[i][j] = func1(A[i][j], A[i-1][j]);
LAYER 3:
  int func1(int a, int b) { return a*b; }

Layered code structure
main() { /* Layer 1 code */
  read_image(IN_NAME, image_in);
  cav_detect();
  write_image(image_out);
}
void cav_detect() { /* Layer 2 code */
  for (x=GB; x<=N-1-GB; ++x) {
    for (y=GB; y<=M-1-GB; ++y) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; ++k) {
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}

Layered code structure (continued)
int foo(int arg1) { /* Layer 3 */
  /* arithmetic, data-dependent operations
   * to be mapped to data-path, controller */
}
void cav_detect() { /* Layer 2 code */
  for (x=GB; x<=N-1-GB; ++x) {
    for (y=GB; y<=M-1-GB; ++y) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; ++k) {
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  } /* Makes code for data access */
}   /* and data transfer explicit */

Data-flow trafo - cavity detection (before)
for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    gauss_x_image[x][y] = 0;
for (x=1; x<=N-2; ++x) {
  for (y=1; y<=M-2; ++y) {
    gauss_x_tmp = 0;
    for (k=-1; k<=1; ++k) {
      gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
    }
    gauss_x_image[x][y] = foo(gauss_x_tmp);
  }
}
[Figure: N x M image with an (N-2) x (M-2) interior region.]
#accesses: N * M + (N-2) * (M-2)

Data-flow trafo - cavity detection (after)
for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k) {
        gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      gauss_x_image[x][y] = 0;
    }
[Figure: N x M image with an (N-2) x (M-2) interior region.]
#accesses: N * M
gain is ± 50 %
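The effect of merging the clearing loop into the compute loop can be checked with a small counting experiment. The sketch below is illustrative only: the image size, the filter taps and the body of foo() are assumptions, not values from the slides. It counts writes to the output array for both versions and checks that the results are identical.

```c
#include <assert.h>
#include <string.h>

enum { N = 8, M = 6 };                  /* illustrative image size (assumption) */
static const int Gauss[2] = { 2, 1 };   /* illustrative filter taps (assumption) */

static int foo(int v) { return v / 4; } /* hypothetical stand-in for the slide's foo() */

/* 3-tap blur along x at position (x, y). */
static int blur_at(int in[N][M], int x, int y) {
    int tmp = 0;
    for (int k = -1; k <= 1; ++k)
        tmp += in[x + k][y] * Gauss[k < 0 ? -k : k];
    return foo(tmp);
}

/* Before: clear the whole image, then overwrite the interior. Returns #writes. */
long blur_two_pass(int in[N][M], int out[N][M]) {
    long writes = 0;
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < M; ++y) { out[x][y] = 0; ++writes; }
    for (int x = 1; x <= N - 2; ++x)
        for (int y = 1; y <= M - 2; ++y) { out[x][y] = blur_at(in, x, y); ++writes; }
    return writes;
}

/* After: one pass, every output element is written exactly once. Returns #writes. */
long blur_one_pass(int in[N][M], int out[N][M]) {
    long writes = 0;
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < M; ++y) {
            if (x >= 1 && x <= N - 2 && y >= 1 && y <= M - 2)
                out[x][y] = blur_at(in, x, y);
            else
                out[x][y] = 0;
            ++writes;
        }
    return writes;
}

/* Returns 1 when both versions give the same image and the expected write counts. */
int blur_versions_agree(void) {
    int in[N][M], a[N][M], b[N][M];
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < M; ++y)
            in[x][y] = 3 * x + 5 * y;
    long w2 = blur_two_pass(in, a);
    long w1 = blur_one_pass(in, b);
    return w2 == (long)N * M + (long)(N - 2) * (M - 2)
        && w1 == (long)N * M
        && memcmp(a, b, sizeof a) == 0;
}
```

For large images the two-pass count approaches 2 * N * M, which is where the roughly 50 % gain quoted on the slide comes from.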

Data-flow transformation
In total 5 types of data-flow transformations:
– advanced signal substitution and (copy) propagation
– algebraic transformations (associativity, etc.)
– shifting "delay lines"
– re-computation
– transformations to eliminate bottlenecks for subsequent loop transformations

Loop transformations
– improve regularity of accesses
– improve temporal locality: production <-> consumption
Expected influence
– reduce temporary storage and (anticipated) background storage
Before (storage size N):
  for (j=1; j<=M; j++)
    for (i=1; i<=N; i++)
      A[i] = foo(A[i]);
  for (i=1; i<=N; i++)
    out[i] = A[i];
After (storage size 1):
  for (i=1; i<=N; i++) {
    for (j=1; j<=M; j++) {
      A[i] = foo(A[i]);
    }
    out[i] = A[i];
  }
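A minimal sketch of why the transformed loop needs storage of size 1: once the i loop is outermost, each element is produced and consumed before the next one is touched, so a single scalar can replace the array. The sizes and the body of foo() are assumptions made for illustration.

```c
#include <assert.h>

enum { N = 16, M = 3 };                      /* illustrative sizes (assumption) */

static int foo(int v) { return 2 * v + 1; }  /* hypothetical stand-in for foo() */

/* Before: j is the outer loop, so all N elements stay live between sweeps. */
void process_array_buffer(const int in[N], int out[N]) {
    int A[N];                                /* storage size N */
    for (int i = 0; i < N; i++) A[i] = in[i];
    for (int j = 1; j <= M; j++)
        for (int i = 0; i < N; i++)
            A[i] = foo(A[i]);
    for (int i = 0; i < N; i++) out[i] = A[i];
}

/* After: i is the outer loop, one element is finished before the next starts. */
void process_scalar_buffer(const int in[N], int out[N]) {
    for (int i = 0; i < N; i++) {
        int a = in[i];                       /* storage size 1 */
        for (int j = 1; j <= M; j++)
            a = foo(a);
        out[i] = a;
    }
}

/* Returns 1 when both versions produce identical output. */
int storage_versions_agree(void) {
    int in[N], a[N], b[N];
    for (int i = 0; i < N; i++) in[i] = i;
    process_array_buffer(in, a);
    process_scalar_buffer(in, b);
    for (int i = 0; i < N; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

The reordering is legal here because iteration i never reads a value written by another iteration i'.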

Global loop transformation steps applied to cavity detection
Removal of data-flow bottleneck
– allows merging of loops
– done in the global data-flow trafo step
Make all loop dimensions equal
Regularize loop traversal: Y and X loop interchange
– follow the order of the input stream
Y loop folding and global merging
X loop folding and global merging
– full, global scope regularity
– nearly complete locality for main signals

Loop trafo - cavity detection: X-Y loop interchange
[Figure: scanner feeding an N x M buffer into Gauss Blur x, which writes an N x M buffer; X and Y traversal directions shown.]
From double buffer to single buffer

Loop interchange (Y <-> X)
Single assignment -> always possible
For all loops, to maintain regularity
Before:
  for (x=0; x<N; x++)
    for (y=0; y<M; y++)
      /* filtering code */
After:
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* filtering code */

Loop trafo - cavity detection: repeated fold and loop merge
[Figure: Gauss Blur x, Gauss Blur y and Compute Edges connected through line buffers (offset arrays).]
From N x M to N x (2GB+1) buffer size
From N x M to N x (3) buffer size

Improve regularity and locality -> loop merging
Before:
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* 1st filtering code */
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* 2nd filtering code */
Desired:
  for (y=0; y<M; y++) {
    for (x=0; x<N; x++)
      /* 1st filtering code */
    for (x=0; x<N; x++)
      /* 2nd filtering code */
  }
!! Impossible due to dependencies!

Data dependencies between 1st and 2nd loop
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    … gauss_x_image[x][y] = …
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    …
    for (k=-GB; k<=GB; k++)
      … = … gauss_x_image[x][y+k] …

Enable merging with loop folding (bumping)
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    … gauss_x_image[x][y] = …
for (y=0+GB; y<M+GB; y++)
  for (x=0; x<N; x++)
    … y-GB …
    for (k=-GB; k<=GB; k++)
      … gauss_x_image[x][y+k-GB] …

Y-loop merging on 1st and 2nd loop nest
for (y=0; y<M+GB; y++) {
  if (y<M)
    for (x=0; x<N; x++)
      … gauss_x_image[x][y] = …
  if (y>=GB)
    for (x=0; x<N; x++)
      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
        …
        for (k=-GB; k<=GB; k++)
          … gauss_x_image[x][y-GB+k] …
      else …
}

Simplify conditions in merged loop nest
for (y=0; y<M+GB; y++) {
  for (x=0; x<N; x++)
    if (y<M)
      … gauss_x_image[x][y] = …
  for (x=0; x<N; x++)
    if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
      …
      for (k=-GB; k<=GB; k++)
        … gauss_x_image[x][y-GB+k] …
    else if (y>=GB) …
}
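The folding and merging sequence above can be illustrated in one dimension. In this sketch the consumer runs GB iterations behind the producer inside a single loop, which also allows the intermediate signal to live in a circular buffer of 2GB+1 entries instead of a full array. The producer function and the sizes are hypothetical; the slides apply the same idea per image line.

```c
#include <assert.h>
#include <string.h>

enum { M = 12, GB = 1, W = 2 * GB + 1 };   /* illustrative sizes (assumption) */

static int produce(int y) { return 3 * y + 1; }  /* hypothetical 1st filter */

/* Reference: two separate loops with a full-size intermediate array g[M]. */
void two_loops(int out[M]) {
    int g[M];
    memset(out, 0, M * sizeof out[0]);
    for (int y = 0; y < M; y++)
        g[y] = produce(y);
    for (int y = GB; y <= M - 1 - GB; y++) {
        int s = 0;
        for (int k = -GB; k <= GB; k++) s += g[y + k];
        out[y] = s;
    }
}

/* Folded and merged: the consumer is shifted by GB so that everything it
 * reads has already been produced in this or an earlier iteration. */
void folded_merged(int out[M]) {
    int line[W];                            /* circular buffer, 2GB+1 entries */
    memset(out, 0, M * sizeof out[0]);
    for (int y = 0; y < M + GB; y++) {
        if (y < M)
            line[y % W] = produce(y);       /* 1st filter, position y */
        int c = y - GB;                     /* 2nd filter works on position y-GB */
        if (c >= GB && c <= M - 1 - GB) {
            int s = 0;
            for (int k = -GB; k <= GB; k++) s += line[(c + k) % W];
            out[c] = s;
        }
    }
}

/* Returns 1 when the merged loop reproduces the two-loop result. */
int folding_preserves_result(void) {
    int a[M], b[M];
    two_loops(a);
    folded_merged(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

The consumer at position c reads positions c-GB to c+GB, that is y-2GB to y, which are exactly the W most recent entries of the circular buffer.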

Global loop merging/folding steps
1 x <-> y loop interchange (done)
2 Global y-loop folding/merging: 1st and 2nd nest (done)
3 Global y-loop folding/merging: 1st/2nd and 3rd nest
4 Global y-loop folding/merging: 1st/2nd/3rd and 4th nest
5 Global x-loop folding/merging: 1st and 2nd nest
6 Global x-loop folding/merging: 1st/2nd and 3rd nest
7 Global x-loop folding/merging: 1st/2nd/3rd and 4th nest

End result of global loop trafo
for (y=0; y<M+GB+2; ++y) {
  for (x=0; x<N+2; ++x) {
    …
    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute[x][y-GB][0] = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_xy_compute[x][y-GB][GB+k+1] =
          gauss_xy_compute[x][y-GB][GB+k] + gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
      gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1]/tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
    …

Data re-use & memory hierarchy
Introduce memory hierarchy
– reduce number of reads from main memory
– heavily accessed arrays stored in smaller memories
[Figure: processor data paths and register file, backed by main memory M (P = 1 per access) and smaller memories M' (P = 0.1) and M'' (P value lost in extraction).]
#A = 100
P (original) = # accesses x power/access = 100
P (after) = 100 x … + … x … + … x 1 = 3 (intermediate factors lost in extraction)
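The power estimate is a sum of access count times cost per access over all memory layers. The sketch below reproduces the slide's two totals (100 and 3) under assumed values: per-access costs of 1, 0.1 and 0.01 for M, M' and M'', and access counts of 1, 10 and 100. The intermediate numbers are missing from the transcript, so these are guesses chosen only because they are consistent with the surviving figures. Costs are kept in integer thousandths to avoid floating-point comparisons.

```c
#include <assert.h>

/* One memory layer: number of accesses and cost per access in 1/1000 units. */
typedef struct {
    long accesses;
    long cost_milli;
} Layer;

/* Total power estimate = sum over layers of (#accesses x cost per access). */
long total_cost_milli(const Layer *layers, int n) {
    assert(n > 0);
    long total = 0;
    for (int i = 0; i < n; i++)
        total += layers[i].accesses * layers[i].cost_milli;
    return total;
}

/* All 100 accesses go to main memory M at cost 1 each. */
long flat_cost_milli(void) {
    const Layer flat[] = { { 100, 1000 } };
    return total_cost_milli(flat, 1);
}

/* ASSUMED split: 1 access to M, 10 to M', 100 to M'' (not from the slides). */
long hierarchy_cost_milli(void) {
    const Layer h[] = {
        {   1, 1000 },   /* M   : cost 1    */
        {  10,  100 },   /* M'  : cost 0.1  */
        { 100,   10 },   /* M'' : cost 0.01 (assumed) */
    };
    return total_cost_milli(h, 3);
}
```

With these assumptions flat_cost_milli() gives 100000 (that is, 100) and hierarchy_cost_milli() gives 3000 (that is, 3).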

Data re-use
Data flow transformations to introduce extra copies of heavily accessed signals
– Step 1: figure out data re-use possibilities
– Step 2: calculate possible gain
– Step 3: decide on data assignment to memory hierarchy
int A[2][6];
for (h=0; h<N; h++)
  for (i=0; i<2; i++)
    for (j=0; j<3; j++)
      for (k=1; k<7; k++)
        B[j] = A[i][k];
[Figure: array index (6 * i + k) plotted against iterations.]

Data re-use (continued)
Data flow transformations to introduce extra copies of heavily accessed signals
– Step 1: figure out data re-use possibilities
– Step 2: calculate possible gain
– Step 3: decide on data assignment to memory hierarchy
[Figure: array index against iterations for frame 1, frame 2, frame 3, with copy candidates of size 6*2 and 6*1. Access counts towards the CPU: N*2*3*6; between copies: N*2*1*6 and 1*2*1*6.]
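The access counts in the figure can be reproduced by counting reads with and without an intermediate copy. In this sketch the small buffer row[] plays the role of the 6*1 copy candidate; the array contents, the zero-based k range and the helper names are assumptions for illustration.

```c
#include <assert.h>

enum { ROWS = 2, COLS = 6, J = 3 };

typedef struct {
    long main_reads;   /* reads from the large array A */
    long copy_reads;   /* reads from the small copy */
    int  last;         /* last value written to B, to compare results */
} Counts;

static int A[ROWS][COLS] = { { 1, 2, 3, 4, 5, 6 }, { 7, 8, 9, 10, 11, 12 } };

/* Original nest: every read goes to A, n*2*3*6 in total. */
Counts run_without_copy(int n) {
    assert(n >= 0);
    Counts c = { 0, 0, 0 };
    int B[J] = { 0 };
    for (int h = 0; h < n; h++)
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < J; j++)
                for (int k = 0; k < COLS; k++) { B[j] = A[i][k]; c.main_reads++; }
    c.last = B[J - 1];
    return c;
}

/* With a 6-element copy: A is read n*2*1*6 times, the copy n*2*3*6 times. */
Counts run_with_row_copy(int n) {
    assert(n >= 0);
    Counts c = { 0, 0, 0 };
    int B[J] = { 0 };
    int row[COLS];
    for (int h = 0; h < n; h++)
        for (int i = 0; i < ROWS; i++) {
            for (int k = 0; k < COLS; k++) { row[k] = A[i][k]; c.main_reads++; }
            for (int j = 0; j < J; j++)
                for (int k = 0; k < COLS; k++) { B[j] = row[k]; c.copy_reads++; }
        }
    c.last = B[J - 1];
    return c;
}
```

For n = 5 the reads of A drop from 180 to 60. Whether that is a net gain depends on how much cheaper an access to the copy is, which is what step 2 (calculate possible gain) evaluates.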

Data re-use tree
[Figure: copy-candidate tree between the CPU and the arrays image_in, gauss_x, gauss_xy/comp_edge and image_out. Candidate copy sizes: N*M, N*1, M*3, 3*1, 1*3, 3*3, 1*1. Access counts on the edges: N*M, N*M*3, N*M*8, 0.]

Memory hierarchy assignment
[Figure: the data re-use tree for image_in, gauss_x, gauss_xy, comp_edge and image_out mapped onto three layers: L3 = 1MB SDRAM, L2 = 16KB cache, L1 = 128 B RegFile. Copy sizes shown: N*M, M*3, 3*1, 3*3, 1*1. Access counts shown: N*M, N*M*3, N*M*8, 0.]

Data-reuse - cavity detection code
Code before reuse transformation
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      if (x<N && y<M)
        gauss_x_image[x][y] = 0;
    }
    /* Other merged code omitted … */
  }
}

Data-reuse - cavity detection code
Code after reuse transformation
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      for (k=0; k<1; ++k)
        in_pixels[(x+k)%3][y%1] = image_in[x+k][y];
    /* copy rest of in_pixel's in row */
    if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3][y%1] = image_in[x+1][y];
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3][y%1]*Gauss[abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
    …
  }
}
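A one-dimensional sketch of the same reuse idea: a 3-entry circular window holds the pixels needed by the 3-tap filter, so each input pixel is fetched from the large array once instead of three times. The row length and filter taps are assumptions, and foo() is left out for brevity.

```c
#include <assert.h>

enum { N = 10 };                         /* illustrative row length (assumption) */
static const int Gauss[2] = { 2, 1 };    /* illustrative filter taps (assumption) */

/* Before: each output reads its 3 neighbours directly. Returns #reads of in[]. */
long blur_direct(const int in[N], int out[N]) {
    long reads = 0;
    for (int x = 0; x < N; ++x) {
        out[x] = 0;
        if (x >= 1 && x <= N - 2)
            for (int k = -1; k <= 1; ++k) {
                out[x] += in[x + k] * Gauss[k < 0 ? -k : k];
                ++reads;
            }
    }
    return reads;
}

/* After: a 3-entry window indexed modulo 3. Returns #reads of in[]. */
long blur_window(const int in[N], int out[N]) {
    int win[3];
    long reads = 0;
    for (int x = 0; x < N; ++x) {
        if (x == 0) { win[0] = in[0]; ++reads; }                  /* first pixel */
        if (x <= N - 2) { win[(x + 1) % 3] = in[x + 1]; ++reads; } /* next pixel */
        out[x] = 0;
        if (x >= 1 && x <= N - 2)
            for (int k = -1; k <= 1; ++k)
                out[x] += win[(x + k) % 3] * Gauss[k < 0 ? -k : k];
    }
    return reads;
}

/* Returns 1 when both versions agree and the read counts are 3(N-2) and N. */
int window_reuse_agrees(void) {
    int in[N], a[N], b[N];
    for (int x = 0; x < N; ++x) in[x] = x * x + 1;
    long r_direct = blur_direct(in, a);
    long r_window = blur_window(in, b);
    for (int x = 0; x < N; ++x)
        if (a[x] != b[x]) return 0;
    return r_direct == 3L * (N - 2) && r_window == (long)N;
}
```

The modulo indexing is what later makes the address optimization step necessary, since `%` is expensive on many embedded targets.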

Data layout optimization
At this point multi-dimensional arrays are to be assigned to physical memories
Data layout optimization determines exactly where in each memory an array should be placed, to
– reduce memory size by "in-placing" arrays that do not overlap in time (disjoint lifetimes)
– avoid cache misses due to conflicts
– exploit spatial locality of the data in memory to improve performance of e.g. page-mode memory access sequences

In-place mapping
[Figure: addresses against time for arrays A, B, C, D and E, shown for inter in-place mapping (different arrays share addresses) and intra in-place mapping (within one array).]

In-place mapping
– Implements all the "anticipated" memory size savings obtained in previous steps
– Modifies code to introduce one array per "real" memory
– Changes indices to addresses in memory arrays
Before:
  b8 A[100][100];
  b6 B[20][20];
  for (i,j,k,l; …)
    B[i][j] = f(B[j][i], A[i+k][j+l]);
After:
  b8 mem1[10400];
  for (i,j,k,l; …)
    mem1[10000+i+20*j] = f(mem1[10000+j+20*i], b6(mem1[i+k+100*(j+l)]));
[Figure: memory map of mem1 with A starting at 0x0, B starting at 0x2710 and the end of the memory at 0x28a0.]
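The index-to-address rewrite can be expressed as two small address functions, which makes it easy to check that the arrays do not overlap inside the single memory array. The function and constant names below are hypothetical; the offsets and strides follow the expressions in the transformed code above.

```c
#include <assert.h>

enum {
    A_DIM    = 100,
    B_DIM    = 20,
    A_BASE   = 0,
    B_BASE   = A_DIM * A_DIM,            /* 10000 = 0x2710 */
    MEM_SIZE = B_BASE + B_DIM * B_DIM    /* 10400 = 0x28a0 */
};

static unsigned char mem1[MEM_SIZE];     /* one array per "real" memory */

/* Address of A[r][c] inside mem1, matching mem1[r + 100*c]. */
int addr_A(int r, int c) {
    assert(r >= 0 && r < A_DIM && c >= 0 && c < A_DIM);
    return A_BASE + r + A_DIM * c;
}

/* Address of B[i][j] inside mem1, matching mem1[10000 + i + 20*j]. */
int addr_B(int i, int j) {
    assert(i >= 0 && i < B_DIM && j >= 0 && j < B_DIM);
    return B_BASE + i + B_DIM * j;
}

/* Example accessors built on the address functions. */
void store_B(int i, int j, unsigned char v) { mem1[addr_B(i, j)] = v; }
unsigned char load_B(int i, int j) { return mem1[addr_B(i, j)]; }
```

The highest address used by A is 9999 and the lowest used by B is 10000, so the two arrays occupy disjoint ranges of mem1.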

In-place mapping
Input image is partly consumed by the time the first results for the output image are ready
[Figure: index against time for image_in and image_out, and address against time for the combined array Image.]

In-place - cavity detection code
Before:
  for (y=0; y<=M+3; ++y) {
    for (x=0; x<N+5; ++x) {
      image_out[x-5][y-3] = …;
      /* code removed */
      … = image_in[x+1][y];
    }
  }
After:
  for (y=0; y<=M+3; ++y) {
    for (x=0; x<N+5; ++x) {
      image[x-5][y-3] = …;
      /* code removed */
      … = image[x+1][y];
    }
  }
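A one-dimensional sketch of why input and output can share one array: if each result is written a fixed distance behind the read position, it only overwrites input that is no longer needed. The filter, the lag of 2 and the sizes are assumptions chosen for illustration; the slide uses offsets of 5 and 3 in two dimensions.

```c
#include <assert.h>
#include <string.h>

enum { N = 12, LAG = 2 };   /* illustrative size and write lag (assumption) */

/* Reference: separate input and output buffers. */
void filter_two_buffers(const int in[N], int out[N]) {
    memset(out, 0, N * sizeof out[0]);
    for (int c = LAG; c <= N - 2; ++c)
        out[c - LAG] = in[c - 1] + in[c] + in[c + 1];
}

/* In place: step c reads positions c-1..c+1 and writes position c-LAG.
 * Later steps only read positions >= c, so nothing live is overwritten. */
void filter_in_place(int image[N]) {
    for (int c = LAG; c <= N - 2; ++c)
        image[c - LAG] = image[c - 1] + image[c] + image[c + 1];
}

/* Returns 1 when the in-place results match the two-buffer results. */
int in_place_agrees(void) {
    int in[N], ref[N], image[N];
    for (int i = 0; i < N; ++i) in[i] = 5 * i + 2;
    memcpy(image, in, sizeof in);
    filter_two_buffers(in, ref);
    filter_in_place(image);
    /* Results occupy positions 0 .. N-2-LAG in both versions. */
    return memcmp(ref, image, (N - 1 - LAG) * sizeof ref[0]) == 0;
}
```

With a lag of 1 the write at position c-1 would destroy a value the next step still reads, so the lag has to cover the filter's backward reach.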

The last step: ADOPT (Address OPTimization)
Increased execution time introduced by DTSE
– Complicated address arithmetic (modulo!)
– Additional complex control flow
Multimedia platform not adapted to address calculations
Additional transformations needed to
– Simplify control flow
– Simplify address arithmetic: common sub-expression elimination, modulo expansion, …
– Match remaining expressions on target machine
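One typical address simplification is replacing a modulo and a multiplication inside a loop by a wrapping counter that is updated with an addition. The sketch below is a generic illustration of that idea, not the ADOPT tool's actual output; the buffer geometry is assumed.

```c
#include <assert.h>

enum { STEPS = 20, LINES = 3, WIDTH = 7 };  /* illustrative geometry (assumption) */

/* Straightforward form: one modulo and one multiply per iteration. */
void addresses_naive(int addr[STEPS]) {
    for (int y = 0; y < STEPS; ++y)
        addr[y] = (y % LINES) * WIDTH;
}

/* Optimized form: the line offset is an induction variable that is
 * advanced by WIDTH and wrapped with a compare instead of a modulo. */
void addresses_optimized(int addr[STEPS]) {
    int offset = 0;
    for (int y = 0; y < STEPS; ++y) {
        addr[y] = offset;
        offset += WIDTH;
        if (offset == LINES * WIDTH)
            offset = 0;
    }
}

/* Returns 1 when both forms generate the same address sequence. */
int address_forms_agree(void) {
    int a[STEPS], b[STEPS];
    addresses_naive(a);
    addresses_optimized(b);
    for (int y = 0; y < STEPS; ++y)
        if (a[y] != b[y]) return 0;
    return 1;
}
```

The compare-and-reset costs a branch per iteration, which is usually far cheaper than a division on processors without hardware modulo support.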

ADOPT principles
Behavioral description
Extract address expression code
Perform address expression splitting
Apply transformations:
 - Loop invariant code motion
 - Induction variable analysis
 - Algebraic transformations
Optimized behavioral description
Then either:
– Map to custom ACU, or
– Processor specific algebraic transformations, giving an optimized behavioral description for the target processor, then compile to the target processor

ADOPT principles
Example: full-search motion estimation. Algebraic transformations at word level.
Original code (some constants lost in extraction, marked …):
for (i=-8; i<=8; i++) {
  for (j=-4; j<=3; j++) {
    for (k=-4; k<=3; k++)
      A[((208+i)*257+8+j)* … i+k] = B[(8+j)* … i+k];
  }
  dist += A[3096] - B[((208+i)*257+4)* … i-4];
}
After common sub-expression extraction (index expressions and some constants lost in extraction):
for (i=-8; i<=8; i++) {
  for (j=-4; j<=3; j++) {
    for (k=-4; k<=3; k++) {
      Ad[ ] = A[ ];
    }
  }
  dist += A[ ]-Ad[ ];
}
cse1 = (33025*i …)*2; cse3 = 1040+i; cse4 = j* …; cse5 = k+cse4;
Remaining fragment: cse5+cse1 = cse5+cse… cse1

DMM – results for cavity detection on ASIC
[Figure: result chart; values not recoverable.]

Cavity detection on Pentium-MMX
[Figure: three charts: main memory accesses, local memory accesses, execution time (sec); values not recoverable.]

The Y-chart revisited
[Figure: Y-chart. Applications and Architecture Instance feed Mapping, followed by Performance Analysis and Performance Numbers. Labels: data transfer and data storage specific rewrites in the application code; data transfer and data storage specific platform customization.]

Fixing platform parameters
Assume configurable on-chip memory hierarchy
– Trade-off power versus cycle budget
[Figure: power [mW] against storage cycle budget; axis ticks at 50,000, 100,000, 150,…]

Conclusion
In multi-media applications, exploring data transfer and storage issues should be done at system level
DTSE is a methodology for Data Transfer and Storage Exploration based on manual and/or tool-assisted code rewriting
– Platform independent high-level transformations
– Platform dependent transformations exploit platform characteristics (optimal use of cache, …)
– Substantial reduction in power and memory size demonstrated on MPEG-4, OFDM, H.263, ADSL, ...