ASCI Winterschool on Embedded Systems, March 2004, Renesse
Data Memory Management
Henk Corporaal, Peter Knijnenburg

Slide 2: Data Memory Management Overview
- Motivation
- Example application
- DMM steps
- Results
Notes:
- We concentrate on static Data Memory Management.
- The Data Transfer and Storage Exploration (DTSE) methodology, on which these slides are based, was developed by IMEC, Leuven.

Slide 3: The underlying idea
[Block diagram: VLIW CPU with I$ and D$, video-in/out, audio-in/out, PCI bridge, serial I/O, timers, I2C I/O, all connected to off-chip SDRAM]

  for (i=0; i<n; i++)
    for (j=0; j<3; j++)
      for (k=1; k<7; k++)
        B[j] = A[i*4+k];

The highlighted statement B[j] = A[i*4+k] is both a data storage bottleneck (in SDRAM and D$) and a data transfer bottleneck (traffic between memory and the data path).

Slide 4: Platform architecture model
[Diagram: a chip with CPUs and HW accelerators at Level-1 (I-cache, D-cache, local memories), an L2 cache and local memory at Level-2 reached over on-chip busses and a bus interface, off-chip main memory at Level-3, and a disk at Level-4 behind a bridge on a SCSI bus]

Slide 5: Data transfer and storage power
Power(memory) / Power(arithmetic) = 33

Slide 6: Positioning in the Y-chart
[Y-chart diagram: Applications and an Architecture Instance feed into Mapping; Performance Analysis produces Performance Numbers, which feed back as data transfer and data storage specific rewrites in the application code]

Slide 7: Mapping
Given:
- architecture SuperDuperXYZ
- reference C code for the application, e.g. an MPEG-4 Motion Estimation kernel
Task:
- map the application onto this architecture
But … wait a moment:

  sdcc -o mpeg4_me mpeg4_me.c
  Thank you for running the SuperDuperXYZ compiler.
  Your program uses … bytes of memory, 78 Watt, … clock cycles.

Let's help the compiler.

Slide 8: Application example
Application domain:
- computer tomography (CT) in medical imaging
Algorithm:
- cavity detection in CT scans
- detect dark regions in successive images
- indicate cavities in the brain (bad news for the owner of the brain)

Slide 9: Application
Reference (conceptual) C code for the algorithm:
- all functions: image_in[N x M] at time t-1 -> image_out[N x M] at time t
- the new value of a pixel depends on its neighbours
- neighbour pixels are read from background memory
- approximately 110 lines of C code (ignoring file I/O etc.)
- experiments with N x M = 640 x 400 pixels
- straightforward implementation: 6 image buffers (a sketch follows below)
[Function pipeline diagram: Gauss Blur x, Gauss Blur y, Compute Edges, Reverse, Max Value, Detect Roots]
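To make the buffer structure concrete, here is a minimal sketch of such a straightforward implementation. It is not the original code: the function names, buffer names and exact stage order are assumptions based on the pipeline shown on this slide, and each stage is only declared, not implemented.

  #define N 640
  #define M 400

  /* one full N x M buffer per pipeline stage: 6 image buffers in total */
  static unsigned char image_in[N][M],  gauss_x[N][M], gauss_xy[N][M],
                       comp_edge[N][M], reverse[N][M], image_out[N][M];

  /* hypothetical per-stage functions, each reading one buffer and writing the next */
  void gauss_blur_x(unsigned char in[N][M], unsigned char out[N][M]);
  void gauss_blur_y(unsigned char in[N][M], unsigned char out[N][M]);
  void compute_edges(unsigned char in[N][M], unsigned char out[N][M]);
  void reverse_image(unsigned char in[N][M], unsigned char out[N][M]);
  void detect_roots(unsigned char in[N][M], unsigned char out[N][M]);

  void cavity_detect(void)
  {
      gauss_blur_x(image_in,   gauss_x);    /* horizontal Gaussian blur  */
      gauss_blur_y(gauss_x,    gauss_xy);   /* vertical Gaussian blur    */
      compute_edges(gauss_xy,  comp_edge);  /* edge detection            */
      reverse_image(comp_edge, reverse);    /* reverse w.r.t. max value  */
      detect_roots(reverse,    image_out);  /* mark cavity candidates    */
  }

In this form every stage streams a complete image through background memory, which is exactly the transfer and storage overhead that the following DMM steps remove.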

Slide 10: DMM principles
[Diagram: processor data paths, L1 cache, L2 cache built from banks with local latches, off-chip SDRAM]
- Reduce redundant transfers
- Introduce locality
- Exploit the memory hierarchy
- Exploit limited life-times
- Avoid N-port memories within real-time constraints

Slide 11: DMM steps
C-in
- Preprocessing
- Dataflow transformations
- Loop transformations
- Data reuse
- Memory hierarchy layer assignment
- Cycle budget distribution
- Memory allocation and assignment
- Data layout
- Address expression optimization
C-out

Slide 12: The DMM steps
Preprocessing
- rewrite the code in 3 layers (parts)
- selective inlining, single-assignment form, ... (a small example of the single-assignment rewrite follows below)
Data flow transformations
- eliminate redundant transfers and storage
Loop and control flow transformations
- improve regularity of accesses and data locality
Data re-use and memory hierarchy layer assignment
- determine when to move which data between memories, to meet the cycle budget of the application at low cost
- determine in which layer to put the arrays (and copies)
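As a hedged illustration (not taken from the slides) of the single-assignment rewrite mentioned above: every variable is written exactly once, which exposes the true data flow to the later transformation steps.

  int f(int a, int b, int c, int d)
  {
      /* before: x is reused, hiding which value each read actually consumes */
      int x = a + b;
      x = x * c;
      return x + d;
  }

  int f_sa(int a, int b, int c, int d)
  {
      /* after the single-assignment rewrite: one name per value */
      int x1 = a + b;
      int x2 = x1 * c;
      return x2 + d;
  }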

Slide 13: The DMM steps (continued)
Per memory layer:
Cycle budget distribution
- determine the memory access constraints for the given cycle budget
Memory allocation and assignment
- which memories to use, and where to put the arrays
Data layout
- determine how to combine arrays and place them in the memories
Address expression optimizations

Slide 14: Preprocessing: dividing an application into the 3 layers
Layer 1 - synchronisation: testbench call, dynamic event behaviour, mode selection (e.g. Module1a, Module1b, Module2, Module3).
Layer 2 - loop nests and data accesses:

  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      if (i == 0)
        B[i][j] = 1;
      else
        B[i][j] = func1(A[i][j], A[i-1][j]);

Layer 3 - arithmetic and data-dependent operations:

  int func1(int a, int b) { return a*b; }

Slide 15: Layered code structure

  main() { /* Layer 1 code */
    read_image(IN_NAME, image_in);
    cav_detect();
    write_image(image_out);
  }

  void cav_detect() { /* Layer 2 code */
    for (x=GB; x<=N-1-GB; ++x) {
      for (y=GB; y<=M-1-GB; ++y) {
        gauss_x_tmp = 0;
        for (k=-GB; k<=GB; ++k) {
          gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)];
        }
        gauss_x_image[x][y] = foo(gauss_x_tmp);
      }
    }
  }

Slide 16: Layered code structure (continued)

  int foo(int arg1) { /* Layer 3 */
    /* arithmetic, data-dependent operations,
     * to be mapped to data-path and controller */
  }

  void cav_detect() { /* Layer 2 code */
    for (x=GB; x<=N-1-GB; ++x) {
      for (y=GB; y<=M-1-GB; ++y) {
        gauss_x_tmp = 0;
        for (k=-GB; k<=GB; ++k) {
          gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)];
        }
        gauss_x_image[x][y] = foo(gauss_x_tmp);
      }
    }
  }

Layer 2 makes the code for data access and data transfer explicit.

Slide 17: Data-flow trafo - cavity detection
[Figure: an N x M array is first fully initialised, then its inner (N-2) x (M-2) region is overwritten]

  for (x=0; x<N; ++x)
    for (y=0; y<M; ++y)
      gauss_x_image[x][y] = 0;

  for (x=1; x<=N-2; ++x) {
    for (y=1; y<=M-2; ++y) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k) {
        gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }

#accesses: N * M + (N-2) * (M-2)

Slide 18: Data-flow trafo - cavity detection (after)
[Figure: the same N x M array, now written in a single pass]

  for (x=0; x<N; ++x)
    for (y=0; y<M; ++y)
      if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
        gauss_x_tmp = 0;
        for (k=-1; k<=1; ++k) {
          gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
        }
        gauss_x_image[x][y] = foo(gauss_x_tmp);
      } else {
        gauss_x_image[x][y] = 0;
      }

#accesses: N * M; the gain is roughly 50%

Slide 19: Data-flow transformation
In total 5 types of data-flow transformations:
- advanced signal substitution and propagation (a small example follows below)
- algebraic transformations (associativity etc.)
- shifting "delay lines"
- re-computation
- transformations to eliminate bottlenecks for subsequent loop transformations
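As a hedged, generic illustration of signal substitution and propagation (this example is not from the cavity detector): an intermediate array that only buffers values between two loops can be substituted away, removing its storage and transfers.

  void produce(const int *b, const int *c, const int *d, int *out, int n)
  {
      int A[1024];                       /* intermediate buffer; n <= 1024 assumed */
      for (int i = 0; i < n; i++) A[i] = b[i] * c[i];
      for (int i = 0; i < n; i++) out[i] = A[i] + d[i];
  }

  void produce_substituted(const int *b, const int *c, const int *d, int *out, int n)
  {
      /* after signal substitution and propagation: the producer expression is
       * propagated to its single use; buffer A and its 2*n transfers disappear */
      for (int i = 0; i < n; i++) out[i] = b[i] * c[i] + d[i];
  }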

Slide 20: Data-flow transformation - result

Slide 21: Loop transformations
- improve regularity of accesses
- improve temporal locality: production -> consumption
Expected influence:
- reduce temporary storage and (anticipated) background storage

Before (storage size N):

  for (j=1; j<=M; j++)
    for (i=1; i<=N; i++)
      A[i] = foo(A[i]);
  for (i=1; i<=N; i++)
    out[i] = A[i];

After (storage size 1):

  for (i=1; i<=N; i++) {
    for (j=1; j<=M; j++) {
      A[i] = foo(A[i]);
    }
    out[i] = A[i];
  }

Slide 22: Data enters the Cavity Detector row-wise
[Diagram: scan device -> buffer (serial scan) -> Cavity Detector; the GaussBlur loop reads image_in]

Slide 23: Loop trafo - cavity detection
[Diagram: scanner -> N x M buffer -> Gauss Blur x -> N x M buffer]
X-Y loop interchange: from a double buffer to a single buffer.

Slide 24: Loop trafo - cavity detection
[Diagram: Gauss Blur x -> N x (2GB+1) buffer -> Gauss Blur y -> N x 3 buffer -> Compute Edges]
Repeated fold and loop merge:
- from N x M to N x (2GB+1) buffer size
- from N x M to N x 3 buffer size (offset arrays)

Slide 25: Improve regularity and locality -> loop merging
Before:

  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* 1st filtering code */
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* 2nd filtering code */

Desired merged form:

  for (y=0; y<M; y++) {
    for (x=0; x<N; x++)
      /* 1st filtering code */
    for (x=0; x<N; x++)
      /* 2nd filtering code */
  }

!! Impossible due to dependencies!

Slide 26: Data dependencies between the 1st and 2nd loop

  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      … gauss_x_image[x][y] = …

  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      … for (k=-GB; k<=GB; k++)
        … = … gauss_x_image[x][y+k] …

Slide 27: Enable merging with loop folding (bumping)

  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      … gauss_x_image[x][y] = …

  for (y=0+GB; y<M+GB; y++)
    for (x=0; x<N; x++)
      … y-GB …
      for (k=-GB; k<=GB; k++)
        … gauss_x_image[x][y+k-GB] …

Slide 28: Y-loop merging on the 1st and 2nd loop nest

  for (y=0; y<M+GB; y++) {
    if (y<M)
      for (x=0; x<N; x++)
        … gauss_x_image[x][y] = …
    if (y>=GB)
      for (x=0; x<N; x++)
        if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
          … for (k=-GB; k<=GB; k++)
            … gauss_x_image[x][y-GB+k] …
        else
          …
  }

Slide 29: Simplify conditions in the merged loop nest

  for (y=0; y<M+GB; y++) {
    for (x=0; x<N; x++)
      if (y<M)
        … gauss_x_image[x][y] = …
    for (x=0; x<N; x++)
      if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
        … for (k=-GB; k<=GB; k++)
          … gauss_x_image[x][y-GB+k] …
      else if (y>=GB)
        …
  }

Slide 30: End result of the global loop trafo

  for (y=0; y<M+GB+2; ++y) {
    for (x=0; x<N+2; ++x) {
      …
      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
        gauss_xy_compute[x][y-GB][0] = 0;
        for (k=-GB; k<=GB; ++k)
          gauss_xy_compute[x][y-GB][GB+k+1] =
            gauss_xy_compute[x][y-GB][GB+k] +
            gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
        gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1]/tot;
      } else if (x<N && (y-GB)>=0 && (y-GB)<M)
        gauss_xy_image[x][y-GB] = 0;
      …
    }
  }

Slide 31: Loop transformations - result

Slide 32: Data re-use & memory hierarchy
Introduce a memory hierarchy:
- reduce the number of reads from main memory
- store heavily accessed arrays in smaller memories

[Diagram: processor data paths and register file; small memory M'' (P = 0.01 per access), intermediate memory M' (P = 0.1), main memory M (P = 1)]

#A = 100
P(original) = #accesses x power/access = 100 x 1 = 100
P(after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3

Slide 33: Data re-use
Data flow transformations introduce extra copies of heavily accessed signals:
- Step 1: figure out the data re-use possibilities
- Step 2: calculate the possible gain
- Step 3: decide on the data assignment to the memory hierarchy

  int A[2][6];
  for (h=0; h<N; h++)
    for (i=0; i<2; i++)
      for (j=0; j<3; j++)
        for (k=1; k<7; k++)
          B[j] = A[i][k];

[Plot: array index (6*i + k) versus iterations]

Slide 34: Data re-use (continued)
Data flow transformations introduce extra copies of heavily accessed signals:
- Step 1: figure out the data re-use possibilities
- Step 2: calculate the possible gain
- Step 3: decide on the data assignment to the memory hierarchy

[Re-use diagram for the example above: the index trace is split into frames (frame1, frame2, frame3) of size 6*1 and 6*2; access counts: N*2*3*6 at the CPU, N*2*1*6 for a copy refreshed every (h,i) iteration, 1*2*1*6 for a copy made only once]
A sketch of the copy-introducing transformation follows below.
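The following is a hedged sketch (not the slides' code) of how such a copy is introduced for the example of slide 33: a small buffer A_copy, intended for a lower memory layer, is filled once per (h,i) iteration and then serves all CPU reads.

  int A[2][7];                 /* assumption: indices 1..6 are used, as on the slide */
  int B[3];

  void reuse_copy(int n)       /* n corresponds to N in the slide's loop nest        */
  {
      int A_copy[7];           /* small copy, candidate for a register-file layer    */
      for (int h = 0; h < n; h++)
          for (int i = 0; i < 2; i++) {
              for (int k = 1; k < 7; k++)
                  A_copy[k] = A[i][k];      /* N*2*1*6 transfers from the higher layer */
              for (int j = 0; j < 3; j++)
                  for (int k = 1; k < 7; k++)
                      B[j] = A_copy[k];     /* N*2*3*6 accesses now hit the small copy */
          }
  }

Interchanging the h and i loops would let the copy be made only once per row of A, which is the 1*2*1*6 case in the re-use diagram.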

Slide 35: Data re-use tree
[Diagram: data re-use trees for image_in, gauss_x, gauss_xy/comp_edge and image_out, annotated with copy sizes (N*M, N*1, 3*1, M*3, 1*3, 3*3, 1*1) and with access counts (N*M, N*M*3, N*M*8) towards the CPU]

Slide 36: Memory hierarchy assignment
[Diagram: three layers, L3 = 1 MB SDRAM, L2 = 16 KB cache, L1 = 128 B register file. The full N*M arrays (image_in, image_out) live in L3, the M*3 line buffers (gauss_x, gauss_xy, comp_edge) in L2, and the small 3*1, 3*3 and 1*1 copies in L1; the annotations give per-layer access counts up to N*M*3 and N*M*8]

Slide 37: Data re-use - cavity detection code
Code before the re-use transformation:

  for (y=0; y<M+3; ++y) {
    for (x=0; x<N+2; ++x) {
      if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
        gauss_x_tmp = 0;
        for (k=-1; k<=1; ++k)
          gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
        gauss_x_image[x][y] = foo(gauss_x_tmp);
      } else {
        if (x<N && y<M) gauss_x_lines[x][y] = 0;
      }
      /* Other merged code omitted ... */
    }
  }

Slide 38: Data re-use - cavity detection code
Code after the re-use transformation:

  for (y=0; y<M+3; ++y) {
    for (x=0; x<N+2; ++x) {
      /* first in_pixel initialized */
      if (x==0 && y>=1 && y<=M-2)
        for (k=0; k<1; ++k)
          in_pixels[(x+k)%3][y%1] = image_in[x+k][y];
      /* copy the rest of the in_pixels in the row */
      if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
        in_pixels[(x+1)%3][y%1] = image_in[x+1][y];
      if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
        gauss_x_tmp = 0;
        for (k=-1; k<=1; ++k)
          gauss_x_tmp += in_pixels[(x+k)%3][y%1]*Gauss[abs(k)];
        gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
      } else if (x<N && y<M)
        gauss_x_lines[x][y%3] = 0;
    }
  }

Slide 39: Data reuse & memory hierarchy - result

Slide 40: Data layout optimization
At this point the multi-dimensional arrays have been assigned to physical memories. Data layout optimization determines exactly where in each memory an array should be placed, in order to:
- reduce memory size by "in-placing" arrays that do not overlap in time (disjoint lifetimes)
- avoid cache misses due to conflicts (see the padding sketch below)
- exploit spatial locality of the data in memory, to improve the performance of e.g. page-mode memory access sequences
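As a hedged, generic illustration of the cache-conflict point (not from the slides): when two arrays that are accessed together map onto the same cache sets, inserting padding between them in the data layout shifts one of them to different sets. The cache parameters below are assumptions made only for this sketch.

  #define LEN 1024                      /* 1024 floats = 4 KB per array              */

  /* assume a 4 KB direct-mapped cache with 64-byte lines: without the padding,
   * a[i] and b[i] map to the same set and evict each other on every iteration */
  static struct {
      float a[LEN];
      float pad[16];                    /* one 64-byte cache line of padding          */
      float b[LEN];                     /* b now starts one line further in the cache */
  } layout;

  float dot(void)
  {
      float s = 0.0f;
      for (int i = 0; i < LEN; i++)
          s += layout.a[i] * layout.b[i];   /* a[i] and b[i] now hit different sets */
      return s;
  }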

Slide 41: In-place mapping
[Diagram: address space versus time for arrays A, B, C, D, E. Inter-array in-place mapping lets an array reuse the addresses of another array whose lifetime has ended; intra-array in-place mapping folds an array onto itself so that dead elements are overwritten by new ones]
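A minimal sketch of inter-array in-place mapping, assuming (for illustration only) that two of the cavity-detector buffers have disjoint lifetimes: in C the sharing can be made explicit with a union, which is what an automatic data-layout step effectively does at the address level.

  #define N 640
  #define M 400

  static union {
      unsigned char gauss_x[N][M];     /* live only while the blur stages run       */
      unsigned char comp_edge[N][M];   /* live only after gauss_x is dead           */
  } inplace;                           /* both arrays occupy the same addresses     */

  void use_inplace(void)
  {
      inplace.gauss_x[0][0] = 1;       /* phase 1: blur results                     */
      /* ... once the blur output has been consumed ...                             */
      inplace.comp_edge[0][0] = 2;     /* phase 2: same storage, different array    */
  }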

Slide 42: In-place mapping - results

Slide 43: The last step: ADOPT (Address OPTimization)
DMM increases execution time through:
- complicated address arithmetic (modulo!)
- additional complex control flow
Additional transformations are needed to:
- simplify the control flow
- simplify the address arithmetic: common sub-expression elimination, modulo expansion, ...
- match the remaining expressions to the target machine

Slide 44: ADOPT example
From full-search motion estimation; algebraic transformations at the word level.

Original code, with complicated address arithmetic:

  for (i=-8; i<=8; i++) {
    for (j=-4; j<=3; j++) {
      for (k=-4; k<=3; k++)
        A[((208+i)*257+8+j)* i+k] = B[(8+j)* i+k];
    }
    dist += A[3096] - B[((208+i)*257+4)* i-4];
  }

After common sub-expression elimination, the addresses are built from shared terms:

  cse1 = (33025*i )*2;
  cse3 = 1040+i;
  cse4 = j* ;
  cse5 = k+cse4;

  for (i=-8; i<=8; i++) {
    for (j=-4; j<=3; j++) {
      for (k=-4; k<=3; k++) {
        Ad[ ] = A[ ];
      }
    }
    dist += A[ ] - Ad[ ];
  }
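As a hedged, generic illustration of the kind of rewriting ADOPT performs (this is not the slides' motion-estimation code): a shared index expression becomes an induction variable, and the modulo operation is replaced by a cheap counter update, one common reading of the "modulo expansion" mentioned on slide 43.

  /* before: the linearised index x*cols + y and the modulo are recomputed
   * for every access */
  void copy_mod3(const int *in, int *buf3, int rows, int cols)
  {
      for (int x = 0; x < rows; x++)
          for (int y = 0; y < cols; y++)
              buf3[(x*cols + y) % 3] = in[x*cols + y];
  }

  /* after: common sub-expression elimination plus modulo expansion */
  void copy_mod3_opt(const int *in, int *buf3, int rows, int cols)
  {
      int idx = 0, mod3 = 0;                    /* induction variables       */
      for (int x = 0; x < rows; x++)
          for (int y = 0; y < cols; y++) {
              buf3[mod3] = in[idx];
              idx++;                            /* replaces x*cols + y       */
              mod3 = (mod3 == 2) ? 0 : mod3 + 1;   /* replaces % 3           */
          }
  }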

Slide 45: Address optimization - result

Slide 46: Conclusion
- In embedded applications, data transfer and storage issues should be explored at the system level.
- DTSE is a methodology for Data Transfer and Storage Exploration, based on manual and/or tool-assisted code rewriting:
  - platform-independent high-level transformations
  - platform-dependent transformations that exploit platform characteristics (efficient use of caches and local memories)
- Substantial reductions in power and memory size have been demonstrated on MPEG-4, OFDM, H.263, ADSL, ...