Embedded Computer Architecture 5KK73 TU/e Henk Corporaal Data Management Part c: SCBD, MAA, and Data Layout.



Part 3 overview
Recap on design flow
Platform dependent steps
  – SCBD: Storage Cycle Budget Distribution
  – MAA: Memory Allocation and Assignment
  – Data layout techniques for RAM
  – Data layout techniques for Caches
Results
Conclusions
Thanks to the IMEC DTSE people

Design flow recap (figure). Labels: Concurrent OO spec; Remove OO overhead; Dynamic memory mgmt; Task concurrency mgmt; Physical memory mgmt; Address optimization; SW/HW co-design; SW design flow; HW design flow

DM steps
C-in
  – Preprocessing
  – Dataflow transformations
  – Loop transformations
  – Data reuse
  – Memory hierarchy layer assignment
  – Cycle budget distribution
  – Memory allocation and assignment
  – Data layout
C-out
Address optimization
(The slide marks the steps treated in this part with "Today".)

Result of memory hierarchy assignment for cavity detection (figure). Layers L2, L1, L0: 1 MB SDRAM, 16 KB cache, 128 B register file. Arrays and copies: image_in, gauss_x, gauss_xy, comp_edge, image_out, with copy sizes such as N*M, M*3, 3*3, 3*1 and 1*1.

Data reuse: cavity detection code after the reuse transformation (partly)

    for (y=0; y<M+3; ++y) {
      for (x=0; x<N+2; ++x) {
        /* first in_pixels initialized */
        if (x==0 && y>=1 && y<=M-2)
          in_pixels[x%3] = image_in[x][y];
        /* copy rest of in_pixels in row */
        if (x>=0 && x<… && y>=1 && y<=M-2)
          in_pixels[(x+1)%3] = image_in[x+1][y];
        if (x>=1 && x<… && y>=1 && y<=M-2) {
          gauss_x_tmp = 0;
          for (k=-1; k<=1; ++k)  // 3x1 filter
            gauss_x_tmp += in_pixels[(x+k)%3] * Gauss[Abs(k)];
          gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
        } else if (x<N && y<M)
          gauss_x_lines[x][y%3] = 0;
        …

(… marks the upper bounds on x, which are missing from the transcript, and the remainder of the loop body.)

Storage Cycle Budget Distribution & Memory Allocation and Assignment

Goal: define the memory organization that provides enough bandwidth at minimal cost

Balancing memory bandwidth
Reduce the maximum number of loads/stores per cycle.
(Figure: required memory bandwidth over time, once with a high peak and once with a low peak.)
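As a rough illustration of what balancing means here, the sketch below computes the peak number of accesses per cycle for two schedules of the same 12 accesses. The schedules and the names `peak_accesses`, `total_accesses`, `unbalanced` and `balanced` are hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Peak number of memory accesses issued in any single cycle.
 * accesses_per_cycle[t] = number of loads/stores scheduled in cycle t. */
int peak_accesses(const int *accesses_per_cycle, size_t n_cycles)
{
    assert(accesses_per_cycle != NULL);
    int peak = 0;
    for (size_t t = 0; t < n_cycles; ++t)
        if (accesses_per_cycle[t] > peak)
            peak = accesses_per_cycle[t];
    return peak;
}

/* Total number of accesses in the schedule (unchanged by balancing). */
int total_accesses(const int *accesses_per_cycle, size_t n_cycles)
{
    assert(accesses_per_cycle != NULL);
    int total = 0;
    for (size_t t = 0; t < n_cycles; ++t)
        total += accesses_per_cycle[t];
    return total;
}

/* Hypothetical schedules: the same 12 accesses spread over 6 cycles. */
const int unbalanced[6] = {4, 0, 1, 4, 0, 3};
const int balanced[6]   = {2, 2, 2, 2, 2, 2};
```

Both schedules perform 12 accesses, but the unbalanced one needs bandwidth for 4 accesses in its worst cycle, while the balanced one needs only 2.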

Data management approach (figure: one of the many possible schedules)
Idea: find a schedule which
  – fits in the number of cycles (= budget)
  – reduces the number of ports
  – avoids multi-ported memories

Data management approach; details

Conflict cost calculation
Key issues:
  – Number of conflicts
  – Self conflicts
  – Chromatic number, estimated by the size of the maximum clique (which is a lower bound on it)

Self conflict => dual-port memory. Reschedule.

Chromatic number => minimum number of single-port memories. Reschedule.

Lower number of conflicts => larger assignment freedom. Reschedule.
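The three conflict measures can be made concrete with a small sketch. It builds a conflict graph from a schedule, counts self conflicts, and colors the graph greedily. Greedy coloring only gives an upper bound on the chromatic number, and the schedules `tight` and `relaxed`, the types and the function names are all hypothetical; the slides do not show the tool's actual algorithm.

```c
#include <assert.h>
#include <string.h>

#define N_ARRAYS 4        /* arrays A, B, C, D mapped to indices 0..3 */
#define MAX_PER_CYCLE 3

typedef struct {
    int n;                      /* number of accesses in this cycle */
    int array[MAX_PER_CYCLE];   /* array touched by each access */
} Cycle;

typedef struct {
    int conflict[N_ARRAYS][N_ARRAYS]; /* 1 if accessed in the same cycle */
    int self_conflicts;               /* same-cycle access pairs to one array */
} ConflictGraph;

ConflictGraph build_conflicts(const Cycle *sched, int n_cycles)
{
    ConflictGraph g;
    memset(&g, 0, sizeof g);
    for (int t = 0; t < n_cycles; ++t) {
        assert(sched[t].n <= MAX_PER_CYCLE);
        for (int p = 0; p < sched[t].n; ++p)
            for (int q = p + 1; q < sched[t].n; ++q) {
                int a = sched[t].array[p], b = sched[t].array[q];
                if (a == b)
                    g.self_conflicts++;   /* would need a dual-port memory */
                else
                    g.conflict[a][b] = g.conflict[b][a] = 1;
            }
    }
    return g;
}

/* Greedy coloring; each color stands for one single-port memory. */
int greedy_memories(const ConflictGraph *g)
{
    int color[N_ARRAYS];
    int n_colors = 0;
    for (int v = 0; v < N_ARRAYS; ++v) {
        int used[N_ARRAYS] = {0};
        for (int u = 0; u < v; ++u)
            if (g->conflict[v][u])
                used[color[u]] = 1;
        int c = 0;
        while (used[c])
            ++c;                 /* smallest color not used by a neighbor */
        color[v] = c;
        if (c + 1 > n_colors)
            n_colors = c + 1;
    }
    return n_colors;
}

/* Hypothetical schedules of the same 8 accesses. */
const Cycle tight[4] = {          /* budget: 4 cycles */
    {2, {0, 1}}, {2, {1, 2}}, {2, {0, 2}}, {2, {2, 3}},
};
const Cycle relaxed[6] = {        /* budget: 6 cycles */
    {2, {0, 1}}, {1, {2}}, {1, {1}}, {1, {0}}, {2, {2, 3}}, {1, {2}},
};
```

With these example schedules, the 4-cycle version contains the clique A, B, C and needs three single-port memories, while the 6-cycle version has only the conflicts A-B and C-D and gets by with two.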

Conflict Directed Ordering is used to find a good schedule
  – Reduce intervals until all conflicts are known
  – Driven by cost of conflicts
  – Constructive algorithm
(Figure: accesses such as R(A), W(A), R(B), W(B), R(C), W(C), R(D), W(D) placed in time slots.)

Local optimization is not good for global optimization

Budget distribution has large impact on memory cost

Decreasing basic block length until target cycle budget is met

What's the effect of merging loops? More scheduling freedom! Reschedule.

Memory allocation and assignment

Memory Allocation and Assignment substeps
  – Memory allocation: select number and type of memories
  – Array-to-memory assignment
  – Port assignment
  – Bus sharing

Influence of MAA (figure)
Input per array: bit width, address range
Decisions: nr. of memories, nr. of ports, assignment of arrays to memories, memory interconnect
Objective: minimize power & area
Each memory (MEMORY-1 holding e.g. A and B, …, MEMORY-N holding e.g. K and L) is characterized by: bit width (maximum), size, nr. of ports (R/W/RW)

Example of bus sharing possibilities (figure)
Given schedule, as simultaneous access pairs: R(A) R(B); R(B) W(A); W(C) R(A); R(A) W(B); W(A) W(B); W(A) W(C)
The figure shows three alternative ways of connecting arrays A, B and C in memories m1, m2, m3 to shared buses.

Decreasing cycle budget limits freedom and raises cost

Example: resulting Pareto curve for DAB synchro application (figure; axis: energy cost)

Example conflict graph for cavity detection

MAA result (figure): power and on-chip area

Data layout: how to put data into memory

Memory data layout for custom and cache architectures (figure: arrays A, B, C in MEM1 and F, G, H in MEM2; copies A', B' in the CACHE next to the PE; the question is where each array is placed)

Intra-array in-place mapping reduces size of one array

    for (i=1; i<5; i++)
      for (j=0; j<5; j++)
        a[i][j] = f(a[i-1][j]);

Window = maximum number of live elements of a over time (figure axes: memory addresses vs. time).
This number depends on the layout! Compare e.g. row-major and column-major ordering.

Two-phase mapping of array elements onto addresses
  – Storage order: array domains (A, B, C) => abstract addresses $a_A$, $a_B$, $a_C$
  – Allocation: abstract addresses => real addresses $a$

Exploration of storage orders for a 2-dimensional array: 8 options
The variable domain has indices $a_1$ and $a_2$ (a 2x3 array, as the formulas imply); the memory address $a$ can be:
  – $a = 3a_1 + a_2$
  – $a = 3(1-a_1) + a_2$
  – $a = 3a_1 + (2-a_2)$
  – $a = 3(1-a_1) + (2-a_2)$
  – $a = 2a_2 + a_1$
  – $a = 2a_2 + (1-a_1)$
  – $a = 2(2-a_2) + a_1$
  – $a = 2(2-a_2) + (1-a_1)$
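The eight options are all combinations of three binary choices: which dimension is the major one, and whether each index is traversed forwards or backwards. A minimal sketch, assuming a 2x3 array and a hypothetical 3-bit encoding of the choice (`storage_address` and `is_bijection` are illustrative names):

```c
#include <assert.h>

#define D1 2   /* a1 ranges over 0..1 */
#define D2 3   /* a2 ranges over 0..2 */

/* order bit 0: reverse a1; bit 1: reverse a2; bit 2: a2 is the major index */
int storage_address(int order, int a1, int a2)
{
    assert(order >= 0 && order < 8);
    assert(a1 >= 0 && a1 < D1 && a2 >= 0 && a2 < D2);
    int i1 = (order & 1) ? (D1 - 1 - a1) : a1;   /* a1 or (1 - a1) */
    int i2 = (order & 2) ? (D2 - 1 - a2) : a2;   /* a2 or (2 - a2) */
    return (order & 4) ? D1 * i2 + i1            /* a = 2*a2' + a1' */
                       : D2 * i1 + i2;           /* a = 3*a1' + a2' */
}

/* Returns 1 if the order maps the 6 elements one-to-one onto 0..5. */
int is_bijection(int order)
{
    int seen[D1 * D2] = {0};
    for (int a1 = 0; a1 < D1; ++a1)
        for (int a2 = 0; a2 < D2; ++a2) {
            int a = storage_address(order, a1, a2);
            if (a < 0 || a >= D1 * D2 || seen[a])
                return 0;
            seen[a] = 1;
        }
    return 1;
}
```

Every one of the eight orders is a valid layout in the sense that no two elements share an address; they differ only in which elements end up adjacent.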

Chosen storage order determines window size

    for (i=1; i<5; i++)
      for (j=0; j<5; j++)
        a[i][j] = f(a[i-1][j]);

Row-major ordering: a = 5i+j

    for (i=1; i<5; i++)
      for (j=0; j<5; j++)
        a[5*i+j] = f(a[5*i+j-5]);

Highest live address: 5*i+j
Lowest live address: 5*i+j-5
Difference + 1 = window: 6

Column-major ordering: a = 5j+i

    for (i=1; i<5; i++)
      for (j=0; j<5; j++)
        a[5*j+i] = f(a[5*j+i-1]);

Highest live address: 5*4+i-1
Lowest live address: 5*0+i-1
Difference + 1 = window: 21
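The two window sizes can be checked by brute force. The sketch below walks the loop nest and, at each iteration, takes the span between the lowest and highest live address. It assumes that rows older than i-1 are dead and that both the element being read and the element being written count as live, which matches the "difference + 1" rule above; `window_size`, `row_major` and `col_major` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 5
#define COLS 5

typedef int (*StorageOrder)(int i, int j);

int row_major(int i, int j) { return COLS * i + j; }   /* a = 5i + j */
int col_major(int i, int j) { return ROWS * j + i; }   /* a = 5j + i */

/* Largest (highest live address - lowest live address + 1) over the loop. */
int window_size(StorageOrder addr)
{
    assert(addr != NULL);
    int window = 0;
    for (int i = 1; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j) {
            int lo = addr(i, j), hi = lo;
            /* row i, columns 0..j: already written */
            for (int jj = 0; jj <= j; ++jj) {
                int a = addr(i, jj);
                if (a < lo) lo = a;
                if (a > hi) hi = a;
            }
            /* row i-1, columns j..COLS-1: being read or still to be read */
            for (int jj = j; jj < COLS; ++jj) {
                int a = addr(i - 1, jj);
                if (a < lo) lo = a;
                if (a > hi) hi = a;
            }
            if (hi - lo + 1 > window)
                window = hi - lo + 1;
        }
    return window;
}
```

Under these assumptions `window_size(row_major)` is 6 and `window_size(col_major)` is 21, the same values as derived by hand.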

Static allocation: no in-place mapping (figure: arrays A to E each keep their own address range $a_A$ … $a_E$ for the whole time; axes: memory size vs. time)

Windowed allocation: intra-array in-place mapping (figure: "static, windowed" and "dynamic, windowed" placements of A to E; each array occupies only its window, e.g. $W_A$)

Dynamic allocation: inter-array in-place mapping (figure: arrays A to E share address ranges when their lifetimes do not overlap)

Dynamic allocation strategy with common window (figure: "dynamic, common window" placement of A to E)
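To see why sharing addresses between arrays pays off, compare the memory needed when every array keeps its own range with the peak amount of data that is live at the same time. The peak is a lower bound on what any inter-array in-place mapping can reach. The lifetimes and sizes below are made-up numbers, and `Lifetime`, `static_size` and `peak_live_size` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int size;    /* number of words */
    int first;   /* first time step in which the array is live */
    int last;    /* last time step in which the array is live */
} Lifetime;

/* Static allocation without in-place mapping: every array has its own range. */
int static_size(const Lifetime *arr, int n)
{
    assert(arr != NULL);
    int total = 0;
    for (int k = 0; k < n; ++k)
        total += arr[k].size;
    return total;
}

/* Peak of the summed sizes of all arrays live in the same time step. */
int peak_live_size(const Lifetime *arr, int n, int horizon)
{
    assert(arr != NULL);
    int peak = 0;
    for (int t = 0; t < horizon; ++t) {
        int live = 0;
        for (int k = 0; k < n; ++k)
            if (arr[k].first <= t && t <= arr[k].last)
                live += arr[k].size;
        if (live > peak)
            peak = live;
    }
    return peak;
}

/* Hypothetical arrays A..E over 11 time steps. */
const Lifetime arrays[5] = {
    {100, 0, 3}, {60, 2, 5}, {80, 4, 7}, {40, 6, 9}, {100, 8, 10},
};
```

For this example, static allocation needs 380 words, while at most 160 words are live at once (A and B during steps 2 and 3).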

Expressing memory data layout in source code
Example: array of 10x20 elements
  – A: offset 120, no window
  – B: storage order [20, 2], offset 134, window 78

Before:

    bit8 B[10][20];
    bit6 A[30];
    for (x=0; x<10; ++x)
      for (y=0; y<20; ++y) {
        … = A[3*x-y];
        B[x][y] = …;
      }

After:

    bit8 memory[334];
    bit8* B = (bit8*)&memory[134];
    bit6* A = (bit6*)&memory[120];
    for (x=0; x<10; ++x)
      for (y=0; y<20; ++y) {
        … = A[3*x-y];
        B[(x*20+y*2)%78] = …;
      }

Example of memory data layout for storage size reduction

    int x[W], y[W];
    for (i1=0; i1 < W; i1++)
      x[i1] = getInput();
    for (i2=0; i2 < W; i2++) {
      sum = 0;
      for (di2=-N; di2 <= N; di2++) {
        sum += c[N+di2] * x[wrap(i2+di2,W)];
      }
      y[i2] = sum;
    }
    for (i3=0; i3 < W; i3++)
      putOutput(y[i3]);

Occupied address-time (OAT) domain of x[] and y[] (figure)

Optimized source code after memory data layout

    int mem1[N+W];
    for (i1=0; i1 < W; i1++)
      mem1[N+i1] = getInput();
    for (i2=0; i2 < W; i2++) {
      sum = 0;
      for (di2=-N; di2 <= N; di2++) {
        sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
      }
      mem1[i2] = sum;
    }
    for (i3=0; i3 < W; i3++)
      putOutput(mem1[i3]);

Optimized OAT domain after memory data layout (figure)
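A quick way to gain confidence in such a layout change is to run both versions on the same input and compare the results. The sketch below does that under loudly stated assumptions: `wrap()` is taken to clamp the index to [0, W-1] (the slides do not define it), `sample()` stands in for `getInput()`, and W, N and the coefficients `c` are arbitrary example values.

```c
#include <assert.h>
#include <stddef.h>

#define W 16   /* number of samples (example value) */
#define N 2    /* filter half-width (example value) */

static const int c[2 * N + 1] = {1, 2, 3, 2, 1};  /* example coefficients */

/* ASSUMPTION: wrap() clamps to [0, w-1]; it is not defined in the slides. */
static int wrap(int i, int w) { return i < 0 ? 0 : (i >= w ? w - 1 : i); }

/* Deterministic stand-in for getInput(). */
static int sample(int i) { return (7 * i + 3) % 11; }

/* Original layout: separate arrays x[W] and y[W], 2*W words in total. */
void filter_two_arrays(int *y)
{
    assert(y != NULL);
    int x[W];
    for (int i1 = 0; i1 < W; i1++)
        x[i1] = sample(i1);
    for (int i2 = 0; i2 < W; i2++) {
        int sum = 0;
        for (int di2 = -N; di2 <= N; di2++)
            sum += c[N + di2] * x[wrap(i2 + di2, W)];
        y[i2] = sum;
    }
}

/* Optimized layout: x lives at mem1[N..N+W-1], y at mem1[0..W-1]. */
void filter_in_place(int *out)
{
    assert(out != NULL);
    int mem1[N + W];
    for (int i1 = 0; i1 < W; i1++)
        mem1[N + i1] = sample(i1);
    for (int i2 = 0; i2 < W; i2++) {
        int sum = 0;
        for (int di2 = -N; di2 <= N; di2++)
            sum += c[N + di2] * mem1[N + wrap(i2 + di2, W)];
        mem1[i2] = sum;   /* overwrites x[i2-N], which is no longer read */
    }
    for (int i3 = 0; i3 < W; i3++)
        out[i3] = mem1[i3];
}

/* Returns 1 if both layouts produce identical output. */
int layouts_agree(void)
{
    int y[W], out[W];
    filter_two_arrays(y);
    filter_in_place(out);
    for (int i = 0; i < W; i++)
        if (y[i] != out[i])
            return 0;
    return 1;
}
```

With a clamping `wrap()`, iteration i2 only reads addresses at or above `mem1[i2]`, while all earlier results sit below it, so the overlap is safe and storage drops from 2*W to N+W words. A `wrap()` that wraps around circularly would read samples that have already been overwritten, so the check is worth repeating for the real definition.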

In-place mapping for cavity detection example
The input image is partly consumed by the time the first results for the output image are ready.
(Figure: index vs. time for image_in and image_out, and address vs. time for the combined image.)

In-place: cavity detection code

Before:

    for (y=0; y<=M+3; ++y) {
      for (x=0; x<N+5; ++x) {
        image_out[x-5][y-3] = …;
        /* code removed */
        … = image_in[x+1][y];
      }
    }

After (input and output share one array):

    for (y=0; y<=M+3; ++y) {
      for (x=0; x<N+5; ++x) {
        image[x-5][y-3] = …;
        /* code removed */
        … = image[x+1][y];
      }
    }

Cavity detection summary
Overall result:
  – Local accesses reduced by factor 3
  – Memory size reduced by factor 5
  – Power reduced by factor 5
  – System bus load reduced by factor 12
  – Performance worsened by factor 6

The last step: ADOPT (Address OPTimization)
Increased execution time introduced by DTSE
  – Complicated address arithmetic (modulo: a%b)
  – Additional complex control flow
Additional transformations needed to
  – Simplify control flow
  – Simplify address arithmetic: common subexpression elimination, modulo expansion, …
  – Match remaining expressions on target machine

ADOPT principles
How to avoid % in address expressions, like

    int A[7];
    for (i=0; i<…; i++)
      … A[i % 7]

Increase buffer size to a power of 2, so that the modulo becomes a bitwise AND (for non-negative i): i % 8 => i & 0x07

Use an if-statement with a separate wrap-around counter:

    int A[7];
    for (i=0, j=0; i<…; i++) {
      … A[j]
      if (++j == 7) j = 0;
    }
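The two replacements can be checked against the plain modulo version with a few lines of C. This is only a sketch: the function names are illustrative, and the mask variant assumes non-negative indices and a buffer enlarged from 7 to 8 entries.

```c
#include <assert.h>
#include <stddef.h>

#define BUF      7   /* original buffer size */
#define BUF_POW2 8   /* buffer enlarged to the next power of two */

/* Reference: index sequence computed with the modulo operator. */
void indices_modulo(int *idx, int n)
{
    assert(idx != NULL);
    for (int i = 0; i < n; i++)
        idx[i] = i % BUF;
}

/* Wrap-around counter: same sequence, no modulo or division. */
void indices_counter(int *idx, int n)
{
    assert(idx != NULL);
    for (int i = 0, j = 0; i < n; i++) {
        idx[i] = j;
        if (++j == BUF)
            j = 0;
    }
}

/* Power-of-two buffer: i % 8 becomes a bitwise AND with 0x07. */
void indices_mask(int *idx, int n)
{
    assert(idx != NULL);
    for (int i = 0; i < n; i++)
        idx[i] = i & (BUF_POW2 - 1);
}

/* Returns 1 if both index sequences are identical. */
int same_sequence(const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

The counter version keeps the 7-entry buffer at the price of one extra compare per iteration; the mask version removes the compare but spends one extra buffer entry.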

ADOPT principles: CSE
Example: full-search motion estimation, applying Common Subexpression Elimination (CSE). Algebraic transformations at word level.
(… marks constants and index expressions that are missing from the transcript.)

Before:

    for (i=-8; i<=8; i++) {
      for (j=-4; j<=3; j++) {
        for (k=-4; k<=3; k++)
          A[((208+i)*257+8+j)*… i+k] = B[(8+j)*… i+k];
      }
      dist += A[3096] - B[((208+i)*257+4)*… i-4];
    }

After:

    for (i=-8; i<=8; i++) {
      for (j=-4; j<=3; j++) {
        for (k=-4; k<=3; k++) {
          B[…] = A[…];
        }
      }
      dist += A[…] - B[…];
    }

with the extracted subexpressions

    cse1 = (33025*i …)*2;
    cse3 = 1040+i;
    cse4 = j*…;
    cse5 = k+cse4;

and index expressions built from them, such as cse5+cse1.
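Since the constants of the slide's example did not survive, the following sketch shows the same idea on a made-up address expression: parts of the index that depend only on outer loop variables are computed once and reused, in the spirit of the `cse` variables above. `WIDTH`, `OFF`, the expression itself and the function names are hypothetical.

```c
#include <assert.h>

#define WIDTH 64   /* hypothetical row width */
#define OFF   16   /* hypothetical offset that keeps addresses non-negative */

/* Address recomputed in full in the innermost loop. */
unsigned long trace_naive(void)
{
    unsigned long acc = 0;
    for (int i = -8; i <= 8; i++)
        for (int j = -4; j <= 3; j++)
            for (int k = -4; k <= 3; k++) {
                int addr = (OFF + i) * WIDTH * 2 + j * WIDTH + k;
                assert(addr >= 0);
                acc = acc * 31u + (unsigned long)addr;  /* order-sensitive checksum */
            }
    return acc;
}

/* Same addresses, with shared subexpressions computed once per outer iteration. */
unsigned long trace_cse(void)
{
    unsigned long acc = 0;
    for (int i = -8; i <= 8; i++) {
        int cse1 = (OFF + i) * WIDTH * 2;     /* depends on i only */
        for (int j = -4; j <= 3; j++) {
            int cse4 = cse1 + j * WIDTH;      /* depends on i and j */
            for (int k = -4; k <= 3; k++) {
                int addr = cse4 + k;          /* one addition per access */
                assert(addr >= 0);
                acc = acc * 31u + (unsigned long)addr;
            }
        }
    }
    return acc;
}
```

Both functions visit the same address sequence, so their checksums are equal; the second one needs a single addition per access instead of two multiplications and three additions.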

Conclusion on Data Management
In multi-media applications, exploring data transfer and storage issues should be done at source code level
DMM method
  – Reducing number of external memory accesses
  – Reducing external memory size
  – Trade-offs between internal memory complexity and speed
  – Platform independent high-level transformations
  – Platform dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
  – Substantial energy reduction