Embedded Systems in Silicon TD5102 Data Management (3) SCBD, MAA, and Data Layout Henk Corporaal Technical University Eindhoven DTI / NUS Singapore 2005/2006
Part 3 overview: Recap on design flow; Platform-dependent steps – SCBD: Storage Cycle Budget Distribution – MAA: Memory Allocation and Assignment – Data layout techniques for RAM – Data layout techniques for caches; Results; Conclusions. Thanks to the IMEC DTSE people.
Design flow (figure): SW design flow and HW design flow joined by SW/HW co-design; steps: Concurrent OO spec, Remove OO overhead, Dynamic memory mgmt, Task concurrency mgmt, Physical memory mgmt, Address optimization.
DM steps: C-in → Preprocessing → Dataflow transformations → Loop transformations → Data reuse → Memory hierarchy layer assignment → Cycle budget distribution → Memory allocation and assignment → Data layout → C-out, followed by Address optimization.
Result of memory hierarchy assignment for cavity detection (figure: L3 = 1 MB SDRAM, L2 = 16 KB cache, L1 = 128 B register file; arrays image_in, gauss_x, gauss_xy, comp_edge, image_out of size N*M with intermediate copies of sizes M*3, 3*3, 3*1, 1*1 between the layers).
Data-reuse – cavity detection code, after reuse transformation (partly; some loop-bound conditions were garbled in extraction and are reconstructed here):
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      in_pixels[x%3] = image_in[x][y];
    /* copy rest of in_pixels in row */
    if (x>=0 && x<N-1 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3] = image_in[x+1][y];
    if (x>=1 && x<N && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
  }
}
Storage Cycle Budget Distribution & Memory Allocation and Assignment
Define the memory organization which can provide enough bandwidth at minimal cost
Lower required performance by balancing bandwidth: reduce the max. number of loads/stores per cycle (figure: required memory bandwidth over time, high peak before vs. balanced low profile after).
Data management approach: one of the many possible schedules (figure).
Data management approach (figure).
Conflict cost calculation: self conflicts; chromatic number; number of conflicts.
Self conflict requires a dual-port memory.
Chromatic number of the conflict graph = minimum # of single-port memories.
Low number of conflicts gives large assignment freedom.
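The chromatic-number bound above can be illustrated with a greedy first-fit coloring of a conflict graph. A minimal sketch (the 5-array graph and the greedy heuristic are illustrative, not the actual tool's algorithm; greedy coloring only gives an upper bound on the chromatic number):

```c
#include <assert.h>
#include <string.h>

#define NARR 5  /* arrays A..E in a hypothetical conflict graph */

/* Greedy first-fit coloring; returns the number of colors used.
   Each color corresponds to one single-port memory: two arrays
   that conflict (are accessed in the same cycle) must get
   different colors. */
int greedy_colors(int adj[NARR][NARR], int color[NARR]) {
    int used = 0;
    for (int v = 0; v < NARR; ++v) {
        int taken[NARR];
        memset(taken, 0, sizeof taken);
        for (int u = 0; u < v; ++u)
            if (adj[v][u]) taken[color[u]] = 1;
        int c = 0;
        while (taken[c]) ++c;          /* smallest free color */
        color[v] = c;
        if (c + 1 > used) used = c + 1;
    }
    return used;
}
```

For example, a conflict cycle A-B-C-D-A with E unconflicted needs only 2 single-port memories.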
Conflict Directed Ordering is used for flat-graph scheduling: reduce access intervals until all conflicts are known; driven by the cost of conflicts; constructive algorithm (figure: R/W accesses to arrays A–D being placed into time slots).
Local optimization does not imply global optimization.
Budget distribution has large impact on memory cost
Decreasing basic block length until target cycle budget is met
Obtain more freedom by merging loops: more scheduling freedom; extension to different threads.
Memory allocation and assignment
Memory Allocation and Assignment substeps: 1. memory allocation; 2. array-to-memory assignment; 3. port assignment and bus sharing (figure: arrays A–D mapped onto the allocated memories).
Influence of MAA: assign arrays to memories and define the memory interconnect to minimize power and area. Per array: bitwidth, address range; per memory organization: nr. of memories, nr. of ports (R/W/RW), bitwidth (maximum), size (figure: arrays A, B, …, K, L assigned to MEMORY-1 … MEMORY-N).
Trade-offs in the physical memory: trade off area and power for the required bandwidth (figure: arrays A–D in alternative memory configurations with different area/power points).
Example of bus sharing possibilities (figure: parallel access pairs such as R(A)R(B), W(C)R(A), W(A)W(C) and three alternative assignments of arrays A, B, C to memories m1–m3, some combinations marked infeasible).
Decreasing cycle budget limits freedom and raises cost
Resulting Pareto curve for DAB synchro application
Example conflict graph for cavity detection
MAA result (figure): Power: On-chip area:
Data layout: how to put data into memory.
Memory data layout for custom and cache architectures (figure: arrays A–D and F–H placed into MEM1/MEM2 of a custom PE, vs. copies A', B' held in the cache of a cache-based PE).
Intra-array in-place mapping reduces the size of one array at a time; window = max nr. of live elements (figure: rows i-1 and i of a[][] with the sliding window):
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);
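The window equals the maximum number of simultaneously live elements. A minimal liveness-bookkeeping sketch (assuming row 0 of a[][] is initialized before the loop and each a[i-1][j] dies right after its single use) reproduces the 6-element window for this loop:

```c
#include <assert.h>

/* Max simultaneously live elements for:
   for (i=1; i<5; i++) for (j=0; j<5; j++) a[i][j] = f(a[i-1][j]);
   Row 0 is assumed live at the start (5 values). */
static int max_live_elements(void) {
    int live = 5, max_live = live;
    for (int i = 1; i < 5; ++i)
        for (int j = 0; j < 5; ++j) {
            ++live;                        /* a[i][j] produced */
            if (live > max_live) max_live = live;
            --live;                        /* a[i-1][j] dead after its only use */
        }
    return max_live;
}
```

At any point the live set is the unread tail of row i-1 plus the already-produced prefix of row i, which peaks at 6 elements.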
Two-phase mapping of array elements onto addresses: the storage order maps variable domains to abstract addresses; allocation maps abstract addresses to real addresses (figure: arrays A, B, C mapped to abstract addresses aA, aB, aC and then into memory).
H.C. TD a2a2 a1a1 a=3a 1 +a 2 a=3(1-a 1 )+a 2 a=3a 1 +(2-a 2 ) a=3(1-a 1 )+(2-a 2 ) a=2a 2 +a 1 a=2a 2 +(1-a 1 )a=2(2-a 2 )+(1-a 1 ) a=2(2-a 2 )+a 1 a a=??? memory address variable domain Exploration of storage orders for 2-dimensional array ??????
Chosen storage order determines window size.
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);
Row-major ordering, a = 5i+j:
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*i+j] = f(a[5*i+j-5]);
highest live address 5*i+j, lowest live address 5*i+j-5; difference + 1 = window: 6.
Column-major, a = 5j+i:
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*j+i] = f(a[5*j+i-1]);
highest live address 5*4+i-1, lowest live address 5*0+i-1; difference + 1 = window: 21.
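With the row-major window of 6, the 25-element array can be folded modulo 6 without changing the results. A runnable sketch (the function f and the row-0 initialization are made up for illustration):

```c
#include <assert.h>

static int f(int v) { return 2 * v + 1; }   /* placeholder for f() */

/* Reference: full 5x5 array, row-major. */
static void run_full(int out[5]) {
    int a[5][5];
    for (int j = 0; j < 5; ++j) a[0][j] = j;      /* assumed init of row 0 */
    for (int i = 1; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            a[i][j] = f(a[i-1][j]);
    for (int j = 0; j < 5; ++j) out[j] = a[4][j];
}

/* Windowed: row-major address 5*i+j folded modulo the window (6). */
static void run_windowed(int out[5]) {
    int a[6];
    for (int j = 0; j < 5; ++j) a[(5*0 + j) % 6] = j;
    for (int i = 1; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            a[(5*i + j) % 6] = f(a[(5*i + j - 5) % 6]);
    for (int j = 0; j < 5; ++j) out[j] = a[(5*4 + j) % 6];
}
```

The fold is safe because the producing and consuming addresses differ by 5, which is never 0 modulo 6, so a value is never overwritten before its use.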
Static allocation: no in-place mapping (figure: arrays A–E each at a fixed address range aA…aE; memory size = sum of the full array sizes).
H.C. TD C Memory Size A D B E Static, windowed C Memory Size A D B E Dynamic, windowed Windowed Allocation: intra-array in-place mapping
Dynamic allocation: inter-array in-place mapping (figure: arrays A–E share addresses over time; memory size below the sum of the array sizes).
Dynamic allocation strategy with common window (figure: arrays A–E dynamically placed within one common window).
Expressing memory data layout in source code. Example: B is an array of 10x20 elements; A: offset 120, no window; B: storage order [20, 2], offset 134, window 78.
Before:
bit8 B[10][20];
bit6 A[30];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[x][y] = …;
  }
After:
bit8 memory[334];
bit8* B = (bit8*)&memory[134];
bit6* A = (bit6*)&memory[120];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[(x*20+y*2)%78] = …;
  }
Example of memory data layout for storage size reduction:
int x[W], y[W];
for (i1=0; i1 < W; i1++)
  x[i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <= N; di2++)
    sum += c[N+di2] * x[wrap(i2+di2,W)];
  y[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(y[i3]);
Occupied address-time domain of x[] and y[]
Optimized source code after memory data layout:
int mem1[N+W];
for (i1=0; i1 < W; i1++)
  mem1[N+i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <= N; di2++)
    sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
  mem1[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(mem1[i3]);
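A runnable sketch checking that the single-buffer version computes the same outputs as the two-array version. W, N, the coefficients c[] and the input values are made up; wrap() is not defined on the slides, so it is assumed here to clamp the index into [0, W-1] (with a circular modulo, x[0] would stay live until the last iteration and this particular in-place layout would not be valid):

```c
#include <assert.h>

#define W 8
#define NTAP 2
static const int c[2*NTAP + 1] = {1, 2, 3, 2, 1};   /* made-up coefficients */

/* Assumed semantics: clamp the index into [0, w-1]. */
static int wrap(int a, int w) { return a < 0 ? 0 : (a >= w ? w - 1 : a); }

/* Two-array version (x[], y[]). */
static void filter_two_arrays(const int in[W], int y[W]) {
    int x[W];
    for (int i1 = 0; i1 < W; i1++) x[i1] = in[i1];
    for (int i2 = 0; i2 < W; i2++) {
        int sum = 0;
        for (int di2 = -NTAP; di2 <= NTAP; di2++)
            sum += c[NTAP + di2] * x[wrap(i2 + di2, W)];
        y[i2] = sum;
    }
}

/* Single-buffer version (mem1[NTAP+W]); y[i2] overwrites only
   x values that are already dead. */
static void filter_in_place(const int in[W], int out[W]) {
    int mem1[NTAP + W];
    for (int i1 = 0; i1 < W; i1++) mem1[NTAP + i1] = in[i1];
    for (int i2 = 0; i2 < W; i2++) {
        int sum = 0;
        for (int di2 = -NTAP; di2 <= NTAP; di2++)
            sum += c[NTAP + di2] * mem1[NTAP + wrap(i2 + di2, W)];
        mem1[i2] = sum;
    }
    for (int i3 = 0; i3 < W; i3++) out[i3] = mem1[i3];
}
```

The N-word shift leaves exactly enough room: y[i2] lands at address i2, which only clobbers x[i2-N-1] and earlier, values no longer read by later iterations.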
Optimized OAT domain after memory data layout
In-place mapping for cavity detection example: the input image is partly consumed by the time the first results for the output image are ready (figures: occupied address-time domains of image_in, image_out, and the merged array image).
In-place – cavity detection code. Before (separate input and output arrays):
for (y=0; y<=M+3; ++y)
  for (x=0; x<N+5; ++x) {
    image_out[x-5][y-3] = …; /* code removed */
    … = image_in[x+1][y];
  }
After (image_in and image_out merged into one array image):
for (y=0; y<=M+3; ++y)
  for (x=0; x<N+5; ++x) {
    image[x-5][y-3] = …; /* code removed */
    … = image[x+1][y];
  }
Cavity detection summary. Overall result: local accesses reduced by factor 3; memory size reduced by factor 5; power reduced by factor 5; system bus load reduced by factor 12; performance worsened by factor 6.
Data layout for caches. Caches are hardware controlled; therefore no explicit copy code is needed! What can we do?
Cache principles (figure: the CPU issues a p-bit address split into a tag (p-k-m bits), an index (k bits) and a byte address (m bits); the cache holds 2^k lines (blocks) of 2^m bytes; the stored tag is compared with the address tag to decide Hit?; on a miss the line is fetched from main memory).
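The tag / index / byte-offset split from the figure can be sketched directly in code. The sizes here are made up for illustration: 16-bit addresses (p = 16), 2^k = 8 lines (k = 3), 2^m = 4 bytes per line (m = 2):

```c
#include <assert.h>
#include <stdint.h>

#define M_BITS 2               /* 2^m = 4 bytes per cache line */
#define K_BITS 3               /* 2^k = 8 lines in the cache   */

typedef struct { unsigned tag, index, byte; } AddrParts;

/* Split a p-bit address into tag | index | byte offset. */
static AddrParts split_address(uint16_t addr) {
    AddrParts a;
    a.byte  =  addr               & ((1u << M_BITS) - 1);
    a.index = (addr >> M_BITS)    & ((1u << K_BITS) - 1);
    a.tag   =  addr >> (M_BITS + K_BITS);
    return a;
}
```

For example, address 0xB7 (1011 0111) splits into tag 5, index 5, byte offset 3.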
Cache Architecture Fundamentals: Block placement – where in the cache will a new block be placed? Block identification – how is a block found in the cache? Block replacement policy – which block is evicted from the cache? Updating policy – how is a block written from cache to memory?
Block placement policies (figure): direct mapped (one-to-one) – a memory block can go to one cache line only ("here only!"); fully associative (one-to-many) – anywhere in the cache.
Direct-mapped cache (figure: address bit positions split into tag, index and byte offset; valid bit and stored tag are checked to produce Hit and Data).
Taking advantage of spatial locality: direct-mapped cache with larger blocks (figure: address bit positions).
Performance: increasing the block size tends to decrease the miss rate (figure).
Set-associative cache (figure).
Performance (figure: miss rate for 1 KB, 2 KB and 8 KB caches).
Cache Fundamentals: the "Three C's". Compulsory misses – 1st access to a block: never in the cache. Capacity misses – the cache cannot contain all the blocks; blocks are discarded and retrieved later; avoided by increasing cache size. Conflict misses – too many blocks mapped to the same set; avoided by increasing associativity.
Compulsory miss example:
for (i=0; i<10; i++)
  A[i] = f(B[i]);
(figure: at i=2 the cache holds B[0..2] and A[0..2]; at i=3, B[3] and A[3] are required – B[3] was never loaded before, so it is loaded into the cache; A[3] was never loaded before, so a new line is allocated).
Capacity miss example (cache size: 8 blocks of 1 word, fully associative):
for (i=0; i<N; i++)
  A[i] = B[i+3]+B[i];
(figure: cache contents for i=0..7) Result: 11 compulsory misses (+8 write misses), 5 capacity misses.
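A minimal fully associative LRU simulator reproduces the miss counts for i = 0..7. Assumptions: one-word blocks, write-allocate (a write miss also loads the block), and made-up disjoint base addresses for A[] and B[]:

```c
#include <assert.h>
#include <string.h>

#define BLOCKS 8
#define NO_ADDR (-1)

static int cache_tag[BLOCKS];         /* address held by each block */
static int cache_age[BLOCKS];         /* LRU timestamps             */
static int now;

static void cache_reset(void) {
    for (int b = 0; b < BLOCKS; ++b) cache_tag[b] = NO_ADDR;
    memset(cache_age, 0, sizeof cache_age);
    now = 0;
}

/* Access one word; returns 1 on a miss. Fully associative, LRU. */
static int cache_access(int addr) {
    int lru = 0;
    ++now;
    for (int b = 0; b < BLOCKS; ++b) {
        if (cache_tag[b] == addr) { cache_age[b] = now; return 0; }
        if (cache_age[b] < cache_age[lru]) lru = b;
    }
    cache_tag[lru] = addr;            /* evict least recently used */
    cache_age[lru] = now;
    return 1;
}
```

Replaying the access stream B[i+3], B[i], A[i] for i = 0..7 and classifying each read miss as compulsory (first touch) or capacity (re-touch) yields 11 compulsory misses, 5 capacity misses and 8 write misses, matching the figure.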
Conflict miss example:
for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i]+B[i][j];
(figure: memory and cache addresses of A[0..3] and B[0..3][0..9]; A[i] is read multiple times, but A[0] is flushed in favor of B[0][j], which maps to the same cache address, so every j causes a miss).
"Three C's" vs cache size [Gee93] (figure).
Data layout may reduce cache misses
Example 1: capacity & compulsory miss reduction. Same example as before (8-block fully associative cache):
for (i=0; i<N; i++)
  A[i] = B[i+3]+B[i];
(figure: cache contents for i=0..7) 11 compulsory misses (+8 write misses), 5 capacity misses.
Fit data in cache with in-place mapping:
for (i=0; i<12; i++)
  A[i] = B[i+3]+B[i];
Traditional analysis: max = 27 words; detailed analysis: max = 15 words, so the merged array AB fits a 16-word cache (figure: address-time domains of A[] and B[] folded into AB[new] in main memory).
Remove capacity/compulsory misses with in-place mapping:
for (i=0; i<N; i++)
  AB[i] = AB[i+3]+AB[i];
(figure: cache contents for i=0..7) Result: 11 compulsory misses, 5 cache hits (+8 write hits).
Example 2: conflict miss reduction. Same example as before:
for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i]+B[i][j];
(figure: A[0] and B[0][j] map to the same cache address, so A[0] is repeatedly flushed and misses).
Avoid conflict miss with main-memory data layout (© imec 2001): leave a gap in main memory so that A[0..3] and the B[i][j] elements accessed with them no longer map to the same cache address; A[i] is read multiple times without conflict (figure: memory layout with gap, cache at i=0 for any j).
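The effect of the gap can be shown with a small direct-mapped simulator. Assumptions (all made up for illustration): 16 one-word blocks, A[] at base address 0, B[] at base 4, and the gap realized by padding B's row stride to the cache size so B never lands in A's sets:

```c
#include <assert.h>

#define SETS 16                     /* direct-mapped, 1 word per block */

static int dm_tag[SETS];
static int dm_misses;

static void dm_reset(void) {
    for (int s = 0; s < SETS; ++s) dm_tag[s] = -1;
    dm_misses = 0;
}

static void dm_access(int addr) {
    int set = addr % SETS;
    if (dm_tag[set] != addr) { dm_tag[set] = addr; ++dm_misses; }
}

/* Run  for(j) for(i) A[i] = A[i] + B[i][j]  with the given row
   stride for B, counting misses: read A[i], read B[i][j], write A[i]. */
static int run_kernel(int b_base, int b_stride) {
    dm_reset();
    for (int j = 0; j < 10; ++j)
        for (int i = 0; i < 4; ++i) {
            dm_access(0 + i);                       /* read A[i]    */
            dm_access(b_base + b_stride * i + j);   /* read B[i][j] */
            dm_access(0 + i);                       /* write A[i]   */
        }
    return dm_misses;
}
```

With the packed stride 10, parts of B alias A's sets 0..3 and evict A[i] between uses; with stride padded to 16, B stays in sets 4..13, so only the 44 compulsory misses (4 for A, 40 for B) remain.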
Data Layout Organization for Direct Mapped Caches
Conclusion on Data Management. In multimedia applications, data transfer and storage issues should be explored at source-code level. DMM method: reducing the number of external memory accesses; reducing external memory size; trade-offs between internal memory complexity and speed; platform-independent high-level transformations; platform-dependent transformations exploit platform characteristics (efficient use of memory, cache, …); substantial energy reduction.