Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout.



H.C. Platform-based Design 5KK70

Part 3 overview
Recap on design flow
Platform dependent steps
–SCBD: Storage Cycle Budget Distribution
–MAA: Memory Allocation and Assignment
–Data layout techniques for RAM
–Data layout techniques for Caches
Results
Conclusions
Thanks to the IMEC DTSE people

[Figure: design flow diagram with the boxes Concurrent OO spec, Remove OO overhead, Dynamic memory mgmt, Task concurrency mgmt, Physical memory mgmt, Address optimization, SW/HW co-design, SW design flow and HW design flow]

DM steps
C-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse
Memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
C-out
Address optimization
[The slide marks a subset of these steps as "Today".]

Result of memory hierarchy assignment for cavity detection
[Figure: the arrays image_in, gauss_x, gauss_xy, comp_edge and image_out mapped onto three layers, L3 (1 MB SDRAM), L2 (16 KB cache) and L1 (128 B register file), annotated with copy sizes and transfer counts such as N*M, N*M*3, N*M*8, M*3, 3*3, 3*1 and 1*1]

Data reuse: cavity detection code

for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      in_pixels[x%3] = image_in[x][y];
    /* copy rest of in_pixels in row */
    if (x>=0 && x<… && y>=1 && y<=M-2)   /* upper bound on x lost in the transcript */
      in_pixels[(x+1)%3] = image_in[x+1][y];
    if (x>=1 && x<… && y>=1 && y<=M-2) { /* upper bound on x lost in the transcript */
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
    /* ... */

Code after reuse transformation (partly)

Storage Cycle Budget Distribution & Memory Allocation and Assignment

Define the memory organization that provides enough bandwidth at minimal cost.

Balancing memory bandwidth
Reduce the maximum number of loads/stores per cycle.
[Figure: two plots of required memory bandwidth versus time, one with a high peak and one with a low, balanced peak]

Data management approach
[Figure: one of the many possible schedules]

Data management approach
[Figure]

Conflict cost calculation
Key issues:
–Number of conflicts
–Self conflicts
–Chromatic number (at least the size of the maximum clique)

Self conflict → dual port memory

Chromatic number → minimum number of single port memories

Low number of conflicts → large assignment freedom

Conflict Directed Ordering is used for flat graph scheduling
–Reduce intervals until all conflicts are known
–Driven by cost of conflicts
–Constructive algorithm
[Figure: accesses R(A), W(A), R(B), W(B), R(C), W(C), R(D) and W(D) placed in time slots]

Local optimization is not good for global optimization

Budget distribution has large impact on memory cost

Decreasing basic block length until target cycle budget is met

Obtain more freedom by merging loops
–More scheduling freedom
–Extension to different threads

Memory allocation and assignment

Memory Allocation and Assignment Substeps
1. Memory Allocation
2. Array-to-memory Assignment
3. Port Assignment and Bus Sharing
[Figure: arrays A, B, C and D shown at each substep]

Influence of MAA
Inputs per array: bitwidth, address range
Decisions: number of memories, number of ports, assignment of arrays to memories, memory interconnect
Goal: minimize power and area
[Figure: arrays A, B, …, K, L assigned to MEMORY-1 … MEMORY-N, each memory characterized by its (maximum) bitwidth, size and number of ports (R/W/RW)]

Example of bus sharing possibilities
Parallel accesses in the schedule: R(A) R(B); R(B) W(A); W(C) R(A); R(A) W(B); W(A) W(B); W(A) W(C)
[Figure: three alternative bus configurations connecting the arrays A, B and C to the memories m1, m2 and m3]

Decreasing cycle budget limits freedom and raises cost

Resulting Pareto curve for DAB synchro application
[Figure: energy cost curve]

Example conflict graph for cavity detection
[Figure]

MAA result
[Figure: power and on-chip area results; the values are not preserved in the transcript]

Data layout: how to put data into memory

Memory data layout for custom and cache architectures
[Figure: arrays A, B, C placed in MEM1 and F, G, H placed in MEM2 of a custom memory architecture, and copies A', B' placed in the CACHE next to the PE; question marks indicate the open placement choices]

Intra-array in-place mapping reduces the size of one array at a time

for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);

[Figure: rows i-1 and i of the array over j; the window is the maximum number of live elements]

Two-phase mapping of array elements onto addresses
[Figure: the variable domains of arrays A, B and C are mapped by the storage order onto abstract addresses a_A, a_B and a_C, which allocation then maps onto real addresses]

Exploration of storage orders for 2-dimensional array
[Figure: variable domain with indices $a_1$ and $a_2$ mapped onto memory address $a$]
Candidate storage orders:
$a = 3a_1 + a_2$
$a = 3(1 - a_1) + a_2$
$a = 3a_1 + (2 - a_2)$
$a = 3(1 - a_1) + (2 - a_2)$
$a = 2a_2 + a_1$
$a = 2a_2 + (1 - a_1)$
$a = 2(2 - a_2) + a_1$
$a = 2(2 - a_2) + (1 - a_1)$

Chosen storage order determines window size

for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);

Row-major ordering: a = 5i + j

for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*i+j] = f(a[5*i+j-5]);

Highest live address: 5*i+j
Lowest live address: 5*i+j-5
Window = difference + 1 = 6

Column-major ordering: a = 5j + i

for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*j+i] = f(a[5*j+i-1]);

Highest live address: 5*4+i-1
Lowest live address: 5*0+i-1
Window = difference + 1 = 21

Static allocation: no in-place mapping
[Figure: memory size versus time for arrays A to E, each array with its own address range a_A … a_E]

Windowed allocation: intra-array in-place mapping
[Figure: memory size for arrays A to E in two variants, "static, windowed" and "dynamic, windowed"]

Dynamic allocation: inter-array in-place mapping
[Figure: memory size versus time for arrays A to E with address ranges a_A … a_E sharing memory over time]

Dynamic allocation strategy with common window
[Figure: memory size for arrays A to E, "dynamic, common window"]

Expressing memory data layout in source code

Example: array of 10x20 elements
A: offset 120, no window
B: storage order [20, 2], offset 134, window 78

Before:

bit8 B[10][20];
bit6 A[30];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[x][y] = …;
  }

After:

bit8 memory[334];
bit8* B = (bit8*)&memory[134];
bit6* A = (bit6*)&memory[120];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[(x*20+y*2)%78] = …;
  }

Example of memory data layout for storage size reduction

int x[W], y[W];
for (i1=0; i1 < W; i1++)
  x[i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <= N; di2++) {
    sum += c[N+di2] * x[wrap(i2+di2,W)];
  }
  y[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(y[i3]);

Occupied address-time domain of x[] and y[]
[Figure]

Optimized source code after memory data layout

int mem1[N+W];
for (i1=0; i1 < W; i1++)
  mem1[N+i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <= N; di2++) {
    sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
  }
  mem1[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(mem1[i3]);

Optimized OAT (occupied address-time) domain after memory data layout
[Figure]

In-place mapping for cavity detection example
The input image is partly consumed by the time the first results for the output image are ready.
[Figure: index-versus-time plots for Image_in and Image_out, combined into one address-versus-time plot for the shared array Image]

In-place: cavity detection code

Before:

for (y=0; y<=M+3; ++y) {
  for (x=0; x<N+5; ++x) {
    image_out[x-5][y-3] = …;
    /* code removed */
    … = image_in[x+1][y];
  }
}

After:

for (y=0; y<=M+3; ++y) {
  for (x=0; x<N+5; ++x) {
    image[x-5][y-3] = …;
    /* code removed */
    … = image[x+1][y];
  }
}

Cavity detection summary
Overall result:
–Local accesses reduced by factor 3
–Memory size reduced by factor 5
–Power reduced by factor 5
–System bus load reduced by factor 12
–Performance worsened by factor 6

Data layout for caches
Caches are hardware controlled.
Therefore: no explicit copy code needed!
What can we do?

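Cache principles
[Figure: the CPU address is split into a tag of p-k-m bits, an index of k bits and a byte address of m bits. The cache holds 2^k lines; each line stores a tag of p-k-m bits and a block (cache line) of 2^m bytes of data. Comparing the stored tag with the address tag produces the hit signal; misses are served from main memory.]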

Cache Architecture Fundamentals
Block placement
–Where in the cache will a new block be placed?
Block identification
–How is a block found in the cache?
Block replacement policy
–Which block is evicted from the cache?
Updating policy
–How is a block written from cache to memory?

Block placement policies
[Figure: mapping of a memory block into the cache. Fully associative (one-to-many): anywhere in the cache. Direct mapped (one-to-one): here only.]

Direct mapped cache
[Figure: the address (bit positions) is split into tag, index and byte offset; each cache entry holds a valid bit, a tag and data; the outputs are hit and data]

Direct mapped cache: larger blocks
Taking advantage of spatial locality.
[Figure: address (bit positions)]

Performance
Increasing the block size tends to decrease miss rate.
[Figure]

…-way associative cache
[Figure; the associativity in the slide title is not preserved in the transcript]

Performance
[Figure: curves for cache sizes of 1 KB, 2 KB and 8 KB]

Cache Fundamentals: the "Three C's"
Compulsory misses
–First access to a block: never in the cache
Capacity misses
–Cache cannot contain all the blocks
–Blocks are discarded and retrieved later
–Avoided by increasing cache size
Conflict misses
–Too many blocks mapped to same set
–Avoided by increasing associativity

Compulsory miss example

for (i=0; i<10; i++)
  A[i] = f(B[i]);

Cache contents after i=2: A[0] B[1] B[2] B[0] A[1] A[2]
At i=3, B[3] and A[3] are required:
–B[3] never loaded before → loaded into cache
–A[3] never loaded before → allocates new line

Capacity miss example
Cache size: 8 blocks of 1 word, fully associative

for (i=0; i<N; i++)
  A[i] = B[i+3]+B[i];

Cache contents after each iteration:
i=0: B[3] B[0] A[0]
i=1: B[3] B[0] A[0] B[4] B[1] A[1]
i=2: A[2] B[0] A[0] B[4] B[1] A[1] B[5] B[2]
i=3: A[2] B[6] B[3] A[3] B[1] A[1] B[5] B[2]
i=4: A[2] B[6] B[3] A[3] B[7] B[4] A[4] B[2]
i=5: B[5] A[5] B[3] A[3] B[7] B[4] A[4] B[8]
i=6: B[5] A[5] B[9] B[6] A[6] B[4] A[4] B[8]
i=7: B[5] A[5] B[9] B[6] A[6] B[10] B[7] A[7]

11 compulsory misses (+8 write misses)
5 capacity misses

Conflict miss example

for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: main memory holding A[0] … A[3] followed by B[0][0], B[1][0], … up to B[3][9], with each memory address mapped onto a cache address; A[0] and B[0][j] compete for cache address 0. The cache contents at i=0 are shown for j even and j odd.]
A[i] is read 10 times.
A[0] is flushed in favor of B[0][j] → miss, so A[0] is loaded multiple times.

"Three C's" vs cache size [Gee93]
[Figure]

Data layout may reduce cache misses

Example 1: Capacity & compulsory miss reduction

for (i=0; i<N; i++)
  A[i] = B[i+3]+B[i];

Cache contents after each iteration:
i=0: B[3] B[0] A[0]
i=1: B[3] B[0] A[0] B[4] B[1] A[1]
i=2: A[2] B[0] A[0] B[4] B[1] A[1] B[5] B[2]
i=3: A[2] B[6] B[3] A[3] B[1] A[1] B[5] B[2]
i=4: A[2] B[6] B[3] A[3] B[7] B[4] A[4] B[2]
i=5: B[5] A[5] B[3] A[3] B[7] B[4] A[4] B[8]
i=6: B[5] A[5] B[9] B[6] A[6] B[4] A[4] B[8]
i=7: B[5] A[5] B[9] B[6] A[6] B[10] B[7] A[7]

11 compulsory misses (+8 write misses)
5 capacity misses

Fit data in cache with in-place mapping

for (i=0; i<12; i++)
  A[i] = B[i+3]+B[i];

Traditional analysis: max = 27 words (12 words of A[] plus 15 words of B[])
Detailed analysis: max = 15 words
[Figure: number of live words of A[] and B[] versus i, and the merged array AB[new]; labels: Cache Memory, Main Memory, (16 words)]

Remove capacity / compulsory misses with in-place mapping

for (i=0; i<N; i++)
  AB[i] = AB[i+3]+AB[i];

Cache contents after each iteration:
i=0: AB[3] AB[0]
i=1: AB[3] AB[0] AB[4] AB[1]
i=2: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2]
i=3: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6]
i=4: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=5: AB[3] AB[8] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=6: AB[3] AB[8] AB[4] AB[9] AB[5] AB[2] AB[6] AB[7]
i=7: AB[3] AB[8] AB[4] AB[9] AB[5] AB[10] AB[6] AB[7]

11 compulsory misses
5 cache hits (+8 write hits)

Example 2: Conflict miss reduction

for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: main memory holding A[0] … A[3] followed by B[0][0], B[1][0], … up to B[3][9], with each memory address mapped onto a cache address; A[0] and B[0][j] compete for cache address 0. The cache contents at i=0 are shown for j even and j odd.]
A[i] is read 10 times.
A[0] is flushed in favor of B[0][j] → miss, so A[0] is loaded multiple times.

Avoid conflict miss with main memory data layout

for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i]+B[i][j];

[Figure: main memory holding A[0] … A[3] and the elements of B with a gap left between them ("Leave gap"), so that B[0][j] no longer maps onto the cache address of A[0]; cache contents at i=0 shown for any j]
A[i] is read multiple times: no conflict.
© imec 2001

Data Layout Organization for Direct Mapped Caches
[Figure]

Conclusion on Data Management
In multimedia applications, data transfer and storage issues should be explored at source code level.
DMM method
–Reducing number of external memory accesses
–Reducing external memory size
–Trade-offs between internal memory complexity and speed
–Platform independent high-level transformations
–Platform dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
–Substantial energy reduction