Published by Aubrie Wilkerson. Modified over 9 years ago.
1
Memory Organization and Data Layout for Instruction Set Extensions with Architecturally Visible Storage
Panagiotis Athanasopoulos (EPFL), Philip Brisk (UCR), Yusuf Leblebici (EPFL), Paolo Ienne (EPFL)
École Polytechnique Fédérale de Lausanne (EPFL)
University of California, Riverside (UCR)
First_name.Second_name@{epfl.ch|ucr.edu}
2
Motivation
Classic challenge: increase performance while keeping area and cost within constraints
Typical solutions: customizable and extensible processors
  Instruction set extension (ISE)
  Custom functional units (CFU)
  Architecturally visible storage (AVS)
3
Typical embedded application extract: 2D DCT on an 8x8 matrix
Pseudo:
dct {
  for (int i = 0; i < num_of_rows; i++) {
    ... 1D DCT slice ...
  }
  for (int j = 0; j < num_of_columns; j++) {
    ... 1D DCT slice ...
  }
}
4
Typical embedded application extract: 2D DCT on an 8x8 matrix
for (int i = 0; i < num_of_rows; i++) { ... 1D DCT slice ... }
Row accesses: each 1D DCT slice reads one row. Entry i,j is the data accessed in row i, column j.
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7
5
Typical embedded application extract: 2D DCT on an 8x8 matrix
for (int j = 0; j < num_of_columns; j++) { ... 1D DCT slice ... }
Column accesses: each 1D DCT slice reads one column. Entry i,j is the data accessed in row i, column j.
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7
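The two loops above can be sketched in Python as a row pass followed by a column pass. This is only an illustration: the slides do not say which DCT variant the slice computes, so an unnormalized DCT-II is assumed here, and the names `dct_1d` and `dct_2d` are hypothetical.

```python
import math

N = 8  # the slides use an 8x8 matrix


def dct_1d(x):
    """Unnormalized DCT-II of one slice (assumed variant, not stated in the slides)."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def dct_2d(m):
    # First loop: one 1D DCT slice per row (row accesses).
    rows = [dct_1d(m[i]) for i in range(N)]
    # Second loop: one 1D DCT slice per column (column accesses).
    cols = [dct_1d([rows[i][j] for i in range(N)]) for j in range(N)]
    # cols[j][i] holds output element (i, j), so transpose back.
    return [[cols[j][i] for j in range(N)] for i in range(N)]
```

The point for the rest of the talk is the access pattern: the first pass needs eight operands from one row at a time, the second needs eight operands from one column at a time.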
6
Speeding up the execution
ISE: extend the basic processor instruction set with a new instruction, DCT_instr
CFU: assign the execution of the new instruction to a dedicated unit
7
Reasonable ISE/CFU implementation
Pseudo:
dct {
  DCT_instr(0,1,2,...,7)
  DCT_instr(8,9,10,...,15)
  ...
  DCT_instr(56,57,58,...,63)
  DCT_instr(0,8,16,...,56)
  DCT_instr(1,9,17,...,57)
  ...
  DCT_instr(7,15,23,...,63)
}
16 executions
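The operand lists of the 16 executions follow a simple pattern, sketched below. It assumes the 64 elements are numbered 8*row + column, which is what the first call, DCT_instr(0,1,2,...,7), suggests; `dct_instr_operands` is a hypothetical helper, not part of the presented design.

```python
N = 8


def dct_instr_operands():
    """Operand indices for the 16 DCT_instr executions: 8 rows, then 8 columns."""
    rows = [[N * r + c for c in range(N)] for r in range(N)]
    cols = [[N * r + c for r in range(N)] for c in range(N)]
    return rows + cols


ops = dct_instr_operands()
```

Row operands are consecutive indices, while column operands are spaced 8 apart, which is why a single linear layout cannot serve both equally well.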
8
Speeding up the execution
Memory bandwidth
  Usually limited to 2 read/write ports
  Caches, scratchpads, architecturally visible storage
  Area grows quadratically with the number of ports [ref]
  Increased latency: the new instruction cannot execute until all data is available
9
Speeding up the execution
Ideally:
  8 read and 8 write ports
  Minimum area
  Full bandwidth utilization
Could we achieve this?
10
Speeding up the execution
Minimum area
What is the minimum memory organization for 64 elements with 8 read and 8 write ports?
8 individual single-port memory arrays of 8 words each (Flip Flop)
11
Speeding up the execution
Full bandwidth utilization
Row Major Order: good for row accesses, bad for column accesses (1D DCT slice)
0,0 1,0 2,0 3,0 4,0 5,0 6,0 7,0
0,1 1,1 2,1 3,1 4,1 5,1 6,1 7,1
0,2 1,2 2,2 3,2 4,2 5,2 6,2 7,2
0,3 1,3 2,3 3,3 4,3 5,3 6,3 7,3
0,4 1,4 2,4 3,4 4,4 5,4 6,4 7,4
0,5 1,5 2,5 3,5 4,5 5,5 6,5 7,5
0,6 1,6 2,6 3,6 4,6 5,6 6,6 7,6
0,7 1,7 2,7 3,7 4,7 5,7 6,7 7,7
12
Speeding up the execution
Full bandwidth utilization
Column Major Order: good for column accesses, bad for row accesses (1D DCT slice)
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7
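The "good" and "bad" labels on the two layouts can be checked with a small conflict count. The sketch assumes that each group of eight entries in the layout figures is one single-port memory, so the row-major layout places element (i, j) in memory j and the column-major layout places it in memory i. The function names are hypothetical.

```python
N = 8  # 8x8 matrix stored in 8 single-port memories


def row_major_bank(i, j):
    return j  # a row is spread over all 8 memories


def column_major_bank(i, j):
    return i  # a column is spread over all 8 memories


def cycles(bank_of, cells):
    """Cycles needed to read `cells` if each memory serves one access per cycle."""
    hits = [0] * N
    for i, j in cells:
        hits[bank_of(i, j)] += 1
    return max(hits)


row3 = [(3, j) for j in range(N)]  # operands of one row slice
col3 = [(i, 3) for i in range(N)]  # operands of one column slice
```

Under these assumptions each layout serves one access direction in a single cycle and serializes the other direction over eight cycles.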
13
Speeding up the execution
Full bandwidth utilization
Is there a data layout that allows row and column accesses with the same latency?
  Not with the existing organization
What if we relax the requirements by ignoring the misalignment of data?
  Introduce alignment layers
  A form of register clustering that is cheap [RWTH ICCAD’07]
14
0,0 1,1 2,2 3,3 4,4 5,5 6,6 7,7
0,1 1,0 2,3 3,2 4,5 5,4 6,7 7,6
0,2 1,3 2,0 3,1 4,6 5,7 6,4 7,5
0,3 1,2 2,1 3,0 4,7 5,6 6,5 7,4
0,4 1,5 2,6 3,7 4,0 5,1 6,2 7,3
0,5 1,4 2,7 3,6 4,1 5,0 6,3 7,2
0,6 1,7 2,4 3,5 4,2 5,3 6,0 7,1
0,7 1,6 2,5 3,4 4,3 5,2 6,1 7,0
[Figure labels: 1D DCT Slice, DCT Logic, Crossbar]
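The eight groups listed above are consistent with placing element (i, j) in memory i XOR j: group k contains exactly the pairs (i, i XOR k). The sketch below models that placement together with the reordering a crossbar would have to perform. Using the row index as the address inside each memory is an assumption made for illustration, and `place`, `store` and `read_slice` are hypothetical names.

```python
N = 8


def place(i, j):
    """(memory, address) of element (i, j); the address choice is an assumption."""
    return i ^ j, i


def store(matrix):
    banks = [[None] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            b, a = place(i, j)
            banks[b][a] = matrix[i][j]
    return banks


def read_slice(banks, cells):
    """Read one slice in a single cycle; returns (values in slice order, memories used)."""
    where = [place(i, j) for i, j in cells]
    used = [b for b, _ in where]
    if len(set(used)) != len(cells):
        raise ValueError("two operands share a single-port memory")
    # The alignment layer (crossbar) routes memory used[k] to operand lane k.
    return [banks[b][a] for b, a in where], used
```

With this placement every row slice and every column slice touches each memory exactly once; only the order in which the memories feed the operand lanes changes, which is the job of the crossbar.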
15
Memory Area Comparison
[Chart; axis: Area (mm^2)]
16
Methodology
Optimizing the memory system
  Enumerate Memories
  Memory Organization
  Cost Estimation
  Data Layout: Limitedly Improper Constrained Color Assignment (LICCA)
  Alignment Layer
17
LICCA Formulation
Input: graph $G = (V, E, I)$
  Vertices: $V = \{v_0, \dots, v_{n-1}\}$
  Edges: $E = \{e_0, \dots, e_{m-1}\}$
  Set of sets of vertices: $I = \{I_0, \dots, I_{L-1}\}$
where $E = \{(v_x, v_y) \mid \exists\, I_j \in I : v_x \in I_j \text{ and } v_y \in I_j\}$
18
LICCA Formulation
Solution: an assignment of colors to vertices, i.e. a function $f: V \to \{0, \dots, k-1\}$, such that:
  At most $n_i$ vertices receive color $i$, for $0 \le i \le k-1$; that is, $|\{v \in V \mid f(v) = i\}| \le n_i$
  For each set $I_j \in I$, at most $a_i$ vertices of $I_j$ receive color $i$
Any instance of the k-colorability problem can be reduced to an instance of LICCA by setting $I = \{\{v_x, v_y\} \mid (v_x, v_y) \in E\}$ and, for $0 \le i \le k-1$, $n_i = |V|$ and $a_i = 1$
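The two constraints and the reduction from k-colorability can be written down directly. The following is a minimal sketch; `is_legal_licca` and `from_k_coloring` are hypothetical names, and vertices are represented as plain integers for brevity.

```python
def is_legal_licca(coloring, sets, capacity, per_set_limit):
    """coloring: dict vertex -> color; capacity[i] is n_i; per_set_limit[i] is a_i."""
    k = len(capacity)
    # Capacity constraint: at most n_i vertices receive color i.
    for c in range(k):
        if sum(1 for v in coloring if coloring[v] == c) > capacity[c]:
            return False
    # Per-set constraint: in every I_j, at most a_i vertices receive color i.
    for s in sets:
        for c in range(k):
            if sum(1 for v in s if coloring[v] == c) > per_set_limit[c]:
                return False
    return True


def from_k_coloring(vertices, edges, k):
    """Reduction stated on the slide: one set per edge, n_i = |V|, a_i = 1."""
    sets = [set(e) for e in edges]
    return sets, [len(vertices)] * k, [1] * k


# A triangle needs three colors; any coloring that reuses a color on an edge is rejected.
triangle_sets, cap, lim = from_k_coloring([0, 1, 2], [(0, 1), (1, 2), (0, 2)], 3)
```

With one two-element set per edge and $a_i = 1$, the per-set constraint says exactly that the endpoints of an edge get different colors, and $n_i = |V|$ makes the capacity constraint vacuous.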
19
LICCA: Relation to the problem
  An edge $e = (v_x, v_y)$ indicates that $v_x$ and $v_y$ are read in the same cycle
  Each set $I_j \in I$ is a set of vertices that are read in parallel
  $k$ is the number of memories
  $n_i$ is the capacity of the $i$-th memory
  $a_i$ is the number of read/write ports of the $i$-th memory
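As an illustration of this mapping, the 8x8 DCT from the earlier slides can be phrased as a LICCA instance: 64 vertices, one set per row slice and per column slice, and eight single-port memories of eight words each (the organization named on the "minimum area" slide). The helper `legal` is a hypothetical name, and the candidate assignments are passed in as functions of (i, j).

```python
N = 8
vertices = [(i, j) for i in range(N) for j in range(N)]
# One set per parallel access: 8 row slices followed by 8 column slices.
sets = [[(i, j) for j in range(N)] for i in range(N)] + [
    [(i, j) for i in range(N)] for j in range(N)
]
k = N               # number of memories
capacity = [N] * k  # n_i: eight words per memory
ports = [1] * k     # a_i: one port per memory


def legal(color_of):
    """Check both LICCA constraints for the assignment color_of(i, j) -> memory."""
    counts = [0] * k
    for i, j in vertices:
        counts[color_of(i, j)] += 1
    if any(counts[c] > capacity[c] for c in range(k)):
        return False
    for s in sets:
        per_set = [0] * k
        for i, j in s:
            per_set[color_of(i, j)] += 1
        if any(per_set[c] > ports[c] for c in range(k)):
            return False
    return True
```

Under these assumptions, `legal(lambda i, j: j)` fails because a column slice needs eight reads from one memory, while `legal(lambda i, j: i ^ j)` succeeds.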
20
LICCA Example
$V = \{v_0, v_1, v_2, v_3, v_4, v_5\}$
$I_0 = \{v_0, v_1, v_2\}$
$I_1 = \{v_3, v_4, v_5\}$
$I_2 = \{v_0, v_2, v_5\}$
$E = \{(v_0,v_1), (v_0,v_2), (v_0,v_5), (v_1,v_2), (v_2,v_5), (v_3,v_4), (v_3,v_5), (v_4,v_5)\}$
Legal k-coloring? Legal LICCA coloring?
[Figure: graph G on vertices v0 to v5]
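The edge set of this example can be derived mechanically from the three sets, following the definition of E in the formulation. A short sketch, writing vertex v_x as the integer x (`edges_from_sets` is a hypothetical helper):

```python
from itertools import combinations

# I0, I1, I2 from the slide, with v_x written as x.
I = [{0, 1, 2}, {3, 4, 5}, {0, 2, 5}]


def edges_from_sets(sets):
    """Every pair of vertices that appear together in some set is an edge."""
    edges = set()
    for s in sets:
        edges.update(combinations(sorted(s), 2))
    return edges


E = edges_from_sets(I)
```

The pair (0, 2) is produced by both I0 and I2 but appears only once in the resulting set, which matches the eight edges listed on the slide.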
21
LICCA Example
Sets: I0 = {v0, v1, v2}, I1 = {v3, v4, v5}, I2 = {v0, v2, v5}
Memories: M0 with n0 = 4, a0 = 2; M1 with n1 = 2, a1 = 1
[Figure: vertices v0, v4, v2, v3, v5, v1 placed in memories M1 and M0]
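Because the instance is tiny, all legal assignments can be found by exhaustive search. The sketch below uses the sets and the memory parameters from the slide (color 0 is M0, color 1 is M1). The brute-force search is only an illustration and is not claimed to be the solver used by the authors; the exact placement drawn in the figure is not recoverable from this transcript.

```python
from itertools import product

SETS = [{0, 1, 2}, {3, 4, 5}, {0, 2, 5}]  # I0, I1, I2 with v_x written as x
CAPACITY = [4, 2]  # n_0, n_1
PORTS = [2, 1]     # a_0, a_1


def legal(coloring):
    """coloring[v] is the memory assigned to vertex v."""
    k = len(CAPACITY)
    if any(coloring.count(c) > CAPACITY[c] for c in range(k)):
        return False
    return all(
        sum(1 for v in s if coloring[v] == c) <= PORTS[c]
        for s in SETS
        for c in range(k)
    )


# Enumerate all 2^6 assignments and keep the legal ones.
solutions = [c for c in product(range(2), repeat=6) if legal(c)]
```

Each set has three vertices, so two of them must share M0; this is allowed because a0 = 2, whereas a proper coloring of G would need three colors for the triangle v0, v1, v2.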
22
Comparison Example
[Block diagram labels: Baseline Processor, RF, Memory Decoder, Baseline Processor Ports, ISE Logic, AVS (single/dual-port memory or 8x8 non-clustered RF), Main Memory (DMA)]
23
Comparison Example
[Block diagram labels: Baseline Processor, RF, Memory Decoder, Baseline Processor Ports, ISE Logic, AVS (8x8 clustered RF), Alignment Layer, Alignment Layer Decoders, Main Memory (DMA)]
24
Comparison Example
2D DCT, 8x8 matrix
  DCT row/column slice vs. 2-point
  8x8 clustered RF vs. single-port memory
  150 MHz
2D FFT, 8x8 matrix
  12 butterflies vs. 1 butterfly
  8x8 clustered RF vs. single-port memory
  150 MHz
25
Comparison Example: 2D DCT, 8x8 matrix
[Chart; labels: 3x, 8x]
26
Comparison Example: 2D FFT, 8x8 matrix
[Chart; labels: 2.5x, 12x]
27
Conclusion
Methodology to efficiently increase bandwidth to AVS-enhanced ISEs
  LICCA
  Memory system optimization
Future work
  Commutativity
  LICCA extension for multiple ISEs and shift registers
28
0,0 0,2 2,0 2,2
0,1 0,3 2,1 2,3
1,0 1,2 3,0 3,2
1,1 1,3 3,1 3,3

0,0 1,2 2,0 3,2
0,1 1,3 2,1 3,3
1,0 0,2 2,2 3,0
1,1 0,3 2,3 3,1

4x4 Non-Clustered RF
2,1 0,2 2,3 3,1
0,0 1,3 2,1 0,1
2,2 3,1 1,1 0,3
3,2 1,2 2,0 3,3
29
References
30
Thank you! Questions?