Memory Organization and Data Layout for Instruction Set Extensions with Architecturally Visible Storage Panagiotis Athanasopoulos EPFL Philip Brisk UCR.

Memory Organization and Data Layout for Instruction Set Extensions with Architecturally Visible Storage
Panagiotis Athanasopoulos (EPFL), Philip Brisk (UCR), Yusuf Leblebici (EPFL), Paolo Ienne (EPFL)
École Polytechnique Fédérale de Lausanne (EPFL); University of California, Riverside (UCR)
First_name.Second_name@{epfl.ch|ucr.edu}

Motivation
- Classic challenge
  - Increase performance while keeping area and cost constrained
- Typical solutions
  - Customizable and extensible processors
  - Instruction set extension (ISE)
  - Custom functional units (CFU)
  - Architecturally visible storage (AVS)

Typical embedded application extract: 2D DCT on an 8x8 matrix
Pseudo:
dct {
  for (int i = 0; i < num_of_rows; i++) {
    1D DCT Slice
  }
  for (int j = 0; j < num_of_columns; j++) {
    1D DCT Slice
  }
}
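As a rough illustration of the two loops above, the following Python sketch applies a 1D slice to every row and then to every column. The unnormalised DCT-II formula and the names `dct_1d` and `dct_2d` are assumptions made for this example; the slides do not say which DCT variant or scaling the slice uses.

```python
import math

N = 8  # block size used on the slides


def dct_1d(x):
    """Unnormalised DCT-II of one slice (assumed variant, see lead-in)."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (2 * t + 1) * k / (2 * n)) for t in range(n))
        for k in range(n)
    ]


def dct_2d(block):
    """Row pass followed by column pass, mirroring the two loops."""
    # First loop: one 1D slice per row.
    rows = [dct_1d(row) for row in block]
    # Second loop: one 1D slice per column of the row-pass result.
    cols = [dct_1d([rows[i][j] for i in range(N)]) for j in range(N)]
    # cols[j] is transformed column j, so cols[j][i] is output element (i, j).
    return [[cols[j][i] for j in range(N)] for i in range(N)]


# A constant block should leave energy only in the top-left coefficient.
flat = [[1.0] * N for _ in range(N)]
out = dct_2d(flat)
```

The row pass reads the matrix one row at a time and the column pass reads it one column at a time, which is the access pattern the memory organization has to serve.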

Typical embedded application extract: 2D DCT on an 8x8 matrix (row accesses)
for (int i = 0; i < num_of_rows; i++) { 1D DCT Slice }
Each 1D DCT slice reads one row. Entry i,j is the data accessed in row i, column j:
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7

Typical embedded application extract: 2D DCT on an 8x8 matrix (column accesses)
for (int j = 0; j < num_of_columns; j++) { 1D DCT Slice }
Each 1D DCT slice reads one column of the same 8x8 matrix, where entry i,j is the data accessed in row i, column j.

Speeding up the execution
- ISE
  - Extend the basic processor instruction set with a new instruction: DCT_instr
- CFU
  - Assign the execution of the new instruction to a dedicated unit

Reasonable ISE/CFU implementation
Pseudo:
dct {
  DCT_instr(0, 1, 2, ..., 7)
  DCT_instr(8, 9, 10, ..., 15)
  ...
  DCT_instr(56, 57, 58, ..., 63)
  DCT_instr(0, 8, 16, ..., 56)
  DCT_instr(1, 9, 17, ..., 57)
  ...
  DCT_instr(7, 15, 23, ..., 63)
}
16 executions
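The operand lists can be generated mechanically. This sketch assumes the 64 elements are numbered linearly as 8*i + j for row i, column j, which matches the indices shown on the slide; `dct_instr_operands` is a hypothetical helper, not part of any toolchain.

```python
N = 8  # matrix dimension


def dct_instr_operands(n=N):
    """Return the 2n operand lists: n row slices, then n column slices."""
    # Row slice i touches consecutive indices i*n .. i*n + n - 1.
    row_calls = [[i * n + j for j in range(n)] for i in range(n)]
    # Column slice j touches indices j, j + n, j + 2n, ...
    col_calls = [[i * n + j for i in range(n)] for j in range(n)]
    return row_calls + col_calls


calls = dct_instr_operands()
for operands in calls:
    print("DCT_instr(" + ", ".join(map(str, operands)) + ")")
```

Eight row slices plus eight column slices give the 16 executions counted on the slide.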

Speeding up the execution
- Memory bandwidth
  - Usually limited to 2 read/write ports
    - Caches, scratchpads, architecturally visible storage
  - Area grows roughly quadratically with the number of ports [ref]
  - Increased latency: the new instruction cannot execute until all of its data is available

Speeding up the execution
- Ideally
  - 8 read and 8 write ports
  - Minimum area
  - Full bandwidth utilization
- Could we achieve this?

Speeding up the execution
- Minimum area
  - What is the minimum memory organization for 64 elements with 8 read and 8 write ports?
  - 8 individual single-port memory arrays of 8 words each (flip-flops)
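To see why eight single-port arrays come out as the minimum, one can compare a few organizations that all offer 64 words and 8 parallel accesses. The sketch below uses an illustrative cost model (area of each memory proportional to words times ports squared); that model is an assumption for this example, not the authors' cost estimator.

```python
ELEMENTS = 64  # 8x8 matrix
PARALLEL = 8   # elements that must be accessed in the same cycle


def cost(k, words, ports):
    """Assumed area model: k memories, each costing words * ports**2."""
    return k * words * ports ** 2


candidates = []
for k in (1, 2, 4, 8):
    words = ELEMENTS // k   # capacity of each memory
    ports = PARALLEL // k   # ports per memory so that k * ports == PARALLEL
    candidates.append((cost(k, words, ports), k, words, ports))

# Tuples compare by cost first, so min() picks the cheapest organization.
best = min(candidates)
print(best)
```

Under this model the single 8-ported memory is the most expensive candidate and the eight single-ported arrays the cheapest, which is the direction the slide argues.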

Speeding up the execution
- Full bandwidth utilization
Row-major order:
0,0 1,0 2,0 3,0 4,0 5,0 6,0 7,0
0,1 1,1 2,1 3,1 4,1 5,1 6,1 7,1
0,2 1,2 2,2 3,2 4,2 5,2 6,2 7,2
0,3 1,3 2,3 3,3 4,3 5,3 6,3 7,3
0,4 1,4 2,4 3,4 4,4 5,4 6,4 7,4
0,5 1,5 2,5 3,5 4,5 5,5 6,5 7,5
0,6 1,6 2,6 3,6 4,6 5,6 6,6 7,6
0,7 1,7 2,7 3,7 4,7 5,7 6,7 7,7
- Good for row accesses
- Bad for column accesses
[Figure label: 1D DCT Slice]

Speeding up the execution
- Full bandwidth utilization
Column-major order:
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7
- Good for column accesses
- Bad for row accesses
[Figure label: 1D DCT Slice]
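The trade-off between the two orders can be checked with a small conflict counter. The sketch assumes that each run of eight consecutive entries in the two listings is one single-port memory that serves one element per cycle; that reading of the flattened figures, and the names `cycles`, `row`, and `column`, are assumptions made for this example.

```python
from collections import Counter

N = 8


def cycles(bank_of, access):
    """Cycles needed when every bank can deliver one element per cycle."""
    return max(Counter(bank_of(i, j) for i, j in access).values())


def row(i):
    return [(i, j) for j in range(N)]


def column(j):
    return [(i, j) for i in range(N)]


layouts = {
    # Row-major listing: group j holds (0,j) ... (7,j), so the bank is j.
    "row-major": lambda i, j: j,
    # Column-major listing: group i holds (i,0) ... (i,7), so the bank is i.
    "column-major": lambda i, j: i,
}

for name, bank_of in layouts.items():
    worst_row = max(cycles(bank_of, row(i)) for i in range(N))
    worst_col = max(cycles(bank_of, column(j)) for j in range(N))
    print(name, "row access:", worst_row, "column access:", worst_col)
```

Each layout serves one kind of slice in a single cycle and serialises the other kind over eight cycles, because all eight elements of that slice land in the same memory.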

Speeding up the execution
- Full bandwidth utilization
  - Is there a data layout that allows row and column accesses with the same latency?
    - Not with the existing organization
  - What if we relax the requirements by ignoring the misalignment of data?
    - Introduce alignment layers
    - A form of register clustering that is cheap [RWTH ICCAD'07]

0,0 1,1 2,2 3,3 4,4 5,5 6,6 7,7
0,1 1,0 2,3 3,2 4,5 5,4 6,7 7,6
0,2 1,3 2,0 3,1 4,6 5,7 6,4 7,5
0,3 1,2 2,1 3,0 4,7 5,6 6,5 7,4
0,4 1,5 2,6 3,7 4,0 5,1 6,2 7,3
0,5 1,4 2,7 3,6 4,1 5,0 6,3 7,2
0,6 1,7 2,4 3,5 4,2 5,3 6,0 7,1
0,7 1,6 2,5 3,4 4,3 5,2 6,1 7,0
[Figure labels: 1D DCT Slice, DCT Logic, Crossbar]
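Reading the listing above as eight memories of eight words each (an assumption about the flattened figure), element (i, j) sits in memory i XOR j at word i. The check below confirms that, under this reading, every row and every column touches each memory exactly once, so either kind of slice can be fetched in one cycle and then reordered by the crossbar. The names `bank`, `address`, and `read_slice` are hypothetical.

```python
N = 8


def bank(i, j):
    """Memory holding element (i, j) under the assumed reading of the listing."""
    return i ^ j


def address(i, j):
    """Word inside that memory: the position of the entry within its group."""
    return i


# Fill eight 8-word memories with the coordinates of the element they store.
mem = [[None] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        mem[bank(i, j)][address(i, j)] = (i, j)


def read_slice(coords):
    """Fetch one slice in a single cycle; fail if two elements share a memory."""
    banks = [bank(i, j) for i, j in coords]
    assert len(set(banks)) == N, "bank conflict"
    # Alignment step: pick each memory's output in slice order.
    return [mem[bank(i, j)][address(i, j)] for i, j in coords]


rows_ok = all(
    read_slice([(i, j) for j in range(N)]) == [(i, j) for j in range(N)]
    for i in range(N)
)
cols_ok = all(
    read_slice([(i, j) for i in range(N)]) == [(i, j) for i in range(N)]
    for j in range(N)
)
print(rows_ok, cols_ok)
```

Note that a column read needs a different word address in every memory, while a row read uses the same address everywhere; the data also arrives in a permuted order, which is what the alignment layer has to undo.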

Memory Area Comparison
[Chart: area in mm²]

Methodology
- Optimizing the memory system
- Enumerate memories
- Memory organization
- Cost estimation
- Data layout: Limitedly Improper Constrained Color Assignment (LICCA)
- Alignment layer

LICCA Formulation
- Input:
  - Graph $G = (V, E, I)$
  - Vertices $V = \{v_0, \dots, v_{n-1}\}$
  - Edges $E = \{e_0, \dots, e_{m-1}\}$
  - Set of sets of vertices $I = \{I_0, \dots, I_{L-1}\}$
  - Where: $E = \{(v_x, v_y) \mid \exists I_j \in I : v_x \in I_j \text{ and } v_y \in I_j\}$

LICCA Formulation
- Solution:
  - Assignment of colors to vertices
    - Any function $f: V \to \{0, \dots, k-1\}$
  - A maximum of $n_i$ vertices can receive color $i$, $0 \le i \le k-1$; that is, $|\{v \in V \mid f(v) = i\}| \le n_i$
  - For each set $I_j \in I$, at most $a_i$ vertices can receive color $i$
- Any instance of the k-colorability problem can be reduced to an instance of LICCA by setting $I = \{\{v_x, v_y\} \mid (v_x, v_y) \in E\}$ and, for $0 \le i \le k-1$, $n_i = |V|$ and $a_i = 1$
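The two constraints translate directly into a validity check. The following is a minimal sketch under the formulation above; `is_licca_coloring` and `kcolor_as_licca` are hypothetical names, and the triangle graph is an illustrative instance that does not come from the slides.

```python
from collections import Counter


def is_licca_coloring(f, sets, n, a):
    """f maps vertex -> color; n[i] caps color i overall, a[i] caps it per set."""
    totals = Counter(f.values())
    if any(totals[c] > n[c] for c in totals):
        return False  # some color is used by more than n[c] vertices
    for s in sets:
        per_set = Counter(f[v] for v in s)
        if any(per_set[c] > a[c] for c in per_set):
            return False  # some set holds more than a[c] vertices of color c
    return True


def kcolor_as_licca(vertices, edges, k):
    """Reduction from the slide: one 2-vertex set per edge, n_i = |V|, a_i = 1."""
    sets = [set(e) for e in edges]
    return sets, [len(vertices)] * k, [1] * k


# Illustrative instance: a triangle, which needs three colors.
tri_v = [0, 1, 2]
tri_e = [(0, 1), (1, 2), (0, 2)]
sets, n, a = kcolor_as_licca(tri_v, tri_e, 3)
proper = is_licca_coloring({0: 0, 1: 1, 2: 2}, sets, n, a)
clash = is_licca_coloring({0: 0, 1: 0, 2: 1}, sets, n, a)
print(proper, clash)
```

With a_i = 1 and one set per edge, the per-set limit forbids equal colors on adjacent vertices, so the check reduces to ordinary proper coloring, as the reduction states.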

LICCA Relation to the problem
- An edge $e = (v_x, v_y)$ indicates that $v_x$ and $v_y$ are read in the same cycle
- Each set of vertices $I_j \in I$ is a set of vertices that are read in parallel
- $k$ is the number of memories
- $n_i$ is the capacity of the i-th memory
- $a_i$ is the number of read/write ports of the i-th memory

LICCA Example
- $V = \{v_0, v_1, v_2, v_3, v_4, v_5\}$
- $I_0 = \{v_0, v_1, v_2\}$
- $I_1 = \{v_3, v_4, v_5\}$
- $I_2 = \{v_0, v_2, v_5\}$
- $E = \{(v_0,v_1), (v_0,v_2), (v_0,v_5), (v_1,v_2), (v_2,v_5), (v_3,v_4), (v_3,v_5), (v_4,v_5)\}$
- Legal k-coloring?
- Legal LICCA coloring?
[Figure: graph G on vertices v_0 to v_5]
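The edge set of this example can be rebuilt from the three sets, and a brute-force search shows how many colors an ordinary proper coloring would need. This is a sketch for the six-vertex example only; `proper_colorable` is a hypothetical helper and exhaustive search is used purely because the instance is tiny.

```python
from itertools import combinations, product

V = ["v0", "v1", "v2", "v3", "v4", "v5"]
I = [{"v0", "v1", "v2"}, {"v3", "v4", "v5"}, {"v0", "v2", "v5"}]

# Two vertices are connected when some set contains both of them.
E = sorted({tuple(sorted(pair)) for s in I for pair in combinations(s, 2)})


def proper_colorable(k):
    """True if some coloring with k colors gives every edge two different colors."""
    for colors in product(range(k), repeat=len(V)):
        f = dict(zip(V, colors))
        if all(f[x] != f[y] for x, y in E):
            return True
    return False


print(E)
print(proper_colorable(2), proper_colorable(3))
```

Each set of three vertices forms a triangle in G, so two colors cannot give a proper coloring; LICCA can still succeed with two colors when a memory has more than one port.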

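LICCA Example
[Figure: the sets $I_0 = \{v_0, v_1, v_2\}$, $I_1 = \{v_3, v_4, v_5\}$, $I_2 = \{v_0, v_2, v_5\}$, and the vertices (drawn in the order v_0, v_4, v_2, v_3, v_5, v_1) placed in two memories: $M_1$ with $n_1 = 2$, $a_1 = 1$ and $M_0$ with $n_0 = 4$, $a_0 = 2$]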
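With only six vertices and two memories, every assignment can be enumerated and tested against the capacities and port counts given in the figure. The sketch below does that; the particular assignment named `figure_guess` is one possible reading of the flattened figure and is an assumption, not a confirmed transcription.

```python
from collections import Counter
from itertools import product

V = ["v0", "v1", "v2", "v3", "v4", "v5"]
I = [{"v0", "v1", "v2"}, {"v3", "v4", "v5"}, {"v0", "v2", "v5"}]
n = [4, 2]  # capacities of M0 and M1
a = [2, 1]  # ports of M0 and M1


def valid(f):
    """True if f respects every capacity n[i] and every per-set port limit a[i]."""
    totals = Counter(f.values())
    if any(totals[c] > n[c] for c in totals):
        return False
    return all(
        count <= a[c]
        for s in I
        for c, count in Counter(f[v] for v in s).items()
    )


# Try every mapping of the six vertices onto the two memories.
solutions = []
for colors in product(range(len(n)), repeat=len(V)):
    f = dict(zip(V, colors))
    if valid(f):
        solutions.append(f)

# Assumed reading of the figure: v0, v2, v3, v4 in M0 and v1, v5 in M1.
figure_guess = {"v0": 0, "v1": 1, "v2": 0, "v3": 0, "v4": 0, "v5": 1}
print(len(solutions), figure_guess in solutions)
```

Exhaustive search is only practical for toy instances like this one; it is shown here to make the constraints concrete, not as the method used in the work.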

Comparison Example
[Block diagram: Baseline Processor, RF, Baseline Processor Ports, ISE Logic, AVS (single/dual-port memory or 8x8 non-clustered RF), Memory Decoder, Main Memory (DMA)]

Comparison Example
[Block diagram: Baseline Processor, RF, Baseline Processor Ports, ISE Logic, AVS (8x8 clustered RF), Memory Decoder, Main Memory (DMA), Alignment Layer, Alignment Layer, Decoders]

Comparison Example
- 2D DCT, 8x8 matrix
  - DCT row/column slice vs. 2-point
  - 8x8 clustered RF vs. single-port memory
  - 150 MHz
- 2D FFT, 8x8 matrix
  - 12 butterflies vs. 1 butterfly
  - 8x8 clustered RF vs. single-port memory
  - 150 MHz

Comparison Example
- 2D DCT, 8x8 matrix
[Chart annotations: 3x, 8x]

Comparison Example
- 2D FFT, 8x8 matrix
[Chart annotations: 2.5x, 12x]

Conclusion
- Methodology to efficiently increase bandwidth to AVS-enhanced ISEs
  - LICCA
  - Memory system optimization
Future Work
- Commutativity
- LICCA extension for multiple ISEs and shift registers

0,0 0,2 2,0 2,2
0,1 0,3 2,1 2,3
1,0 1,2 3,0 3,2
1,1 1,3 3,1 3,3

0,0 1,2 2,0 3,2
0,1 1,3 2,1 3,3
1,0 0,2 2,2 3,0
1,1 0,3 2,3 3,1

[Figure label: 4x4 Non-Clustered RF]
2,1 0,2 2,3 3,1
0,0 1,3 2,1 0,1
2,2 3,1 1,1 0,3
3,2 1,2 2,0 3,3

References

Thank you! Questions?
