Memory Organization and Data Layout for Instruction Set Extensions with Architecturally Visible Storage Panagiotis Athanasopoulos EPFL Philip Brisk UCR.

Presentation transcript:

Slide 1: Memory Organization and Data Layout for Instruction Set Extensions with Architecturally Visible Storage
Panagiotis Athanasopoulos (EPFL), Philip Brisk (UCR), Yusuf Leblebici (EPFL), Paolo Ienne (EPFL)
École Polytechnique Fédérale de Lausanne (EPFL); University of California, Riverside (UCR)

Slide 2: Motivation
- Classic challenge
  - Increase performance while keeping area/cost constrained
- Typical solutions
  - Customizable and extensible processors
  - Instruction set extension (ISE)
  - Custom functional units (CFU)
  - Architecturally visible storage (AVS)

Slide 3: Typical embedded application extract: 2D DCT on an 8x8 matrix
Pseudocode:
    dct {
        for (int i = 0; i < num_of_rows; i++) {
            ... 1D DCT Slice ...
        }
        for (int j = 0; j < num_of_columns; j++) {
            ... 1D DCT Slice ...
        }
    }

Slide 4: Typical embedded application extract: 2D DCT on an 8x8 matrix (row accesses)
    for (int i = 0; i < num_of_rows; i++) { ... 1D DCT Slice ... }
Entry i,j is the data accessed in row i, column j. Each 1D DCT Slice performs row accesses:
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7

Slide 5: Typical embedded application extract: 2D DCT on an 8x8 matrix (column accesses)
    for (int j = 0; j < num_of_columns; j++) { ... 1D DCT Slice ... }
Entry i,j is the data accessed in row i, column j. Each 1D DCT Slice performs column accesses:
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7
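The two loops above describe a separable transform: a pass of 1D slices over the rows followed by a pass over the columns. The following is a minimal Python sketch of that structure. The slides do not say which DCT variant or scaling the "1D DCT Slice" uses, so the unnormalized DCT-II in `dct_1d` is an assumption, and the names `dct_1d` and `dct_2d` are illustrative only.

```python
import math

N = 8  # matrix dimension used on the slides


def dct_1d(x):
    """Unnormalized DCT-II of a sequence (assumed stand-in for one '1D DCT Slice')."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (2 * t + 1) * k / (2 * n)) for t in range(n))
        for k in range(n)
    ]


def dct_2d(m):
    """Row pass followed by column pass, mirroring the two loops of the pseudocode."""
    # First loop: one slice per row (row accesses).
    rows = [dct_1d(row) for row in m]
    # Second loop: one slice per column (column accesses).
    cols = [dct_1d([rows[i][j] for i in range(N)]) for j in range(N)]
    # cols[j][i] holds output element (i, j); transpose back to row/column indexing.
    return [[cols[j][i] for j in range(N)] for i in range(N)]
```

The row pass reads eight elements that share a row index, while the column pass reads eight elements that share a column index, which is why both access patterns matter for the storage that feeds the slice.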

Slide 6: Speeding up the execution
- ISE
  - Extend the basic processor instruction set with a new instruction: DCT_instr
- CFU
  - Assign the execution of the new instruction to a dedicated unit

Slide 7: Reasonable ISE/CFU implementation
Pseudocode:
    dct {
        DCT_instr(0, 1, 2, ..., 7)
        DCT_instr(8, 9, 10, ..., 15)
        ...
        DCT_instr(56, 57, 58, ..., 63)
        DCT_instr(0, 8, 16, ..., 56)
        DCT_instr(1, 9, 17, ..., 57)
        ...
        DCT_instr(7, 15, 23, ..., 63)
    }
16 executions
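The operand lists of the 16 `DCT_instr` executions can be generated mechanically. This sketch assumes the operands are linear indices into the 8x8 matrix stored in row-major order (index = 8 * row + column), which is what the listed values 0..7 and 0, 8, 16, ... suggest. The helper names are hypothetical.

```python
N = 8  # matrix dimension


def row_operands(r):
    """Linear indices of the eight elements in row r (assumed row-major indexing)."""
    return [N * r + c for c in range(N)]


def column_operands(c):
    """Linear indices of the eight elements in column c."""
    return [N * r + c for r in range(N)]


# Eight row executions followed by eight column executions.
schedule = [row_operands(r) for r in range(N)] + [column_operands(c) for c in range(N)]
```

Under this indexing the operands of column 1 run 1, 9, 17, ..., 57, and the last column execution covers 7, 15, 23, ..., 63.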

Slide 8: Speeding up the execution
- Memory bandwidth
  - Usually limited to 2 read/write ports
    - Caches, scratchpads, architecturally visible storage
  - Area grows quadratically with the number of ports [ref]
  - The new instruction cannot execute until all of its data is available, which increases its latency

Slide 9: Speeding up the execution
- Ideally
  - 8 read and 8 write ports
  - Minimum area
  - Full bandwidth utilization
- Can we achieve this?

Slide 10: Speeding up the execution
- Minimum area
  - What is the minimum memory organization for 64 elements with 8 read and 8 write ports?
    - 8 individual single-port memory arrays, each with a capacity of 8 words (flip-flop based)

Slide 11: Speeding up the execution
- Full bandwidth utilization
Row Major Order (one group of eight per line, in slide order; figure label: 1D DCT Slice):
0,0 1,0 2,0 3,0 4,0 5,0 6,0 7,0
0,1 1,1 2,1 3,1 4,1 5,1 6,1 7,1
0,2 1,2 2,2 3,2 4,2 5,2 6,2 7,2
0,3 1,3 2,3 3,3 4,3 5,3 6,3 7,3
0,4 1,4 2,4 3,4 4,4 5,4 6,4 7,4
0,5 1,5 2,5 3,5 4,5 5,5 6,5 7,5
0,6 1,6 2,6 3,6 4,6 5,6 6,6 7,6
0,7 1,7 2,7 3,7 4,7 5,7 6,7 7,7
- Good for row accesses
- Bad for column accesses

Slide 12: Speeding up the execution
- Full bandwidth utilization
Column Major Order (one group of eight per line, in slide order; figure label: 1D DCT Slice):
0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7
2,0 2,1 2,2 2,3 2,4 2,5 2,6 2,7
3,0 3,1 3,2 3,3 3,4 3,5 3,6 3,7
4,0 4,1 4,2 4,3 4,4 4,5 4,6 4,7
5,0 5,1 5,2 5,3 5,4 5,5 5,6 5,7
6,0 6,1 6,2 6,3 6,4 6,5 6,6 6,7
7,0 7,1 7,2 7,3 7,4 7,5 7,6 7,7
- Good for column accesses
- Bad for row accesses
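The "good for / bad for" labels on the two layout slides can be checked with a small conflict count. This is a sketch under one assumption that the transcript does not state explicitly: each group of eight elements listed on those slides sits in one memory array with a fixed number of ports, so an access takes as many cycles as the busiest array needs. All function names here are illustrative.

```python
from collections import Counter

N = 8  # matrix dimension


def cycles(access, bank_of, ports=1):
    """Cycles needed to fetch every (i, j) in `access` when each bank has `ports` ports."""
    per_bank = Counter(bank_of(i, j) for i, j in access)
    # Ceiling division: a bank holding `count` requested elements needs count/ports cycles.
    return max(-(-count // ports) for count in per_bank.values())


def row(i):
    return [(i, j) for j in range(N)]


def column(j):
    return [(i, j) for i in range(N)]


def bank_row_major_slide(i, j):
    """Grouping on the 'Row Major Order' slide: a group holds (0,j) .. (7,j)."""
    return j


def bank_column_major_slide(i, j):
    """Grouping on the 'Column Major Order' slide: a group holds (i,0) .. (i,7)."""
    return i
```

With single-port arrays, `cycles(row(3), bank_row_major_slide)` needs one cycle while `cycles(column(3), bank_row_major_slide)` needs eight, and the situation is mirrored for the other layout. Neither grouping serves both access patterns at full bandwidth.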

Slide 13: Speeding up the execution
- Full bandwidth utilization
  - Is there a data layout that allows row and column accesses with the same latency?
    - Not with the existing organization
  - What if we relax the requirements by ignoring the misalignment of data?
    - Introduce alignment layers
    - A form of register clustering that is cheap! [RWTH ICCAD'07]

Slide 14: Data layout (one group of eight per line, in slide order; figure labels: 1D DCT Slice, DCT Logic, Crossbar)
0,0 1,1 2,2 3,3 4,4 5,5 6,6 7,7
0,1 1,0 2,3 3,2 4,5 5,4 6,7 7,6
0,2 1,3 2,0 3,1 4,6 5,7 6,4 7,5
0,3 1,2 2,1 3,0 4,7 5,6 6,5 7,4
0,4 1,5 2,6 3,7 4,0 5,1 6,2 7,3
0,5 1,4 2,7 3,6 4,1 5,0 6,3 7,2
0,6 1,7 2,4 3,5 4,2 5,3 6,0 7,1
0,7 1,6 2,5 3,4 4,3 5,2 6,1 7,0
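The eight groups listed on this slide are consistent with placing element (i, j) in group number `i XOR j`. That is an observation about the listed values, not a statement of how the authors derived the layout. The sketch below reproduces the groups under that assumption and shows why a reordering stage (the crossbar in the figure) is then needed: the elements of a row or column arrive from the groups in a permuted order. Using the row index as the word address inside a group is a further assumption, and the function names are hypothetical.

```python
N = 8  # matrix dimension


def bank(i, j):
    """Group (memory) index for element (i, j), assuming the XOR pattern seen on the slide."""
    return i ^ j


def layout():
    """Contents of each group, indexed as banks[bank][word]; word = row index (assumption)."""
    banks = [[None] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            banks[bank(i, j)][i] = (i, j)
    return banks


def row_permutation(i):
    """For row i, the group that supplies column position j (what the crossbar must undo)."""
    return [bank(i, j) for j in range(N)]


def column_permutation(j):
    """For column j, the group that supplies row position i."""
    return [bank(i, j) for i in range(N)]
```

Every row and every column touches each of the eight groups exactly once, so with single-port arrays both access patterns complete in one cycle, at the cost of the permutation performed by the alignment layer.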

Slide 15: Memory Area Comparison
[Figure: memory area comparison chart; axis: area in mm²]

Slide 16: Methodology
- Optimizing the memory system
  - Enumerate Memories
  - Memory Organization
  - Cost Estimation
  - Data Layout: Limitedly Improper Constrained Color Assignment (LICCA)
  - Alignment Layer

Slide 17: LICCA Formulation
- Input:
  - Graph $G = (V, E, I)$
  - Vertices $V = \{v_0, \dots, v_{n-1}\}$
  - Edges $E = \{e_0, \dots, e_{m-1}\}$
  - Set of sets of vertices $I = \{I_0, \dots, I_{L-1}\}$
  - Where: $E = \{(v_x, v_y) \mid \exists I_j \in I : v_x \in I_j \text{ and } v_y \in I_j\}$

Slide 18: LICCA Formulation
- Solution:
  - An assignment of colors to vertices, i.e. a function $f: V \to \{0, \dots, k-1\}$, such that:
  - At most $n_i$ vertices receive color $i$, $0 \le i \le k-1$; that is, $|\{v \in V \mid f(v) = i\}| \le n_i$
  - For each set $I_j \in I$, at most $a_i$ vertices receive color $i$; that is, $|\{v \in I_j \mid f(v) = i\}| \le a_i$
  - Any instance of the k-colorability problem can be reduced to an instance of LICCA by setting $I = \{\{v_x, v_y\} \mid (v_x, v_y) \in E\}$ and, for $0 \le i \le k-1$, $n_i = |V|$ and $a_i = 1$

Slide 19: LICCA Relation to the problem
- Relation to the problem:
  - An edge $e = (v_x, v_y)$ indicates that $v_x$ and $v_y$ are read in the same cycle
  - Each set of vertices $I_j \in I$ is a set of vertices that are read in parallel
  - $k$ is the number of memories
  - $n_i$ is the capacity of the $i$-th memory
  - $a_i$ is the number of read/write ports of the $i$-th memory
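The two LICCA constraints translate directly into a legality check on a candidate data layout. The following is a minimal sketch, not the authors' implementation. The function name and argument shapes are assumptions: `f` maps each vertex to a memory index, `sets` is the collection of parallel-access sets, and `capacity` and `ports` hold the per-memory capacity and port count.

```python
from collections import Counter


def is_legal_licca(f, sets, capacity, ports):
    """Check a coloring f (vertex -> memory index) against the LICCA constraints.

    capacity[i] is the capacity of memory i; ports[i] is its number of ports.
    """
    k = len(capacity)
    # Every color must name one of the k memories.
    if any(not 0 <= color < k for color in f.values()):
        return False
    # Capacity constraint: memory i stores at most capacity[i] vertices.
    stored = Counter(f.values())
    if any(stored[i] > capacity[i] for i in stored):
        return False
    # Port constraint: within each parallel-access set, memory i serves at most ports[i] vertices.
    for group in sets:
        used = Counter(f[v] for v in group)
        if any(used[i] > ports[i] for i in used):
            return False
    return True
```

Setting every capacity to the number of vertices, every port count to 1, and using the edges as the access sets turns this check into an ordinary proper-coloring test, which is the reduction from k-colorability described in the formulation.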

Slide 20: LICCA Example
- $V = \{v_0, v_1, v_2, v_3, v_4, v_5\}$
- $I_0 = \{v_0, v_1, v_2\}$
- $I_1 = \{v_3, v_4, v_5\}$
- $I_2 = \{v_0, v_2, v_5\}$
- $E = \{(v_0, v_1), (v_0, v_2), (v_0, v_5), (v_1, v_2), (v_2, v_5), (v_3, v_4), (v_3, v_5), (v_4, v_5)\}$
- Legal k-coloring?
- Legal LICCA coloring?
[Figure: graph G on the vertices $v_0, \dots, v_5$]

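Slide 21: LICCA Example
[Figure: the vertices $v_0, \dots, v_5$, the sets $I_0 = \{v_0, v_1, v_2\}$, $I_1 = \{v_3, v_4, v_5\}$, $I_2 = \{v_0, v_2, v_5\}$, and the vertices (listed as $v_0, v_4, v_2, v_3, v_5, v_1$) placed in two memories: $M_0$ with $n_0 = 4$, $a_0 = 2$ and $M_1$ with $n_1 = 2$, $a_1 = 1$]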
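The example is small enough to settle by exhaustive search. The sketch below enumerates all assignments of the six vertices to memories and keeps the legal ones. The sets and the memory parameters (capacities 4 and 2, ports 2 and 1) come from the slides. The specific placement used in the usage note, M0 = {v0, v2, v3, v4} and M1 = {v1, v5}, is one reading of the figure and should be treated as an assumption. The names `legal` and `solve` are hypothetical.

```python
from itertools import product

NUM_VERTICES = 6  # v0 .. v5
ACCESS_SETS = [{0, 1, 2}, {3, 4, 5}, {0, 2, 5}]  # I0, I1, I2 from the slide


def legal(f, capacity, ports):
    """f is a tuple: f[v] is the memory (color) assigned to vertex v."""
    k = len(capacity)
    # Capacity constraint per memory.
    if any(f.count(i) > capacity[i] for i in range(k)):
        return False
    # Port constraint per access set and memory.
    return all(
        sum(1 for v in group if f[v] == i) <= ports[i]
        for group in ACCESS_SETS
        for i in range(k)
    )


def solve(capacity, ports):
    """Brute force over all k^6 assignments; fine for an example this small."""
    k = len(capacity)
    return [f for f in product(range(k), repeat=NUM_VERTICES) if legal(f, capacity, ports)]


# LICCA instance from the slide: M0 (capacity 4, 2 ports), M1 (capacity 2, 1 port).
licca_solutions = solve(capacity=[4, 2], ports=[2, 1])
# Classic proper coloring via the reduction: capacity |V| and one port per color.
proper_2 = solve(capacity=[6, 6], ports=[1, 1])
proper_3 = solve(capacity=[6, 6, 6], ports=[1, 1, 1])
```

The assignment `(0, 1, 0, 0, 0, 1)`, which puts v1 and v5 in M1 and the rest in M0, appears in `licca_solutions`. `proper_2` is empty because each access set forms a triangle in G, so a classic coloring needs at least three colors, while two memories suffice once M0 has two ports.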

Slide 22: Comparison Example
[Figure: block diagram; labels: Baseline Processor, RF, Baseline Processor Ports, ISE Logic, AVS (single/dual-port memory or 8x8 non-clustered RF), Memory Decoder, Main Memory (DMA)]

Slide 23: Comparison Example
[Figure: block diagram; labels: Baseline Processor, RF, Baseline Processor Ports, ISE Logic, AVS (8x8 clustered RF), Alignment Layer, Alignment Layer Decoders, Memory Decoder, Main Memory (DMA)]

Slide 24: Comparison Example
- 2D DCT 8x8 Matrix
  - DCT row/column Slice vs. 2-point
  - 8x8 Clustered RF vs. single-port memory
  - 150 MHz
- 2D FFT 8x8 Matrix
  - 12 butterflies vs. 1 butterfly
  - 8x8 Clustered RF vs. single-port memory
  - 150 MHz

Slide 25: Comparison Example
- 2D DCT 8x8 Matrix
[Figure: results chart annotated with 3x and 8x]

Slide 26: Comparison Example
- 2D FFT 8x8 Matrix
[Figure: results chart annotated with 2.5x and 12x]

Slide 27: Conclusion
- Methodology to efficiently increase bandwidth to AVS-enhanced ISEs
  - LICCA
  - Memory System Optimization
Future Work
- Commutativity
- LICCA extension for multiple ISEs and shift registers

Slide 28: [Backup figure: 4x4 data layouts; entries in slide order, four per line]
0,0 0,2 2,0 2,2
0,1 0,3 2,1 2,3
1,0 1,2 3,0 3,2
1,1 1,3 3,1 3,3

0,0 1,2 2,0 3,2
0,1 1,3 2,1 3,3
1,0 0,2 2,2 3,0
1,1 0,3 2,3 3,1

4x4 Non-Clustered RF
2,1 0,2 2,3 3,1
0,0 1,3 2,1 0,1
2,2 3,1 1,1 0,3
3,2 1,2 2,0 3,3

Slide 29: References

Slide 30: Thank you! Questions?