Power Driven Processor Array Partitioning for FPGA SoC. S. Derrien, S. Rajopadhye. Séminaire COSI ’01, Roscoff.

Presentation transcript:

Slide 1: Power Driven Processor Array Partitioning for FPGA SoC
S. Derrien, S. Rajopadhye

Slide 2: Content
- Context and motivations
  - Silicon compilation tools
  - Target architectures
  - Power consumption
  - Related work
- Partitioning
- Modeling power
- Experimental results
- Conclusion

Slide 3: Silicon compilation tools
- Parallel processor array architectures
  - Regular and scalable (well suited to FPGAs)
  - Specialized high-performance datapath
- Restricted class of loops
  - SUREs (uniform dependencies)
  - Static polyhedral loop domain
- Compute-intensive nested loops
  - Image processing (motion estimation, stereo vision)
  - Signal processing (QR factorization, DLMS)

Slide 4: Power consumption
- General model and motivations
  - P = Pstat + Vdd.Cd.Df (gate-level model)
  - Estimate at RTL level (entropy-based models)
- Mainly dictated by:
  - On-chip area cost and activity
  - Off-chip I/O volume
- System-level power model?
  - Estimate from specs and target architecture

Slide 5: Target architecture
[Block diagram: FPGA, CPU, system memory, external world]
- Embedded CPU
  - PowerPC
  - Nios
- SoC bus
  - AMBA, CoreConnect
  - Plug-and-play IP cores
- Shared memory
  - Low latency
  - High bandwidth

Slide 6: Related work
- Compiler transformations to reduce memory accesses [Kandemir]
  - Loop fusion
  - Loop tiling
  - Loop reordering
- Design space exploration for custom memory systems [Imec]
  - Systematic exploration
  - Multi-level memory hierarchy
  - The approach is brute force

Slide 7: Content
- Context and motivations
- Target architectures
- Partitioning
  - Clustering (LSGP)
  - Tiling (LPGS)
  - Co-partitioning
- Modeling power
- Experimental results
- Conclusion

Slide 8: Tiling (LPGS)
- Partition the PE array into tiles
  - Tiles are executed sequentially
  - Intermediate results are stored in off-chip memory
  - Requires unidirectional communications
- Tile shape is rectangular
  - Boundaries parallel to the PE space basis vectors
  - Perfect tiling of the processor space
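The tiling step can be illustrated with a minimal sketch. It assumes a 2-D PE array whose extents are exact multiples of the tile sizes (the "perfect tiling" condition); the function name `tile_pe_array` and the 4x6 array with 2x3 tiles are hypothetical choices for illustration, not values from the slides.

```python
from itertools import product


def tile_pe_array(array_shape, tile_shape):
    """Split a rectangular PE array into rectangular tiles.

    Tiles are returned in the order they would be executed sequentially;
    the PEs listed inside one tile are the ones active at the same time.
    """
    for extent, size in zip(array_shape, tile_shape):
        if extent % size != 0:
            raise ValueError("tile sizes must divide the array extents")

    tiles_per_axis = [extent // size for extent, size in zip(array_shape, tile_shape)]
    tiles = []
    for tile_index in product(*(range(n) for n in tiles_per_axis)):
        # Origin of this tile in PE space; tile edges follow the basis vectors.
        origin = [t * s for t, s in zip(tile_index, tile_shape)]
        pes = [
            tuple(o + d for o, d in zip(origin, offset))
            for offset in product(*(range(s) for s in tile_shape))
        ]
        tiles.append(pes)
    return tiles


# Hypothetical example: a 4x6 virtual PE array cut into 2x3 tiles.
tiles = tile_pe_array((4, 6), (2, 3))
print(len(tiles), "tiles of", len(tiles[0]), "PEs each")
```

Each tile holds the product of the per-axis tile sizes, which is the determinant of the diagonal tiling matrix mentioned on the next slide.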

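Slide 9: Tiling (LPGS)
[Figure: tiling example with per-axis tile parameters 2 and 3 (symbols lost in extraction). The tiling matrix is diagonal and its determinant equals N_pe. The domain height is labeled.]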

Slide 10: Clustering (LSGP)
- Regroups PEs into clusters
  - Operations are executed sequentially
  - I/O accesses are reduced
- Cluster shape is rectangular
  - Boundaries parallel to the PE space basis vectors
  - Perfect tiling of the processor space
- Scheduling is axes-major
  - Several possible schedulings
  - Sequence of clusterings along each axis
  - Simplifies control logic

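Slide 11: Clustering (LSGP)
[Figure: clustering example with parameter values 3 and 2 along y (symbols lost in extraction). The clustering matrix is diagonal, its determinant equals N_pe, and the cluster size is the product of the per-axis parameters. Labels: PE index vector, iteration index vector, original space-time mapping.]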

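Slide 12: Clustering (LSGP)
[Figure: an original PE next to clustered PEs, first with an x parameter of 2, then with x and y parameters of 2 and 3 (symbols lost in extraction). The slide ends with a resource usage estimate whose formula was not captured in the transcript.]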

Slide 13: Hybrid partitioning
- Step 1: the array is tiled
  - Tunes the I/O volume
- Step 2: each tile is clustered
  - Tunes the resource usage
- Trade-off
  - Off-chip I/O volume
  - Local memory sizes
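The two-step scheme can be sketched as an index mapping. This is an illustrative assumption rather than the authors' formulation: each virtual PE coordinate is split into a tile index (executed sequentially, step 1), a physical PE index inside the tile, and a local slot inside the cluster handled by that physical PE (step 2). The name `copartition` and all sizes are hypothetical.

```python
def copartition(pe, tile_shape, cluster_shape):
    """Map a virtual PE index to (tile, physical PE, local slot).

    tile_shape    : virtual PEs covered by one tile, per axis
    cluster_shape : virtual PEs folded onto one physical PE, per axis
    """
    tile, physical, slot = [], [], []
    for coord, t, c in zip(pe, tile_shape, cluster_shape):
        if t % c != 0:
            raise ValueError("cluster sizes must divide the tile sizes")
        tile.append(coord // t)          # step 1: which tile (sequential)
        within_tile = coord % t
        physical.append(within_tile // c)  # step 2: which physical PE
        slot.append(within_tile % c)       # position inside the cluster
    return tuple(tile), tuple(physical), tuple(slot)


# Hypothetical sizes: 4x6 tiles, each folded into 2x3 clusters,
# so every tile runs on a 2x2 array of physical PEs.
print(copartition((7, 10), (4, 6), (2, 3)))
```

In this sketch, larger tiles mean fewer tile boundaries to spill to off-chip memory, while larger clusters mean fewer physical PEs but more local storage per PE, which is the trade-off listed above.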

Slide 14: Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
  - IO power model
  - Core power model
  - Putting it all together
- Experimental results
- Conclusion

Slide 15: Dynamic IO energy model
- IO energy depends on
  - IO volume (RAM clock speed)
  - Operation (read, write)
  - Port toggle rate
  - E_io = K_rd.V_rd + K_wr.V_wr
    (V_wr: number of write I/O operations; K_rd, K_wr: technological constants)
- Determine the IO volume
  - For all loop variables
  - Given tiling parameters
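A minimal sketch of the IO energy formula, evaluated for two candidate partitionings. The constants and volumes below are hypothetical placeholders chosen for easy arithmetic, not measured values.

```python
def io_energy(v_rd, v_wr, k_rd, k_wr):
    """Dynamic IO energy: E_io = K_rd * V_rd + K_wr * V_wr."""
    if min(v_rd, v_wr) < 0:
        raise ValueError("IO volumes must be non-negative")
    return k_rd * v_rd + k_wr * v_wr


# Hypothetical technology constants (energy per read / per write access).
K_RD = 2.0
K_WR = 3.0

# Hypothetical read/write volumes for two tiling choices.
config_a = io_energy(v_rd=1000, v_wr=500, k_rd=K_RD, k_wr=K_WR)
config_b = io_energy(v_rd=400, v_wr=200, k_rd=K_RD, k_wr=K_WR)
print(config_a, config_b)
```

Because the model is linear in the volumes, comparing two tilings only requires their read and write volumes once the constants have been calibrated.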

Slide 16: IO volume estimate (1/2)
- Tile IO volume is called the "footprint"
  - Estimate for this footprint [Arg95]
  - Spread vector of dependencies
  - [Formula lost in extraction; annotated "substituting the i-th row with the spread vector"]

Slide 17: IO volume estimate (1/2)
- Total tile IO volume:
  - [Formula lost in extraction; callouts: k-th variable byte width, number of variables, tile size parameter, spread vector]
- Example:
  - Variable A: d_A = [1 0 0], a_A = [1 0 0], l_A = 2, V_A = 2.H.[symbol lost]1
  - Variable B: d_B = [0 1 0], a_B = [1 0 0], l_B = 2, V_B = 2.H.[symbols lost]
  - Variable C: d_C = [0 0 1], a_C = [1 0 0], l_C = 4, V_C = [symbols lost]

Slide 18: Core power model (1/4)
- FPGA power dissipation model
  - P_core = P_stat + K_c.D_lc.n_lc.f
  - Not suited to our target FPGA architecture
- Distinction between LCs (memory and logic)
  - P_core = P_stat + K_c.D_lc.n_lc.f + K_m.D_m.n_m.f
  - K: technology constant; D: average toggle rate; n: number of logic cells; f: design operating frequency
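The refined core power formula can be written out as a small function. This is a sketch only: the parameter names mirror the slide's symbols, and the numeric values in the example are hypothetical and unitless.

```python
def core_power(p_stat, k_c, d_lc, n_lc, k_m, d_m, n_m, freq):
    """P_core = P_stat + K_c*D_lc*n_lc*f + K_m*D_m*n_m*f.

    Logic cells used as logic and logic cells used as distributed memory
    get separate technology constants and toggle rates.
    """
    logic_term = k_c * d_lc * n_lc * freq
    memory_term = k_m * d_m * n_m * freq
    return p_stat + logic_term + memory_term


# Hypothetical values chosen so the arithmetic is easy to follow:
# logic term = 2.0 * 0.5 * 10 * 1.0 = 10.0
# memory term = 1.0 * 0.25 * 8 * 1.0 = 2.0
power = core_power(p_stat=0.5, k_c=2.0, d_lc=0.5, n_lc=10,
                   k_m=1.0, d_m=0.25, n_m=8, freq=1.0)
print(power)
```

Setting `n_m` to zero recovers the first, single-term model on the slide.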

Slide 19: Core power model (2/4)
- Control logic is not modeled
  - Too complex to estimate
  - No significant contribution to power
- Core power depends on
  - Number of PEs: depends on two partitioning parameters (symbols lost in extraction)
  - Area usage for each PE: depends on one partitioning parameter (symbol lost in extraction)
  - Average toggle rate for the PE datapath and local memory (application constant)

Slide 20: Core power model (3/4)
- Memory resource usage
  - LCs used as distributed memory (16x1 bits)
  - Datapath is a design constant (library based)
- Area cost for a PE array
  - [Formula lost in extraction; callouts: clustering parameter along processor space j, register width along processor space k, datapath functional cost, number of PEs]

Slide 21: Core power model (4/4)
- Energy cost for the whole loop nest
  - We have E_c = P_c.n_cycle.T_cycle
  - We will consider n_cycle = V_calc/n_p
- Total core energy cost
  - [Formula lost in extraction; callouts: total loop computation volume, average toggle rate]
  - Energy does not depend on n_p!
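The slide's total energy formula did not survive extraction, but the independence from n_p can be illustrated with a short derivation under stated assumptions. Here static power is neglected, all PEs are assumed identical, T_cycle is taken as 1/f, and a_pe (cells per PE) is a hypothetical symbol introduced only for this sketch.

```latex
\begin{align*}
E_c &= P_c \, n_{cycle} \, T_{cycle},
  \qquad n_{cycle} = \frac{V_{calc}}{n_p},
  \qquad T_{cycle} = \frac{1}{f} \\
P_c &\approx n_p \, K \, D \, a_{pe} \, f
  \quad \text{(assumption: dynamic power only, identical PEs)} \\
E_c &\approx n_p \, K \, D \, a_{pe} \, f \cdot \frac{V_{calc}}{n_p} \cdot \frac{1}{f}
  = K \, D \, a_{pe} \, V_{calc}
\end{align*}
```

Under these assumptions n_p cancels: doubling the number of PEs doubles the power but halves the cycle count. A static term would contribute P_stat.V_calc/(n_p.f) and so would not cancel, so the result applies to the dynamic part only.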

Slide 22: Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
- Experimental results
  - Model validation
  - Extrapolations
- Conclusion

Slide 23: IO power model results

Slide 24: Core power model results

Slide 25: System power model

Slide 26: Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
- Experimental results
- Conclusion
  - Solving the optimization problem (Lagrange multipliers)
  - Custom cache for embedded CPUs
  - Extension to SAREs (affine dependencies)

Slide 27: Conclusion
- Model matches experiments
  - Cheap measurement setup
  - Many components contribute to current dissipation (LEDs, PCI, etc.)
- Observations
  - Trade-off evolves with technology
  - More sensitive for ASICs?

Slide 28: Future work (1/2)
- Formulation of the optimization problem
  - Minimize energy per iteration
  - Constraints on performance and area
- Analytical solution?
  - Lagrange multipliers
  - No closed form for n > 3
  - But fast numerical methods exist

Slide 29: Future work (2/2)
- Model for embedded CPUs
  - Trade-off between cache size and memory accesses
  - Determine the optimal cache size and associated tiling parameters
- Extension to SAREs?
  - Affine dependencies
  - More general loops