Published by Shanna Ellis. Modified over 9 years ago.
1
Séminaire COSI-Roscoff ’01
Power Driven Processor Array Partitioning for FPGA SoC
S. Derrien, S. Rajopadhye
2
Content
- Context and motivations
  - Silicon compilation tools
  - Target architectures
  - Power consumption
  - Related work
- Partitioning
- Modeling power
- Experimental results
- Conclusion
3
Silicon compilation tools
- Parallel processor array architectures
  - Regular and scalable (well suited to FPGAs)
  - Specialized high-performance datapath
- Restricted class of loops
  - SUREs (systems of uniform recurrence equations, i.e. uniform dependencies)
  - Static polyhedral loop domain
- Compute-intensive nested loops
  - Image processing (motion estimation, stereo vision)
  - Signal processing (QR factorization, DLMS)
4
Power consumption
- General model and motivations
  - $P = P_{stat} + V_{dd}^2 \cdot C_d \cdot D \cdot f$ (gate-level model)
  - Estimate at RTL level (entropy-based models)
- Mainly dictated by:
  - On-chip area cost and activity
  - Off-chip I/O volume
- System-level power model?
  - Estimate from the specification and the target architecture
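The gate-level model on this slide can be evaluated directly. The following is a minimal sketch, assuming the usual CMOS form in which the supply voltage is squared and reading the last factor as toggle rate times clock frequency. The function name and parameter names are illustrative and do not come from the talk.

```python
def gate_level_power(p_stat, vdd, c_d, toggle_rate, freq):
    """Static power plus dynamic switching power, in watts.

    Assumed form: P = P_stat + Vdd^2 * C_d * D * f, where c_d is the
    switched capacitance (farads), toggle_rate the average number of
    transitions per cycle, and freq the clock frequency (Hz).
    """
    return p_stat + vdd ** 2 * c_d * toggle_rate * freq


# Halving the toggle rate halves only the dynamic term.
busy = gate_level_power(0.1, 2.5, 1e-9, 0.5, 50e6)
idle = gate_level_power(0.1, 2.5, 1e-9, 0.25, 50e6)
print(busy, idle)
```

Only the dynamic term reacts to activity, which is why the deck treats area, activity and I/O volume as the quantities to tune.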
5
Target architecture
[Diagram: FPGA, CPU, system memory, external world]
- Embedded CPU
  - PowerPC
  - Nios
- SoC bus
  - AMBA, CoreConnect
  - Plug ’n play IP cores
- Shared memory
  - Low latency
  - High bandwidth
6
Related work
- Compiler transformations to reduce memory accesses [Kandemir]
  - Loop fusion
  - Loop tiling
  - Loop reordering
- Design space exploration for custom memory systems [Imec]
  - Systematic exploration
  - Multi-level memory hierarchy
  - The approach is brute force
7
Content
- Context and motivations
- Target architectures
- Partitioning
  - Clustering (LSGP)
  - Tiling (LPGS)
  - Co-partitioning
- Modeling power
- Experimental results
- Conclusion
8
Tiling (LPGS)
- Partition the PE array into tiles
  - Tiles are executed sequentially
  - Intermediate results are stored in off-chip memory
  - Requires unidirectional communications
- Tile shape is rectangular
  - Boundaries parallel to the PE space basis vectors
  - Perfect "tiling" of the processor space
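A rectangular tiling of this kind amounts to an integer division of each PE coordinate by the tile size: the quotient selects the tile (executed one after another), the remainder selects the physical PE inside the tile. The sketch below illustrates that mapping under the slide's "perfect tiling" assumption. The helper names `lpgs_map` and `lpgs_tiles` are hypothetical and are not part of the authors' tool.

```python
from itertools import product


def lpgs_map(pe, tile_shape):
    """Map a virtual PE index vector to (tile index, physical PE index)."""
    tile = tuple(p // t for p, t in zip(pe, tile_shape))
    local = tuple(p % t for p, t in zip(pe, tile_shape))
    return tile, local


def lpgs_tiles(array_shape, tile_shape):
    """List the tiles in the order in which they would be executed."""
    if any(n % t for n, t in zip(array_shape, tile_shape)):
        raise ValueError("tile shape must divide the array shape exactly")
    counts = [n // t for n, t in zip(array_shape, tile_shape)]
    return list(product(*(range(c) for c in counts)))


print(lpgs_map((5, 4), (2, 3)))
print(len(lpgs_tiles((6, 6), (2, 3))))
```

The physical array has the size of one tile, so enlarging the tile trades more on-chip PEs for fewer passes through off-chip memory.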
9
Tiling (LPGS)
[Figure: tiling example with tile parameters 2 and 3; labels: diagonal matrix, det| | = N_pe, domain height]
10
Clustering (LSGP)
- Regroups PEs into clusters
  - Operations are executed sequentially
  - I/O accesses are reduced
- Cluster shape is rectangular
  - Boundaries parallel to the PE space basis vectors
  - Perfect "tiling" of the processor space
- Scheduling is axes-major
  - Several possible schedules
  - Sequence of clusterings along each axis
  - Simplifies control logic
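Clustering uses the same integer division as tiling but swaps the roles of quotient and remainder: the quotient is the physical PE that emulates the cluster, and the remainder becomes a sequential time slot inside that PE. The sketch below assumes one particular axes-major order (first axis varying fastest); the slide notes that several schedules are possible, and the name `lsgp_map` is hypothetical.

```python
from itertools import product


def lsgp_map(pe, cluster_shape):
    """Map a virtual PE index vector to (physical PE index, time slot)."""
    physical = tuple(p // s for p, s in zip(pe, cluster_shape))
    local = tuple(p % s for p, s in zip(pe, cluster_shape))
    # Axes-major ordering: linearize the local index, first axis fastest.
    slot, stride = 0, 1
    for l, s in zip(local, cluster_shape):
        slot += l * stride
        stride *= s
    return physical, slot


# Every virtual PE of one 2x3 cluster gets a distinct slot on PE (0, 0).
slots = sorted(lsgp_map(pe, (2, 3))[1] for pe in product(range(2), range(3)))
print(slots)
```

Because the slot is a plain mixed-radix counter, the control logic reduces to nested counters, which matches the slide's remark that this scheduling simplifies control.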
11
Clustering (LSGP)
[Figure: clustering example with y parameters 3 and 2; labels: diagonal matrix, det| | = N_pe, PE index vector, iteration index vector, original space-time mapping]
12
Clustering (LSGP)
[Figure: original PE versus clustered PE for x = 2 and for x = 2, y = 3, with the resource usage estimate]
13
Hybrid partitioning
- Step 1: the array is tiled
  - Tunes the I/O volume
- Step 2: each tile is clustered
  - Tunes the resource usage
- Trade-off
  - Off-chip I/O volume
  - Local memory sizes
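The two steps nest: the array shape is divided by the tile shape, and the tile shape is divided by the cluster shape. A minimal sketch of the resulting counts follows, assuming all three shapes divide each other exactly. The function `co_partition` and its result keys are illustrative names, not taken from the talk.

```python
from math import prod


def co_partition(array_shape, tile_shape, cluster_shape):
    """Count tiles, physical PEs and emulated PEs for a tile-then-cluster split."""
    for n, t, c in zip(array_shape, tile_shape, cluster_shape):
        if n % t or t % c:
            raise ValueError("array, tile and cluster shapes must nest exactly")
    return {
        # Tiles run one after another and exchange data through off-chip memory.
        "tiles": prod(n // t for n, t in zip(array_shape, tile_shape)),
        # Physical PEs actually instantiated on the FPGA.
        "physical_pes": prod(t // c for t, c in zip(tile_shape, cluster_shape)),
        # Virtual PEs emulated by each physical PE; this drives local memory size.
        "virtual_pes_per_physical_pe": prod(cluster_shape),
    }


print(co_partition((12, 12), (6, 6), (2, 3)))
```

The product of the three counts always equals the size of the virtual array, so the designer is only choosing how to distribute it between off-chip passes, on-chip PEs and local storage.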
14
Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
  - IO power model
  - Core power model
  - Putting it all together
- Experimental results
- Conclusion
15
Dynamic IO energy model
- IO energy depends on
  - IO volume (RAM clock speed)
  - Operation (Rd, Wr)
  - Port toggle rate
- $E_{io} = K_{rd} \cdot V_{rd} + K_{wr} \cdot V_{wr}$
  - $V_{rd}$, $V_{wr}$: number of read and write I/O operations
  - $K_{rd}$, $K_{wr}$: technological constants
- Determine the IO volume
  - For all loop variables
  - Given the tiling parameters
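The IO energy formula is a weighted sum of read and write volumes. The sketch below assumes that the total volumes are obtained by summing per-variable volumes, which is one reading of "for all loop variables"; the function name and the sample numbers are placeholders, not measured values.

```python
def io_energy(k_rd, read_volumes, k_wr, write_volumes):
    """E_io = K_rd * V_rd + K_wr * V_wr.

    read_volumes and write_volumes hold one I/O operation count per
    loop variable; k_rd and k_wr are energy-per-operation constants.
    """
    v_rd = sum(read_volumes)
    v_wr = sum(write_volumes)
    return k_rd * v_rd + k_wr * v_wr


# Placeholder constants and volumes, for illustration only.
print(io_energy(2.0, [10, 20], 3.0, [5]))
```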
16
IO volume estimate (1/2)
- The tile IO volume is called the "footprint"
  - Estimate for this footprint [Arg95]
  - Uses the spread vector of the dependencies
  - [Formula not recovered; annotation: substituting the i-th row with the spread vector]
17
IO volume estimate (2/2)
- Total tile IO volume:
  - [Formula not recovered; annotations: byte width of the k-th variable, number of variables, tile size parameter, spread vector]
- Example:
  - $d_A = [1\ 0\ 0]$, $a_A = [1\ 0\ 0]$, $l_A = 2$, $V_A = 2 \cdot H \cdot (\ldots)_1$
  - $d_B = [0\ 1\ 0]$, $a_B = [1\ 0\ 0]$, $l_B = 2$, $V_B = 2 \cdot H \cdot \ldots$
  - $d_C = [0\ 0\ 1]$, $a_C = [1\ 0\ 0]$, $l_C = 4$, $V_C = \ldots$
18
Core power model (1/4)
- FPGA power dissipation model
  - $P_{core} = P_{stat} + K_c \cdot D_{lc} \cdot n_{lc} \cdot f$
  - Not suited to our target FPGA architecture
- Distinction between LCs used as memory and LCs used as logic
  - $P_{core} = P_{stat} + K_c \cdot D_{lc} \cdot n_{lc} \cdot f + K_m \cdot D_m \cdot n_m \cdot f$
- Notation
  - $K_c$, $K_m$: technology constants
  - $D_{lc}$, $D_m$: average toggle rates
  - $n_{lc}$, $n_m$: numbers of logic cells
  - $f$: design operating frequency
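The refined model simply adds a second dynamic term for logic cells used as memory. A minimal sketch, with illustrative parameter names mirroring the symbols on the slide and placeholder numeric values:

```python
def core_power(p_stat, k_c, d_lc, n_lc, k_m, d_m, n_m, f):
    """P_core = P_stat + K_c*D_lc*n_lc*f + K_m*D_m*n_m*f."""
    logic = k_c * d_lc * n_lc * f      # cells used as logic
    memory = k_m * d_m * n_m * f       # cells used as distributed memory
    return p_stat + logic + memory


# Setting the memory constant to zero recovers the single-term model.
print(core_power(1.0, 2.0, 0.5, 10, 3.0, 0.5, 4, 2.0))
print(core_power(1.0, 2.0, 0.5, 10, 0.0, 0.5, 4, 2.0))
```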
19
Core power model (2/4)
- Control logic is not modeled
  - Too complex to estimate
  - No significant contribution to power
- Core power depends on
  - Number of PEs: depends on … and …
  - Area usage of each PE: depends on …
  - Average toggle rate of the PE datapath and local memory (application constant)
20
Core power model (3/4)
- Memory resource usage
  - LCs used as distributed memory (16x1 bits)
  - Datapath cost is a design constant (library based)
- Area cost for a PE array
  - [Formula not recovered; annotations: clustering parameter along processor space axis j, register width along processor space axis k, datapath functional cost, number of PEs]
21
Core power model (4/4)
- Energy cost for the whole loop nest
  - We have $E_c = P_c \cdot n_{cycle} \cdot T_{cycle}$
  - We consider $n_{cycle} = V_{calc} / n_p$, where $V_{calc}$ is the total loop computation volume
- Total core energy cost
  - [Formula not recovered; annotations: total loop computation volume, average toggle rate]
  - Energy does not depend on $n_p$
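The independence from $n_p$ can be checked from the definitions on this slide once two assumptions are made explicit, neither of which is stated in the source: the static term is neglected, and cell counts scale linearly with the number of PEs, $n_{lc} = n_p \cdot a_{lc}$ and $n_m = n_p \cdot a_m$, where $a_{lc}$ and $a_m$ are hypothetical per-PE cell counts. With $T_{cycle} = 1/f$:

```latex
\begin{align*}
E_c &= P_c \cdot n_{cycle} \cdot T_{cycle} \\
    &= \left(K_c D_{lc}\, n_p a_{lc} + K_m D_m\, n_p a_m\right) f
       \cdot \frac{V_{calc}}{n_p} \cdot \frac{1}{f} \\
    &= V_{calc} \left(K_c D_{lc}\, a_{lc} + K_m D_m\, a_m\right)
\end{align*}
```

Both $n_p$ and $f$ cancel in the dynamic part. If the static term were kept, it would contribute $P_{stat} \cdot V_{calc} / (n_p \cdot f)$, which does shrink as more PEs are used.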
22
Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
- Experimental results
  - Model validation
  - Extrapolations
- Conclusion
23
IO power model results
24
Core power model results
25
System power model
26
Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
- Experimental results
- Conclusion
  - Solving the optimization problem (Lagrange multipliers)
  - Custom cache for embedded CPUs
  - Extension to SAREs (affine dependencies)
27
Conclusion
- The model matches the experiments
  - Cheap measurement setup
  - Many components contribute to the measured current (LEDs, PCI, etc.)
- Observations
  - The trade-off evolves with technology
  - More sensitive for ASICs?
28
Future work (1/2)
- Formulation of the optimization problem
  - Minimize energy per iteration
  - Constraints on performance and area
- Analytical solution?
  - Lagrange multipliers
  - No closed form for n > 3, but fast numerical methods exist
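For small integer parameter ranges, the constrained problem stated on this slide can also be explored by plain exhaustive search. This is a different technique from the Lagrange multiplier approach the slide proposes, shown only to make the formulation concrete. The cost functions are passed in as callables, and the ones used in the example are arbitrary placeholders, not the models from the talk.

```python
from itertools import product


def minimize_energy(param_ranges, energy, area, cycles, max_area, max_cycles):
    """Return (cost, params) minimizing energy under area and cycle bounds.

    Returns None when no parameter combination satisfies the constraints.
    """
    best = None
    for params in product(*param_ranges):
        if area(params) > max_area or cycles(params) > max_cycles:
            continue  # violates the area or performance constraint
        cost = energy(params)
        if best is None or cost < best[0]:
            best = (cost, params)
    return best


# Placeholder models over two hypothetical parameters p[0] and p[1].
toy_energy = lambda p: 100.0 / p[0] + 2.0 * p[1]
toy_area = lambda p: p[0] * p[1]
toy_cycles = lambda p: 64 / p[1]

best = minimize_energy(
    [range(1, 9), range(1, 9)], toy_energy, toy_area, toy_cycles,
    max_area=16, max_cycles=32,
)
print(best)
```

The search cost grows with the product of the range sizes, which is one reason an analytical or fast numerical method is attractive for larger design spaces.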
29
Future work (2/2)
- Model for embedded CPUs
  - Trade-off between cache size and memory accesses
  - Determine the optimal cache size and the associated tiling parameters
- Extension to SAREs?
  - Affine dependencies
  - More general loops