Published by Lincoln Doten. Modified over 9 years ago.
1
Quantitative Evaluation of MPSoC with Many Accelerators: Challenges and Potential Solutions
Nasibeh Teimouri, Hamed Tabkhi, Gunar Schirner
Summer 2014
2
Outline
Heterogeneous MPSoCs
  – Specialization is a growing trend
Accelerator-rich MPSoC architecture
  – MPSoCs with many accelerators
Previous work
Quantitative exploration of current accelerator-rich MPSoC
  - Huge memory demand
  - Immense communication traffic
  - Overwhelming synchronization
The proposed accelerator-centric architecture template
  - Implementation
  - Evaluation
4
Heterogeneous MPSoCs
  – Integrated solutions for a group of evolving markets
ILP (e.g. CPU, DSP, or even GPU)
  + Flexibility
  - Power dissipation
Custom-HW accelerators (ACCs) for compute-intensive kernels
  + Power efficiency
  - Cost
  - Inflexibility
What is the trend?
5
Specialization as an MPSoC trend
Increasing demand for high-performance, low-power computing
  – Market examples:
    Embedded vision
    Software-Defined Radio (SDR)
    Cyber-Physical Systems (CPS)
  – Tens of billions of operations per second
  – Less than a few watts of power
Trend: domain-specific specialization
  – Proliferating number of ACCs in systems
ACC-rich MPSoC
7
Principles of current accelerator-rich MPSoCs
ILP + HW-ACC composition
  – HW-ACC
    Executes compute-intensive kernels/apps
  – ILP
    Executes the remaining applications
    Orchestrates HW-ACCs and coordinates data movement
  – On-chip scratchpad memory (SPM)
    Keeps data exchanged between ILP and ACCs on-chip
    – Avoids costly off-chip memory accesses
1. Input Done
2. DMA Start
3. DMA Done
4. DMA Start
5. DMA Done
6. ACC1 Start
7. ACC1 Done
8. DMA Start
9. DMA Done
10. DMA Start
11. DMA Done
12. Output Start
13. Output Done
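The 13-step sequence above can be sketched as a small event generator. This is only an illustration: the slide shows a single accelerator, so the extension to `n_acc` accelerators (two DMA transfers in front of every ACC and two more before the output) is an assumption, and the function name `orchestration_events` is hypothetical.

```python
def orchestration_events(n_acc):
    """Ordered events the ILP issues or receives for one job.

    Assumption (not stated on the slide): every accelerator in the chain
    is preceded by two DMA transfers, as for ACC1 in the 13-step list.
    """
    events = ["Input Done"]
    for k in range(1, n_acc + 1):
        # Two DMA transfers bring the job into ACC k's input SPM.
        events += ["DMA Start", "DMA Done"] * 2
        events += [f"ACC{k} Start", f"ACC{k} Done"]
    # Two DMA transfers move the result to the output interface's SPM.
    events += ["DMA Start", "DMA Done"] * 2
    events += ["Output Start", "Output Done"]
    return events


if __name__ == "__main__":
    for step, name in enumerate(orchestration_events(1), start=1):
        print(f"{step}. {name}")
```

With one accelerator the generator reproduces the 13 steps of the slide; under the stated assumption, each additional accelerator adds six more events for the ILP to handle per job.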
8
MPSoC with many accelerators
Scratch Pad Memory (SPM)
  - 2 per accelerator, 1 per I/O
  - To hold input job
Control and interrupt lines
  - ACC configuration
Centralized vs. dedicated DMA
  - Stream data transfer
9
Challenges with increasing number of interrupts
1- Memory requirement
  - Two SPMs per ACC
  - One SPM per interface
  - Shared memory to hold data handed over between the accelerators
2- High volume of traffic over the system fabric
  - No point-to-point connections between ACCs
  - DMA data transfers required
3- ILP synchronization
  - Among accelerators, I/O interfaces, and DMA transfers
11
Previous work on composing ACCs
Composing bigger applications out of many accelerators, as in Accelerator-Rich CMPs [1] and CHARM [2]
  – Imposes considerable traffic and considerable on-chip buffers for accelerator data exchange
  – ILP load to orchestrate the system composed of accelerators
1. J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman. Architecture support for accelerator-rich CMPs. In Proceedings of the 49th Annual Design Automation Conference, DAC '12, pages 843–849, 2012.
2. M. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks. The accelerator store framework for high-performance, low-power accelerator-based systems. Computer Architecture Letters, 9(2):53–56, Feb. 2010.
13
Quantitative exploration of accelerator-rich MPSoC: why and how
Applicability of quantitative exploration
  – Quantifying the potential challenges
  – Exposing the ACC-rich bottlenecks as the # of ACCs increases
  – Helping system architects properly size system knobs (SPM sizes, # of ACCs, communication BW)
  – Motivating our proposed architecture-template solution
Approaches to quantitative exploration
  1- First-order mathematical analysis
  2- Simulation-based analysis of ACC-rich MPSoC
14
Exploration overview
Assumptions
  – One HD-resolution frame as input
    Divided into smaller jobs
  – Memory on chip
    Avoid off-chip memory for now
Exploration steps
  – Memory requirement as #ACC increases
  – Sizing SPM to satisfy the memory budget limitation
  – Interrupt rate load on ILP
15
Memory size analysis (calculation based)
Memory size = SPMs + shared memory
  SPM holds one job
  Shared memory holds all jobs exchanged among ACCs
Job size determines the minimum size of the SPMs and the shared memory
  More ACCs require more memory
  A bigger job needs more memory
Sizing the job with respect to the memory budget
  Limiting the memory budget
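The relation "memory size = SPMs + shared memory" can be written as a first-order calculation. The sketch below uses the SPM counts given earlier in the deck (two per ACC, one per I/O interface). The number of jobs held in shared memory is not given on the slides, so `shared_jobs` is a hypothetical parameter, as are the function name and the example values.

```python
def on_chip_memory_bytes(n_acc, n_io, job_bytes, shared_jobs):
    """First-order on-chip memory requirement in bytes.

    n_acc       -- number of accelerators (two SPMs each, per the slides)
    n_io        -- number of I/O interfaces (one SPM each, per the slides)
    job_bytes   -- size of one job; every SPM holds exactly one job
    shared_jobs -- jobs resident in shared memory (assumed, not from the slides)
    """
    spm_count = 2 * n_acc + n_io
    return (spm_count + shared_jobs) * job_bytes


if __name__ == "__main__":
    # Illustrative values only: 4 ACCs, 2 I/O interfaces, 32 KiB jobs.
    total = on_chip_memory_bytes(n_acc=4, n_io=2, job_bytes=32 * 1024, shared_jobs=5)
    print(total / 1024, "KiB")
```

Because every term is multiplied by the job size, the requirement grows linearly with both the number of accelerators and the job size, which is why a fixed memory budget forces smaller jobs as accelerators are added.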
16
Job sizing (calculation based)
The smaller the memory, the smaller the job size
The more accelerators, the smaller the job size
A smaller job size issues more interrupts to the ILP
  - The ILP is responsible for synchronizing ACC transactions
  - Count the number of interrupts
  - Measure the ILP's ability to respond to interrupts
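A minimal sketch of the job-sizing argument follows: fix a memory budget, pick the largest job that fits, then count how many jobs one frame produces. Restricting job sizes to powers of two, the 2 bytes per pixel, and the 15 job slots in the example are assumptions for illustration, and both function names are hypothetical.

```python
import math


def largest_job_bytes(budget_bytes, job_slots):
    """Largest power-of-two job size so that job_slots jobs fit the budget.

    job_slots is the number of job-sized buffers on chip (SPMs plus
    shared-memory entries); the power-of-two restriction is an assumption.
    """
    per_slot = budget_bytes // job_slots
    if per_slot < 1:
        raise ValueError("budget too small for the requested number of slots")
    return 1 << (per_slot.bit_length() - 1)


def jobs_per_frame(frame_bytes, job_bytes):
    """Number of jobs needed to cover one frame."""
    return math.ceil(frame_bytes / job_bytes)


if __name__ == "__main__":
    frame = 1920 * 1080 * 2          # HD frame, 2 bytes/pixel (assumed)
    job = largest_job_bytes(1 << 20, job_slots=15)
    print(job, jobs_per_frame(frame, job))
```

Halving the job size roughly doubles the number of jobs per frame, and every job triggers its own round of synchronization events, so the interrupt load on the ILP rises as the job shrinks.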
17
Simulation platform
Using SpecC SLDL to develop a simulation model
  – Scalable # of ACCs
    » Different/same data rate
  – ILPs
  – DMAs
  – Memories (SPM, shared memory)
    » On-chip and off-chip memory
Generating the ACC-rich simulation model
  – BFM AMBA-AHB communication fabric
  – ARM 9 (ISA v6) for ILP execution
    Priority based
  – Dedicated interrupt line
  – Centralized DMAs
SCE refinement
18
# of interrupts when scaling #ACC (simulation based)
# of interrupts vs. the number of accelerators
  – For different sizes of on-chip memory
More interrupts to the ILP with smaller job size
  - Significant utilization or even oversaturation of the ILP only because of driving accelerators

        Memory Size (MB)
#ACC    0.5     1       4       8
1       64K     128K    512K    1M
4       32K     64K     256K    512K
9       16K     32K     128K    256K
18      8K      16K     64K     128K
34      4K      8K      32K     64K
60      2K      4K      16K     32K
19
Communication overhead analysis (calculation based)
Communication overhead = data exchanged through the system fabric
More ACCs, heavier traffic on the system fabric
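As a rough illustration of why traffic grows with the number of accelerators, the sketch below counts bytes crossing the fabric for one frame. It assumes that data is moved by DMA twice on every hop (into shared memory and out again), that a chain of `n_acc` accelerators has `n_acc + 1` hops, and that each accelerator outputs as many bytes as it consumes. None of these figures are stated on this slide, and the function name is hypothetical.

```python
def fabric_traffic_bytes(frame_bytes, n_acc, transfers_per_hop=2):
    """Bytes crossing the system fabric to push one frame through n_acc ACCs.

    Assumes every byte is moved transfers_per_hop times before the first
    ACC, between consecutive ACCs, and after the last ACC, and that ACC
    output size equals input size.
    """
    hops = n_acc + 1
    return frame_bytes * transfers_per_hop * hops


if __name__ == "__main__":
    frame = 1920 * 1080 * 2  # HD frame, 2 bytes/pixel (assumed)
    for n in (1, 4, 9):
        print(n, fabric_traffic_bytes(frame, n) / 2**20, "MiB")
```

Under these assumptions the traffic is independent of the job size and grows linearly with the number of accelerators, since the same frame data crosses the fabric again at every hop.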
20
Exploration summary
Problems associated with the current accelerator-rich architecture
  – On-chip memory requirements
  – ILP synchronization load
  – Heavy communication traffic on the system fabric
Demand for an improved ACC-centric design
  – Tackling the challenges of the current ACC-rich architecture
22
The goals of the proposed ACC-centric architecture
The proposed solution
  – An autonomous accelerator chain
    Relieves the ILP's synchronization load
  – Point-to-point connections between accelerators
    No need for a larger SPM per accelerator
    No frequent DMA data transfers
    No heavy traffic on the system fabric
23
Simulation platform
Modifying the developed SpecC model to support an autonomous chain of accelerators
  – Gateways to manage the chain
Creating another ACC-rich simulation model
  – BFM AMBA-AHB communication fabric
  – ARM 9 (ISA v6) for ILP execution
  – Dedicated interrupt line from gateways to ILP
  – Centralized DMA
SCE refinement
24
The proposed accelerator-centric architecture template
Gateways controlled by the ILP manage the whole chain of accelerators
  – SPM to receive/send data from/to memory
  – Control lines from the ILP to the gateways for configuration
  – Interrupt lines from the gateways to the ILP
  – Point-to-point connections in the chain with a small buffer in between
  – The chain works independently of the ILP
Point-to-point accelerator connections
  Low memory requirement
  Few DMA data transfers
Autonomous ACC chain
  Light ILP synchronization load no matter how many accelerators
1. DMA brings data to the input gateway's SPM
2. Input gateway receives data and starts to pass it through the chain
3. Chain works on the data
4. Output gateway gathers data in its SPM
5. DMA brings data to memory
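The five steps above describe a streaming pipeline, which can be mimicked in software with chained generators. This is an analogy only, not the SpecC model used in the work: the function names, the chunk size standing in for the small inter-ACC buffer, and the lambda "kernels" are all hypothetical.

```python
def input_gateway(spm_data, chunk):
    """Step 2: stream the job out of the input SPM in small chunks."""
    for i in range(0, len(spm_data), chunk):
        yield spm_data[i:i + chunk]


def accelerator(stream, kernel):
    """Step 3: one ACC, consuming from its predecessor point to point."""
    for block in stream:
        yield [kernel(x) for x in block]


def output_gateway(stream):
    """Step 4: gather the processed chunks in the output SPM."""
    out = []
    for block in stream:
        out.extend(block)
    return out


def run_chain(job, kernels, chunk=4):
    """Steps 1 and 5 (the DMA transfers) are represented by the call itself."""
    stream = input_gateway(job, chunk)
    for kernel in kernels:
        stream = accelerator(stream, kernel)
    return output_gateway(stream)


if __name__ == "__main__":
    print(run_chain(list(range(6)), [lambda x: x + 1, lambda x: x * 2]))
```

The caller, playing the role of the ILP, is involved only when the job enters and leaves the chain. Adding another stage to `kernels` changes neither the number of calls it makes nor the size of the buffers between stages, which mirrors the claim that synchronization load and memory stay nearly constant as accelerators are added.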
25
Evaluation
More ACCs:
  Current arch: exponential growth in interrupts
  Proposed arch: the same number of interrupts
More ACCs:
  Current arch: heavier traffic
  Proposed arch: almost the same data traffic
More ACCs:
  Current arch: smaller job
  Proposed arch: almost the same job
More ACCs:
  Current arch: linear growth in memory requirement
  Proposed arch: almost constant memory requirement
26
Summary
Specialization as a growing trend in CMPs
  – Accelerator-rich architectures
Exploration of the challenges in the current accelerator-rich architecture
  – Memory requirement
  – Communication overhead
  – Synchronization load
The proposed accelerator-centric architecture template
  – Autonomous accelerator chain
    No large memory requirement
    No heavy communication traffic
    No critical amount of required synchronization
27
Questions?
Thanks again to Professor Schirner for all his support, thanks to Hamed for what I've been learning from him, and thank you to all ESL members for your attendance!