Quantitative Evaluation of MPSoC with Many Accelerators; Challenges and Potential Solutions
Nasibeh Teimouri, Hamed Tabkhi, Gunar Schirner
Summer 2014

Outline
- Heterogeneous MPSoCs: specialization is a growing trend
- Accelerator-rich MPSoC architecture: MPSoCs with many accelerators
- Previous work
- Quantitative exploration of the current accelerator-rich MPSoC: huge memory demand, immense communication traffic, overwhelming synchronization load
- The proposed accelerator-centric architecture template: implementation and evaluation

Heterogeneous MPSoCs
- Integrated solutions for a group of evolving markets
- ILP processors (e.g. CPU, DSP, or even GPU): flexible, but pay in power dissipation
- Custom hardware accelerators (ACCs) for compute-intensive kernels: power efficient, but costly and inflexible
- What is the trend?

Specialization as an MPSoC trend
- Increasing demand for high-performance, low-power computing
  - Market examples: embedded vision, Software Defined Radio (SDR), Cyber-Physical Systems (CPS)
  - Tens of billions of operations per second at less than a few watts of power
- Trend: domain-specific specialization, with a proliferating number of ACCs per system → ACC-rich MPSoC

Principles of the current accelerator-rich MPSoC
- ILP + HW-ACC composition
  - HW-ACCs execute the compute-intensive kernels/applications
  - The ILP executes the remaining applications, orchestrates the HW-ACCs, and coordinates data movement
- On-chip scratchpad memory (SPM) keeps the data exchanged between the ILP and the ACCs on-chip, avoiding costly off-chip memory accesses
- Per-job control sequence (single ACC): 1. input done, 2. DMA start, 3. DMA done, 4. DMA start, 5. DMA done, 6. ACC1 start, 7. ACC1 done, 8. DMA start, 9. DMA done, 10. DMA start, 11. DMA done, 12. output start, 13. output done (see the sketch below, from the ILP's perspective)
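The control sequence above is easiest to read as the interrupt-driven state machine the ILP walks through for every job. The sketch below is illustrative only (not the authors' code): it assumes the two back-to-back DMA transfers stage each job through the shared on-chip memory, and dma_start()/acc_start() are placeholder stubs rather than a real driver API.

```c
/*
 * Illustrative sketch only: the slide's 13-step sequence, seen from the ILP
 * as an interrupt-driven state machine.  Assumes the back-to-back DMA
 * transfers stage each job through the shared on-chip memory.
 */
#include <stdio.h>

typedef enum {
    WAIT_INPUT,      /* 1:     input interface has filled its SPM      */
    WAIT_DMA_IN_A,   /* 2-3:   DMA input SPM   -> shared memory        */
    WAIT_DMA_IN_B,   /* 4-5:   DMA shared mem  -> ACC1 SPM             */
    WAIT_ACC,        /* 6-7:   ACC1 start / ACC1 done                  */
    WAIT_DMA_OUT_A,  /* 8-9:   DMA ACC1 SPM    -> shared memory        */
    WAIT_DMA_OUT_B,  /* 10-11: DMA shared mem  -> output SPM           */
    WAIT_OUTPUT,     /* 12-13: output interface drains its SPM         */
    JOB_DONE
} ilp_state_t;

static void dma_start(const char *from, const char *to) { printf("DMA %s -> %s\n", from, to); }
static void acc_start(int id)                           { printf("start ACC%d\n", id); }

/* Invoked once per completion interrupt; issues the next command. */
static ilp_state_t on_irq(ilp_state_t s)
{
    switch (s) {
    case WAIT_INPUT:     dma_start("input SPM", "shared mem");  return WAIT_DMA_IN_A;
    case WAIT_DMA_IN_A:  dma_start("shared mem", "ACC1 SPM");   return WAIT_DMA_IN_B;
    case WAIT_DMA_IN_B:  acc_start(1);                          return WAIT_ACC;
    case WAIT_ACC:       dma_start("ACC1 SPM", "shared mem");   return WAIT_DMA_OUT_A;
    case WAIT_DMA_OUT_A: dma_start("shared mem", "output SPM"); return WAIT_DMA_OUT_B;
    case WAIT_DMA_OUT_B: /* output interface starts draining */ return WAIT_OUTPUT;
    default:                                                    return JOB_DONE;
    }
}

int main(void)
{
    int irqs = 0;
    ilp_state_t s = WAIT_INPUT;
    while (s != JOB_DONE) {
        irqs++;          /* every "done" event interrupts the ILP...        */
        s = on_irq(s);   /* ...which must then issue the next start command */
    }
    printf("completion interrupts serviced by the ILP per job (single ACC): %d\n", irqs);
    return 0;
}
```

Even for a single ACC, every job costs the ILP a handful of interrupts plus the corresponding DMA and ACC start commands; with many ACCs and small jobs this per-job overhead is what the later slides quantify.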

MPSoC with many accelerators
- Scratchpad memories (SPMs): two per accelerator and one per I/O interface, each sized to hold an input job
- Control and interrupt lines for ACC configuration and status
- Centralized vs. dedicated DMA engines for streaming data transfers

Challenges with an increasing number of accelerators
1. Memory requirement: two SPMs per ACC and one SPM per I/O interface, plus shared memory to hold the data handed over between accelerators
2. High volume of traffic over the system fabric: there are no point-to-point connections between ACCs, so every handover requires DMA data transfers
3. ILP synchronization load: the ILP must coordinate the accelerators, the I/O interfaces, and the DMA transfers

Previous work on composing ACCs
- Composing bigger applications out of many accelerators, e.g. Accelerator-Rich CMPs [1] and CHARM [2]
  - Imposes considerable traffic and considerable on-chip buffering for accelerator data exchange
  - Loads the ILP with orchestrating the system composed of accelerators

[1] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman. Architecture support for accelerator-rich CMPs. In Proceedings of the 49th Annual Design Automation Conference (DAC '12), pages 843-849, 2012.
[2] M. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks. The accelerator store framework for high-performance, low-power accelerator-based systems. IEEE Computer Architecture Letters, 9(2):53-56, Feb. 2010.

Quantitative exploration of the accelerator-rich MPSoC: why and how
- Purpose of the quantitative exploration
  - Quantify the potential challenges
  - Expose the ACC-rich bottlenecks as the number of ACCs increases
  - Help system architects properly size the system knobs (SPM sizes, number of ACCs, communication bandwidth)
  - Motivate our proposed architecture-template solution
- Approaches
  1. First-order mathematical analysis
  2. Simulation-based analysis of an ACC-rich MPSoC

Exploration overview
- Assumptions
  - One HD-resolution frame as input, divided into smaller jobs
  - All data kept in on-chip memory (off-chip memory is avoided for now)
- Exploration steps
  - Memory requirement as the number of ACCs increases
  - Sizing the SPMs to satisfy a given memory budget
  - Resulting interrupt-rate load on the ILP

Memory size analysis (calculation based)
- Memory size = SPMs + shared memory
  - Each SPM holds one job, so the job size determines the minimum size of the SPMs and of the shared memory
  - The shared memory holds all jobs exchanged among the ACCs
- Observations: more ACCs require more memory and bigger jobs require more memory, so the job size must be chosen with respect to a limited memory budget (a small sketch of this model follows)
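As a concrete, hypothetical instance of this first-order model, the sketch below sizes the on-chip memory from the slide's assumptions: two SPMs per ACC and one per I/O interface, each holding one job, plus a shared memory buffering the jobs handed between consecutive ACCs. The 32 KB job size and the I/O-interface count are placeholders, not numbers from the paper.

```c
/*
 * First-order memory model, as a sketch: memory = SPMs + shared memory.
 * Assumes 2 SPMs per ACC and 1 SPM per I/O interface (each sized for one
 * job), plus one shared handover buffer between consecutive ACCs.  The
 * numbers in main() are placeholders.
 */
#include <stdio.h>

static size_t memory_bytes(unsigned n_acc, unsigned n_io, size_t job_bytes)
{
    size_t spm    = (2u * n_acc + n_io) * job_bytes;          /* per-ACC and per-I/O SPMs    */
    size_t shared = (n_acc > 0 ? n_acc - 1 : 0) * job_bytes;  /* jobs in flight between ACCs */
    return spm + shared;
}

int main(void)
{
    const size_t job = 32u * 1024;                 /* placeholder: 32 KB per job */
    for (unsigned n = 4; n <= 64; n *= 2)
        printf("%2u ACCs -> %zu KB of on-chip memory\n",
               n, memory_bytes(n, 2, job) / 1024);
    return 0;
}
```

Both terms scale linearly with the number of ACCs for a fixed job size, which is the "more ACCs require more memory" observation above.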

Job sizing (calculation based)
- The smaller the memory budget, the smaller the job; likewise, the more accelerators, the smaller the job
- A smaller job size issues more interrupts to the ILP
  - It is the ILP's responsibility to synchronize the ACC transactions
  - So we count the number of interrupts and measure the ILP load required to service them (a sketch of this calculation follows)
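To see where the interrupt count comes from, the sketch below inverts the same model: fix a memory budget, solve for the largest job that fits, and divide an HD frame into jobs. The frame size, the 2 MB budget, and the "7 completion interrupts per job per ACC" factor (carried over from the single-ACC sequence sketched earlier) are illustrative assumptions, not measured values.

```c
/*
 * Job-sizing sketch: given an on-chip memory budget and a number of ACCs,
 * derive the largest job that fits, then the number of jobs per HD frame
 * and a rough interrupt count.  Buffer accounting matches the earlier
 * memory sketch; all constants are illustrative assumptions.
 */
#include <stdio.h>

#define HD_FRAME_BYTES (1920u * 1080u * 2u)   /* placeholder: 16-bit pixels   */
#define IRQS_PER_JOB_PER_ACC 7u               /* from the single-ACC sequence */

static size_t max_job_bytes(size_t budget, unsigned n_acc, unsigned n_io)
{
    /* invert: budget = (2*n_acc + n_io + (n_acc - 1)) * job_size */
    unsigned buffers = 2u * n_acc + n_io + (n_acc > 0 ? n_acc - 1 : 0);
    return budget / buffers;
}

int main(void)
{
    const size_t budget = 2u * 1024 * 1024;    /* placeholder: 2 MB on-chip */
    for (unsigned n = 4; n <= 64; n *= 2) {
        size_t job  = max_job_bytes(budget, n, 2);
        size_t jobs = (HD_FRAME_BYTES + job - 1) / job;   /* jobs per frame */
        printf("%2u ACCs: job = %6zu B, jobs/frame = %4zu, IRQs/frame ~ %7zu\n",
               n, job, jobs, jobs * IRQS_PER_JOB_PER_ACC * n);
    }
    return 0;
}
```

The trend is what matters: shrinking the budget or adding ACCs shrinks the job, multiplies the jobs per frame, and multiplies the interrupts the ILP must service, which is exactly what the simulation quantifies below.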

Simulation platform
- SpecC SLDL is used to develop a simulation model with
  - A scalable number of ACCs (with the same or different data rates)
  - ILPs, DMAs, and memories (SPMs, shared memory, on-chip and off-chip)
- The ACC-rich simulation model is generated through SCE refinement
  - Bus-functional model (BFM) of an AMBA AHB communication fabric
  - ARM9 (ISA v6) for ILP execution
  - Priority-based, dedicated interrupt lines
  - Centralized DMAs

Number of interrupts when scaling the number of ACCs (simulation based)
- Number of interrupts vs. number of accelerators, for different sizes of on-chip memory
- Smaller job sizes cause more interrupts to the ILP: significant utilization, or even over-saturation, of the ILP purely from driving the accelerators
- Table from the slide (partially garbled in the transcript; the axis labels are "#ACC" and "Memory Size (MB)", but the memory-size column headers and the first row are not fully recoverable):
    #ACC |  increasing on-chip memory size ->
     ... |   ...K   128K   512K     1M
       4 |    32K    64K   256K   512K
       9 |    16K    32K   128K   256K
      18 |     8K    16K    64K   128K
      34 |     4K     8K    32K    64K
      60 |     2K     4K    16K    32K

Communication overhead analysis (calculation based)
- Communication overhead = data exchanged through the system fabric
- The more ACCs, the heavier the traffic on the system fabric (a first-order sketch follows)
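The sketch below is a first-order model of that claim, under the assumption that every job crosses the system fabric by DMA on its way into and out of each ACC's SPM (more crossings if it is also staged through shared memory). The crossing factor and frame size are illustrative, not the paper's coefficients.

```c
/*
 * First-order fabric-traffic sketch for the current ACC-rich template:
 * every job is DMA-copied over the fabric into and out of each ACC's SPM,
 * so traffic per frame grows linearly with the number of ACCs.
 */
#include <stdio.h>

static unsigned long long fabric_bytes_per_frame(unsigned n_acc,
                                                 unsigned crossings_per_acc,
                                                 unsigned long long frame_bytes)
{
    return (unsigned long long)n_acc * crossings_per_acc * frame_bytes;
}

int main(void)
{
    const unsigned long long frame = 1920ull * 1080 * 2;   /* placeholder HD frame */
    for (unsigned n = 4; n <= 64; n *= 2)
        printf("%2u ACCs -> ~%llu MB over the fabric per frame\n",
               n, fabric_bytes_per_frame(n, 2, frame) >> 20);
    return 0;
}
```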

Exploration summary
- Problems associated with the current accelerator-rich architecture
  - On-chip memory requirements
  - ILP synchronization load
  - Heavy communication traffic on the system fabric
- Demand for an improved ACC-centric design that tackles the challenges of the current ACC-rich architecture

Goals of the proposed ACC-centric architecture
- An autonomous accelerator chain, relieving the ILP's synchronization load
- Point-to-point connections between accelerators
  - No need for a large SPM per accelerator
  - No frequent DMA data transfers
  - No heavy traffic on the system fabric

Simulation platform
- The developed SpecC model is modified to support an autonomous chain of accelerators, with gateways to manage the chain
- Another ACC-rich simulation model is created through SCE refinement
  - Bus-functional model (BFM) of an AMBA AHB communication fabric
  - ARM9 (ISA v6) for ILP execution
  - Dedicated interrupt lines from the gateways to the ILP
  - Centralized DMA

The proposed accelerator-centric architecture template
- Gateways, controlled by the ILP, manage the whole chain of accelerators
  - Each gateway has an SPM to receive/send data from/to memory
  - Control lines run from the ILP to the gateways for configuration; interrupt lines run from the gateways to the ILP
  - The accelerators are connected point to point in a chain, with small buffers in between
  - The chain works independently of the ILP
- Consequences: point-to-point accelerator connections mean little memory requirement and few DMA data transfers; the autonomous ACC chain means a light ILP synchronization load no matter how many accelerators there are
- Data flow: 1. DMA brings data into the input gateway's SPM; 2. the input gateway receives the data and starts passing it through the chain; 3. the chain works on the data; 4. the output gateway gathers the results in its SPM; 5. DMA moves the data back to memory (sketched below from the ILP's perspective)
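For contrast with the per-job state machine sketched for the current template, here is a hypothetical sketch of the ILP's per-frame role in the proposed template: configure the gateways, kick off one DMA, and wait for a single completion interrupt. The gateway_*/dma_* functions are illustrative placeholders, not a real driver API; the point is that the ILP-side work no longer depends on how many ACCs are in the chain.

```c
/*
 * Hypothetical sketch of the ILP's per-frame role in the proposed template.
 * The gateway and DMA helpers below are placeholder stubs, not a real
 * driver API; the chain itself streams jobs accelerator-to-accelerator
 * without involving the ILP.
 */
#include <stdio.h>

static void gateway_configure(const char *which)  { printf("configure %s gateway\n", which); }
static void dma_to_gateway(const char *buf)       { printf("DMA %s -> input gateway SPM\n", buf); }
static void wait_output_gateway_irq(void)         { printf("output gateway IRQ: frame done\n"); }
static void dma_from_gateway(const char *buf)     { printf("DMA output gateway SPM -> %s\n", buf); }

static void process_frame(const char *in_buf, const char *out_buf, unsigned n_acc)
{
    (void)n_acc;                    /* chain length is invisible to the ILP        */
    gateway_configure("input");
    gateway_configure("output");
    dma_to_gateway(in_buf);         /* step 1: data into the input gateway's SPM   */
    /* steps 2-4: gateways stream jobs through the point-to-point ACC chain        */
    wait_output_gateway_irq();      /* one completion interrupt, however many ACCs */
    dma_from_gateway(out_buf);      /* step 5: results back to memory              */
}

int main(void)
{
    process_frame("frame_in", "frame_out", 60);
    return 0;
}
```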

Evaluation: as the number of ACCs grows
- Interrupts: the current architecture shows exponential growth in interrupts; the proposed architecture keeps the same number of interrupts
- Fabric traffic: the current architecture sees heavier traffic; the proposed architecture keeps almost the same data traffic
- Job size: the current architecture forces smaller jobs; the proposed architecture keeps almost the same job size
- Memory requirement: the current architecture grows linearly; the proposed architecture stays almost constant

Summary
- Specialization is a growing trend in CMPs, leading to accelerator-rich architectures
- Exploration of the challenges in the current accelerator-rich architecture: memory requirement, communication overhead, synchronization load
- The proposed accelerator-centric architecture template: an autonomous accelerator chain with no large memory requirement, no heavy communication traffic, and no critical amount of required synchronization

Questions? Again, thanks to Professor Schirner for all his support… thanks to Hamed for everything I have been learning from him, and thank you to all ESL members for your attendance!