System Simulation Of 1000-cores Heterogeneous SoCs Shivani Raghav Embedded System Laboratory (ESL) Ecole Polytechnique Federale de Lausanne (EPFL)

Presentation transcript:

ESL Work on Energy-Aware Datacenter Design: System Simulation for many-core

Emerging Data-Intensive Workloads: Cloud Servers; Molecular Dynamics; Monte Carlo Simulations; Gene Sequencing; Online Gaming Services; Financial Simulations; Medical Imaging

Demand for Hardware Acceleration: Tile-based Manycores (Intel SCC, Tile 64; integrated); GPU Clusters (off-chip accelerators); Hybrid Cores (AMD Fusion; on-chip)

Urgent Need for Simulation of Heterogeneous SoCs. Simulation is needed for: Thermal & Power Evaluations; Benchmarking; Profiling; Debugging; Design Space Exploration; Early Software Development

How to Design a Fast and Scalable Many-Core Simulator? Parallel Target; Parallel Simulator; Parallel Host

Simulating Parallel Target on Parallel Host is an Old Technology… [Figure labels: host types FPGA, GPGPU, Large Parallel Systems; simulators Flexus, RAMP, WWT II, Graphite, Cotson, OVPSim; "Opportunity"]

Target Architecture: Data-Parallel Coprocessors; Simple In-order Cores; 1000s of cores in a tile network; Fine-grain parallelism. [Figure labels: Core, Caches, Memory, Switch]

Solution – Accelerating Simulation using GPGPUs. Target Architecture and Host Platform: A Perfect Match

Outline: Problem Overview (Simulation of Heterogeneous SoCs); Solution (SIMinG-1k, a GPU accelerated simulator); Evaluation; Summary

Outline: Problem Overview (Simulation of Heterogeneous SoCs); Solution (SIMinG-1k: A GPU accelerated simulator); Evaluation; Summary

Overall Simulation Framework. [Figure labels: Application; Sequential Code; Data Parallel Code; Target Architecture; General Purpose CPU; Many-Core Accelerator; Simulator; Host Platform; General Purpose CPU]

SIMinG-1k - Features: Instruction Accurate; Inexpensive and Easily Available; Fast Development Cycle; Equation Performance Model; Portability (Target Independent); Interpretation-based core simulation
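To make "instruction accurate" and "interpretation-based core simulation" concrete, here is a minimal sketch of an interpretive core model. It is not SIMinG-1k's implementation: the three-instruction ISA, the `Core` class, and the `countdown` helper are hypothetical names invented for illustration, and the loop over cores runs sequentially here, whereas a GPU-hosted simulator would presumably map each simulated core to a GPU thread.

```python
from dataclasses import dataclass, field

# Hypothetical toy ISA (assumption for illustration, not ARM and not SIMinG-1k).
ADDI, BNEZ, HALT = "addi", "bnez", "halt"

@dataclass
class Core:
    program: list                      # list of (opcode, a, b) tuples
    regs: list = field(default_factory=lambda: [0] * 4)
    pc: int = 0
    halted: bool = False
    retired: int = 0                   # instructions executed so far

def step(core):
    """Fetch, decode, and execute exactly one instruction on one core."""
    if core.halted:
        return
    op, a, b = core.program[core.pc]
    core.pc += 1
    if op == ADDI:                     # regs[a] += immediate b
        core.regs[a] += b
    elif op == BNEZ:                   # jump to absolute target b if regs[a] != 0
        if core.regs[a] != 0:
            core.pc = b
    elif op == HALT:
        core.halted = True
    else:
        raise ValueError(f"unknown opcode {op!r}")
    core.retired += 1

def run(cores):
    """Advance every core one instruction per round until all have halted."""
    while not all(c.halted for c in cores):
        for c in cores:                # sequential stand-in for per-core GPU threads
            step(c)
    return cores

def countdown(n):
    """Build a core that decrements r0 from n to 0, then halts."""
    core = Core(program=[(ADDI, 0, -1), (BNEZ, 0, 0), (HALT, 0, 0)])
    core.regs[0] = n
    return core

cores = run([countdown(n) for n in (1, 2, 3)])
print([c.retired for c in cores])      # per-core retired-instruction counts
```

Because the model only tracks architectural state (registers, program counter) and a retired-instruction count, it is instruction accurate but says nothing about cycle timing.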

Challenges of using a GPU as a host: SIMT (single instruction, multiple threads) execution, so divergent code is a problem; synchronization outside a thread block; slow CPU-GPU communication; global memory is slow and limited
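The divergence problem can be illustrated with a deliberately simplified cost model, assuming one GPU thread per simulated core, a warp of 32 threads, and one serialized pass per distinct opcode inside a warp. Both the `warp_passes` function and the opcode names are hypothetical; real GPU scheduling is more involved than this sketch.

```python
def warp_passes(opcodes, warp_size=32):
    """Count serialized passes if each warp handles one distinct opcode per pass.

    opcodes[i] is the opcode that simulated core i wants to execute this step.
    This is a simplified model (assumption), not a measurement of real hardware.
    """
    total = 0
    for start in range(0, len(opcodes), warp_size):
        warp = opcodes[start:start + warp_size]
        total += len(set(warp))        # threads with the same opcode run together
    return total

# 64 simulated cores all executing the same instruction type: no divergence.
uniform = ["add"] * 64

# 64 simulated cores cycling through four instruction types: every warp diverges.
mixed = ["add", "load", "store", "branch"] * 16

print(warp_passes(uniform), warp_passes(mixed))
```

Under this model the mixed case needs four times as many passes as the uniform case, which is the sense in which divergent code hurts an interpreter running under SIMT.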

Outline: Problem Overview (Simulation of Heterogeneous SoCs); Solution (SIMinG-1k, a GPU accelerated simulator); Evaluation; Summary

Results – Architecture 1. MIPS: millions of simulated instructions per second of host wall-clock time. [Figure: single tile of target accelerator; labels: ARM ISA, Data Scratchpad, Inst Scratchpad]
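As a worked illustration of the MIPS metric on this slide, the sketch below divides a simulated-instruction count by elapsed host time. The function name and the sample numbers are hypothetical and are not results from the presentation.

```python
def simulation_mips(simulated_instructions, host_seconds):
    """Millions of simulated instructions per second of host wall-clock time."""
    if host_seconds <= 0:
        raise ValueError("host_seconds must be positive")
    return simulated_instructions / (host_seconds * 1e6)

# Hypothetical run: 5 million simulated instructions in 2 seconds of host time.
print(simulation_mips(5_000_000, 2.0))
```

The instruction count is summed over all simulated cores, so the metric rewards simulators that keep many target cores progressing at once.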

Speed Up – Architecture 1. Speedup compared to simulation on OVPSim (thousands of ARM cores)
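Speedup in comparisons like this one is conventionally the baseline simulator's wall-clock time divided by the accelerated simulator's time for the same workload. A minimal sketch, with a hypothetical function name and made-up timings rather than figures from the slides:

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline wall-clock time to accelerated wall-clock time."""
    if baseline_seconds <= 0 or accelerated_seconds <= 0:
        raise ValueError("times must be positive")
    return baseline_seconds / accelerated_seconds

# Hypothetical timings: baseline simulator 120 s, GPU-hosted simulator 8 s.
print(speedup(120.0, 8.0))
```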

Results – Architecture 2. Single tile of data-parallel accelerator (cores, caches, on-chip interconnect). [Figure labels: Core, Caches, Memory, Switch]

Speed Up – Architecture 2. Speedup compared to serial simulation on QEMU

Outline: Problem Overview (Simulation of Heterogeneous SoCs); Solution (SIMinG-1k, a GPU accelerated simulator); Evaluation; Summary

Conclusion. Challenge: fast and parallel simulator for heterogeneous SoCs. Solution: parallelize 1000-core simulation using GPUs. Design: full system simulation using QEMU and SIMinG-1k. Results: high scalability and speedup up to 4096 cores. Future Work: extend the simulator for thermal and power evaluations; complete simulation of cloud data centers.

Thanks! Questions?