
NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning
Gagandeep Singh, Juan Gomez-Luna, Giovanni Mariani, Geraldo F. Oliveira, Stefano Corda, Sander Stuijk, Onur Mutlu, Henk Corporaal
56th Design Automation Conference (DAC), Las Vegas, June 4, 2019
Funded by the Horizon 2020 Framework Programme of the European Union (MSCA-ITN-EID)

Executive Summary
- Motivation: A promising paradigm to alleviate the data movement bottleneck is near-memory computing (NMC), which places compute units close to the memory subsystem
- Problem: Simulation of NMC architectures is extremely slow, imposing long run-times, especially during early-stage design space exploration
- Goal: A quick, high-level performance and energy estimation framework for NMC architectures
- Our contribution: NAPEL
  - Fast and accurate performance and energy prediction for previously-unseen applications using ensemble learning
  - Uses intelligent statistical techniques and microarchitecture-independent application features to minimize experimental runs
- Evaluation:
  - NAPEL is, on average, 220x faster than a state-of-the-art NMC simulator
  - Average error rates of 8.5% and 11.6% for performance and energy estimation
  - We open-source Ramulator-PIM: https://github.com/CMU-SAFARI/ramulator-pim/


Massive Amounts of Data
[Figure: data volumes across domains — SKA 300 PB; uploads 180 PB; searches 98 PB; 15 PB; 15 PB; 3 PB]
Michael Wise, ASTRON, "Science Data Centre Challenges", DOME Symposium, 18 May 2017


Compute-Centric Approach
- Memory hierarchies take advantage of locality (spatial and temporal)
  - Not suitable for all workloads, e.g., graph processing, neural networks
- Data access consumes a major part of system energy
  - Applications are increasingly data-hungry
  - Data movement energy dominates compute energy, especially for off-chip movement (DDR I/O, DDR chip, link)
=> The data movement bottleneck
[Figure: system-level power breakdown*]
* R. Nair et al., "Active Memory Cube: A processing-in-memory architecture for exascale systems", IBM J. Research Develop., vol. 59, no. 2/3, 2015

Paradigm Shift - NMC
- From a compute-centric to a data-centric approach
- Biggest enabler: stacking technology


NMC Simulators
Simulation is used for:
- Design space exploration (DSE)
- Workload suitability analysis
NMC simulators:
- Sinuca, 2015
- HMC-SIM, 2016
- CasHMC, 2016
- Smart Memory Cube (SMC), 2016
- CLAPPS, 2017
- Gem5+HMC, 2017
- Ramulator-PIM¹, 2019
Simulation of real workloads can be 10,000x slower than native execution!
¹Ramulator-PIM: https://github.com/CMU-SAFARI/ramulator-pim/

NMC Simulators
Idea: Leverage ML with statistical techniques for quick NMC performance/energy prediction

NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning
NAPEL Model Training

Phase 1: LLVM Analyzer
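Conceptually, the analyzer extracts microarchitecture-independent features (e.g., instruction mix, memory footprint) from the application's LLVM IR. A minimal sketch, assuming an illustrative opcode taxonomy and trace format (not NAPEL's actual analyzer output):

```python
from collections import Counter

# Illustrative opcode categories (an assumption, not NAPEL's exact taxonomy)
MEM_OPS = {"load", "store"}
FP_OPS = {"fadd", "fmul", "fdiv"}
CTRL_OPS = {"br", "call", "ret"}

def instruction_mix(opcodes):
    """Fraction of each category in a dynamic LLVM-IR opcode trace."""
    counts = Counter()
    for op in opcodes:
        if op in MEM_OPS:
            counts["mem"] += 1
        elif op in FP_OPS:
            counts["fp"] += 1
        elif op in CTRL_OPS:
            counts["ctrl"] += 1
        else:
            counts["int"] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def memory_footprint(addresses, line_size=64):
    """Total bytes of distinct cache lines touched by the address trace."""
    return len({addr // line_size for addr in addresses}) * line_size

# Toy trace: two memory ops out of six instructions
mix = instruction_mix(["load", "add", "fmul", "store", "br", "add"])
```

Such features are microarchitecture-independent because they depend only on the program's dynamic behavior, not on any particular core or cache configuration.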

Phase 2: Microarchitecture Simulation
Uses the central composite design (CCD) of experiments technique to minimize the number of simulation runs during training-data collection
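A central composite design combines factorial corner points, axial points, and a center point so that a quadratic response surface can be fitted from few runs. A minimal sketch of generating a face-centered CCD in coded units (illustrative only; NAPEL's actual DoE tooling may differ):

```python
from itertools import product

def central_composite(k, alpha=1.0):
    """Face-centered CCD for k factors in coded units [-1, 1]:
    2^k factorial corners + 2k axial points + 1 center point."""
    factorial = [list(p) for p in product((-1, 1), repeat=k)]
    axial = []
    for i in range(k):
        for sign in (-alpha, alpha):
            point = [0.0] * k
            point[i] = sign
            axial.append(point)
    center = [[0.0] * k]
    return factorial + axial + center

# 3 design parameters (e.g., #PEs, frequency, cache line size, each
# mapped to coded units): 8 + 6 + 1 = 15 runs instead of a full sweep
design = central_composite(3)
```

Each coded point is then mapped back to concrete architecture parameter values and handed to the simulator, so only these runs (rather than the full cross-product of all parameter values) need to be simulated.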

Phase 3: Ensemble ML Training
Application features:
- Instruction mix
- ILP
- Reuse distance
- Memory traffic
- Register traffic
- Memory footprint
Architecture features:
- Core type
- #PEs
- Core frequency
- Cache line size
- DRAM layers
- Cache access fraction
- DRAM access fraction

NAPEL Framework

NAPEL Prediction

Experimental Setup
- Host system: IBM POWER9 (power measured with AMESTER)
- NMC subsystem: Ramulator-PIM¹
- Workloads: PolyBench and Rodinia — heterogeneous workloads such as image processing, machine learning, and graph processing
- Accuracy reported in terms of mean relative error (MRE)
¹https://github.com/CMU-SAFARI/ramulator-pim/
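For concreteness, mean relative error can be computed as follows; the example values are illustrative only:

```python
def mean_relative_error(predicted, measured):
    """MRE: average of |prediction - measurement| / measurement."""
    assert len(predicted) == len(measured)
    return sum(abs(p - m) / m
               for p, m in zip(predicted, measured)) / len(predicted)

# Predictions 10% above and 20% below the measurements -> MRE of 15%
mre = mean_relative_error([110, 80], [100, 100])  # -> 0.15
```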


NAPEL Accuracy: Performance and Energy Estimates
[Figure: (a) Performance prediction, (b) Energy prediction]
MRE of 8.5% and 11.6% for performance and energy, respectively


Speed of Evaluation
256 DoE configurations for 12 evaluated applications
NAPEL is, on average, 220x (up to 1039x) faster than the NMC simulator

Use Case: NMC Suitability Analysis
- Assess the potential of offloading a workload to NMC
- NAPEL provides accurate prediction of NMC suitability: MRE between 1.3% and 26.3% (average 14.1%)

Conclusion and Summary
- Motivation: A promising paradigm to alleviate the data movement bottleneck is near-memory computing (NMC), which places compute units close to the memory subsystem
- Problem: Simulation of NMC architectures is extremely slow, imposing long run-times, especially during early-stage design space exploration
- Goal: A quick, high-level performance and energy estimation framework for NMC architectures
- Our contribution: NAPEL
  - Fast and accurate performance and energy prediction for previously-unseen applications using ensemble learning
  - Uses intelligent statistical techniques and microarchitecture-independent application features to minimize experimental runs
- Evaluation:
  - NAPEL is, on average, 220x faster than a state-of-the-art NMC simulator
  - Average error rates of 8.5% and 11.6% for performance and energy estimation
  - We open-source Ramulator-PIM: https://github.com/CMU-SAFARI/ramulator-pim/

NAPEL: Near-Memory Computing Application Performance Prediction via Ensemble Learning
Gagandeep Singh, Juan Gomez-Luna, Giovanni Mariani, Geraldo F. Oliveira, Stefano Corda, Sander Stuijk, Onur Mutlu, Henk Corporaal
56th Design Automation Conference (DAC), Las Vegas, June 4, 2019
Funded by the Horizon 2020 Framework Programme of the European Union (MSCA-ITN-EID)