Cross-Architecture Performance Prediction (XAPP): Using CPU to Predict GPU Performance. Newsha Ardalani, Clint Lestourgeon, Karthikeyan Sankaralingam, Xiaojin Zhu.


Cross-Architecture Performance Prediction (XAPP): Using CPU to Predict GPU Performance
Newsha Ardalani, Clint Lestourgeon, Karthikeyan Sankaralingam, Xiaojin Zhu
University of Wisconsin-Madison

Executive Summary
- Problem: GPU programming is challenging, and it is not clear in advance how much speedup is achievable.
- Goal: Save programmers' time by avoiding unnecessary porting effort.
- Insight: Speedup is correlated with program properties and hardware characteristics, and machine learning can learn this correlation.
- Results: 27% relative error across 24 benchmarks.

Outline
- Problem Statement
- Insight
- Overview
- Machine Learning Technique
- Results
- Future Work

Problem Statement
For a given GPU platform, predict the optimized GPU execution time obtainable for any CPU application, prior to starting the code development process.

Insight
GPU execution time is a function of GPU hardware characteristics (e.g., number of cores) and program properties:
- Available parallelism: an inherent program property
- Cache misses: a platform-dependent program property

GPU execution time = F(hardware characteristics, program properties)
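The relationship above can be sketched as a simple parametric model. This is an illustrative sketch only; the feature names, linear form, and coefficient values below are hypothetical, not the model XAPP actually learns:

```python
# Hypothetical sketch: GPU execution time as a function of program
# properties, with hardware characteristics folded into the learned
# coefficients. Names and values are illustrative, not from the paper.

def predict_gpu_time(features, coeffs, intercept):
    """Evaluate a linear model over named program-property features."""
    return intercept + sum(coeffs[name] * value
                           for name, value in features.items())

features = {
    "available_parallelism": 0.8,  # inherent program property
    "cache_miss_rate": 0.05,       # platform-dependent program property
}
coeffs = {"available_parallelism": -2.0, "cache_miss_rate": 10.0}
print(predict_gpu_time(features, coeffs, intercept=3.0))  # 1.9
```

The actual model is learned from data rather than hand-specified, and need not be linear; the point is only that execution time is expressed as a function of measurable properties.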

Overview
[Diagram] Training: each CPU program in the training set is run under dynamic binary instrumentation to extract a feature vector, and its GPU counterpart is run on GPU platform X to measure GPU execution time; the resulting (feature vector, execution time) pairs form the training data for a machine learning model F_x. Prediction: a new CPU program's feature vector is fed to F_x(feature vector) to obtain the predicted GPU execution time.
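The training half of this pipeline can be sketched as a small helper. `extract_features` and `measure_gpu_time` below are hypothetical placeholders standing in for the instrumentation and GPU-timing steps, not real tools:

```python
# Sketch of assembling XAPP-style training data: pair each CPU program's
# feature vector with its GPU port's measured execution time.

def build_training_set(programs, extract_features, measure_gpu_time):
    """One (feature_vector, gpu_time) pair per training program."""
    return [(extract_features(p), measure_gpu_time(p)) for p in programs]

# Toy stand-ins: "features" = program-name length, times from a lookup table.
progs = ["bfs", "sgemm"]
times = {"bfs": 1.2, "sgemm": 0.4}
print(build_training_set(progs, lambda p: [len(p)], times.get))
# [([3], 1.2), ([5], 0.4)]
```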

Machine Learning Technique
[Diagram] From a training set of n points, p training subsets of m points each are drawn. Step-wise regression is run on each subset, producing p predictions (Prediction 1 … Prediction p), which are combined by majority selection into the final prediction.
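A minimal sketch of this ensemble idea, assuming plain one-variable least squares as a stand-in for step-wise regression and the median as a stand-in for the paper's majority-selection rule:

```python
import random
import statistics

def fit_ols_1d(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for step-wise regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def ensemble_predict(train, p, m, x, seed=0):
    """Fit p models on random m-point subsets; combine predictions via the median."""
    rng = random.Random(seed)
    preds = []
    for _ in range(p):
        subset = rng.sample(train, m)
        a, b = fit_ols_1d([s[0] for s in subset], [s[1] for s in subset])
        preds.append(a * x + b)
    return statistics.median(preds)

# Noise-free linear data: every subset recovers y = 2x + 1, so the
# ensemble prediction at x = 10 is 21.
data = [(float(i), 2.0 * i + 1.0) for i in range(20)]
print(ensemble_predict(data, p=5, m=8, x=10.0))  # 21.0
```

Training on many small subsets and combining the resulting predictors reduces the variance of any single regression fit, which is the motivation behind ensemble prediction discussed in the paper.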


Results

Question                                                Metric      Result
How close is the predicted speedup to the actual one?   Accuracy    High
How robust is it across different GPU platforms?        Robustness  High
How much programmer involvement is required?            Usability   High
How quickly does it provide a prediction?               Speed       High

Experimental Setup
- Accuracy results on a GPU GTX 750
- Training set: 112 data points from Rodinia, Lonestar, Parboil, a Parsec subset, and a NAS subset
- Test set: 24 data points
- Feature vector: 31 program properties
- Program properties collected using MICA and Pin
- Execution time measured on real GPU hardware

Accuracy Results
- Platform 1 (GTX 750): 27% relative error
- Platform 2 (GTX 660): 36% relative error
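Relative error here measures the gap between predicted and measured values, averaged over the benchmarks; a minimal sketch of that metric (the paper's exact averaging convention is not shown on this slide):

```python
def mean_relative_error(predicted, actual):
    """Average of |pred - actual| / actual across benchmarks."""
    errs = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return sum(errs) / len(errs)

# Two toy benchmarks: 10% and 20% off -> 15% mean relative error.
print(mean_relative_error([9.0, 12.0], [10.0, 10.0]))  # ~0.15
```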

How to Use Our Tool: Model Construction Phase
[Diagram] The model construction phase is provided with the tool: training CPU programs are converted to feature vectors, their GPU counterparts are timed on GPU platform X, and the machine learning model F_x is trained on the resulting (feature vector, execution time) pairs.

How to Use Our Tool: Usage Phase
[Diagram] START → run the CPU program under dynamic binary instrumentation to extract its feature vector → evaluate F_x(feature vector) → speedup prediction → STOP.

Overhead: One-time Cost
[Diagram] Timeline of the one-time model construction cost per GPU platform (0 min → 30 min → ~2.5 hr): collecting feature vectors via dynamic binary instrumentation and GPU execution times on platform X, then training the model F_x.

Overhead: Recurring Cost
[Diagram] The per-program recurring overhead is small: each new CPU program is run once under dynamic binary instrumentation, and evaluating F_x(feature vector) to produce the predicted GPU execution time takes ~1 ms.

What It Is Not
[Diagram: the overview figure repeated, highlighting components the tool does not require]

Also in the Paper
- Why ensemble prediction?
- Model interpretation
- End-to-end case studies

Summary
[Diagram: the CPU-to-GPU prediction approach, with possible extensions to other targets such as FPGA and Xeon Phi]

Questions?