
The Problem Finding a needle in a haystack An expert (CPU) A group of non-experts (GPU)

Micro-benchmarking GPU micro-architectures Suhas Thejaswi Muniyappa Department of Computer Science Aalto University

Overview Micro-processor trends: CPU, GPU. Micro-benchmarking: pointer chase, fine-grain pointer chase, piecewise linear fine-grain pointer chase. Hardware characteristics.

CPU micro-processor trend Hardware support for advanced instructions. Availability of hardware documentation. Expensive hardware. No significant change in per-core performance over the past decade. Execution must be parallelized to achieve speedup.

GPU micro-processor trend Low hardware cost. High arithmetic and memory bandwidth. Thousands of cores. Built for processing graphics. No hardware support for advanced instructions. Limited documentation of memory hierarchy. How to overcome the limitations of GPUs?

Micro-benchmarking Probing the system to reveal hardware details. Using access latency to determine the hardware architecture. Details of the memory system are necessary to achieve optimal hardware performance.

Pointer chase Saavedra and Smith (1995) proposed this benchmarking approach for CPUs. Each array element is initialized with the index of the next memory access. Access latency depends on the stride size. The average memory access latency is recorded.

Fine-grain pointer chase Record and analyze the latency of every memory access. Mei and Chu (2016) designed fine-grain benchmarks for GPUs. Access latencies are stored in shared memory. Shared memory is not sufficient for large arrays.

Piecewise fine-grain pointer chase Uses disk storage: after each iteration, the shared memory contents are written to disk. A sliding window approach records the access latencies.

Hardware characteristics L1 cache From the measured access latencies, hardware characteristics such as the L1 cache parameters are deduced.

Summary GPUs can be used for general-purpose computations. GPUs provide an environment for executing algorithms that scale. Details of the memory system are necessary to achieve optimal hardware performance. Benchmarking reveals characteristics of the hardware that are not disclosed by the hardware manufacturers.

References [1] Mei, X., and Chu, X. Dissecting GPU memory hierarchy through microbenchmarking. IEEE Transactions on Parallel and Distributed Systems, preprint (2016), 1. [2] Mei, X., Zhao, C., and Chu, X. Benchmarking the memory hierarchy of modern GPUs. Network and Parallel Computing: 11th IFIP WG 10.3 International Conference (NPC) (2014), 144-156. [3] Saavedra, R. H., and Smith, A. J. Measuring cache and TLB performance and their effect on benchmark runtimes. IEEE Transactions on Computers 44, 10 (1995), 1223-1235. [4] Saavedra, R. H. CPU performance evaluation and execution time prediction using narrow spectrum benchmarking. PhD thesis, University of California, Berkeley, 1992.

Questions?

Thank you