Evaluating Multi-Core Processors for Data-Intensive Kernels
Alexander van Amesfoort, Delft University of Technology
Joint work with Ana Varbanescu, Rob van Nieuwpoort, and Henk Sips

Slide 2: Outline
- Data-intensive applications
- Gridding
- Platforms
- Implementation strategies
- Measurements
- Guidelines
- Conclusion

Slide 3: Data-Intensive Applications
- Low arithmetic intensity (compute : communication ratio)
  - demands drastic application- and platform-specific optimization effort
- Difficult platform choice
- The memory wall is still getting bigger
- A data-intensive study is therefore worthwhile: provide guidelines and insight into performance behavior and effort

Slide 4: Radio Astronomy Imaging
- Gridding places irregularly spaced samples onto a regular grid
- (De)gridding consumes most of the time in imaging
- Use gridding as an HPC streaming kernel

Slide 5: Gridding (W-Projection)
- Unpredictable, sparse access patterns
- Low arithmetic intensity (0.33 flop/byte)

    forall (i = 0..N_freq; j = 0..N_samples)    // for all samples
        g_index = func1((u, v, w)[j], freq[i]);
        c_index = func2((u, v, w)[j], freq[i]);
        for (x = 0; x < SUPPORT; x++)           // sweep the convolution kernel
            G[g_index + x] += C[c_index + x] * V[i, j];

- Parameterize these properties
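
To make the pseudocode concrete, here is a minimal serial C sketch of the same loop. The array sizes and the bodies of func1/func2 are placeholder assumptions (the real functions map a sample's (u, v, w) coordinates and frequency to grid and convolution-kernel offsets); only the loop structure follows the slide. The comment on the inner statement shows one way to arrive at the 0.33 figure: a single-precision complex multiply-add costs 8 flops against 24 bytes of memory traffic (read C, read and write G), assuming V[i][j] stays in a register.

    #include <complex.h>
    #include <math.h>
    #include <stdlib.h>

    #define N_FREQ    16
    #define N_SAMPLES 1024
    #define SUPPORT   64
    #define GRID_DIM  4096
    #define GRID_SIZE ((size_t)GRID_DIM * GRID_DIM)
    #define CONV_SIZE ((size_t)SUPPORT * SUPPORT * 128)

    /* Placeholder index functions: stand-ins for the W-projection mapping
     * from (u, v, w) and frequency to grid / convolution-kernel offsets. */
    static size_t func1(const float uvw[3], float freq) {
        size_t u = (size_t)fabsf(uvw[0] * freq) % GRID_DIM;
        size_t v = (size_t)fabsf(uvw[1] * freq) % (GRID_DIM - 1);
        return (v * GRID_DIM + u) % (GRID_SIZE - SUPPORT);
    }
    static size_t func2(const float uvw[3], float freq) {
        return (size_t)fabsf(uvw[2] * freq) % (CONV_SIZE - SUPPORT);
    }

    /* Serial gridding, as in the pseudocode above. The inner statement is a
     * single-precision complex multiply-add: 8 flops against 24 bytes moved
     * (read C, read and write G), i.e. 8/24 = 0.33 flop/byte. */
    void gridding(float complex *G, const float complex *C,
                  const float complex V[N_FREQ][N_SAMPLES],
                  const float uvw[N_SAMPLES][3], const float *freq) {
        for (int i = 0; i < N_FREQ; i++)
            for (int j = 0; j < N_SAMPLES; j++) {
                size_t g_index = func1(uvw[j], freq[i]);
                size_t c_index = func2(uvw[j], freq[i]);
                for (int x = 0; x < SUPPORT; x++)   /* sweep the convolution kernel */
                    G[g_index + x] += C[c_index + x] * V[i][j];
            }
    }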

Slide 6: Platforms and Test Setup
- All platforms provide high flop/byte ratios
- [Table: per-platform core count, clock (GHz), local memory (kB), peak compute (GFlop/s), and flop/byte ratio, for a dual Xeon 5320 (2 x 4 cores, 2 x 4096 kB L2 per chip), a Core i7 (HT, 256 kB L2 per core plus a shared L3), a PS3 Cell, a QS21 Cell, a GeForce 8800 GTX, and a newer GeForce GTX]
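
The flop/byte ratio in the table is simply peak compute divided by peak memory bandwidth. A small worked example, using the GeForce 8800 GTX's commonly cited peak figures (an assumption; the talk's exact numbers may differ slightly):

    #include <stdio.h>

    int main(void) {
        /* GeForce 8800 GTX, commonly cited peaks (assumed here):
         * compute:   128 SPs x 1.35 GHz x 2 flops (MAD) = 345.6 GFlop/s
         * bandwidth: 384-bit bus x 900 MHz GDDR3 x 2    =  86.4 GB/s   */
        double gflops = 128 * 1.35 * 2;       /* 345.6 */
        double gbytes = (384 / 8) * 0.9 * 2;  /*  86.4 */
        printf("machine balance = %.2f flop/byte\n", gflops / gbytes);  /* 4.00 */
        return 0;
    }

Since every platform on the slide provides a high flop/byte ratio, the gridding kernel at 0.33 flop/byte sits far below machine balance everywhere, so all of these platforms are memory-bound on this workload.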

Slide 7: Implementation Strategies
- CPU (pthreads): replicated grid, master-worker queues, SIMD (sketch after this list)
- Cell/B.E. (Cell SDK): master-worker queues, SIMD, double buffering, PPE multi-threading, line reuse
- GPU (CUDA): replicated grid, 1D texturing of the convolution matrix
- Similar at a high level, but different, non-portable code
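
As an illustration of the CPU strategy, a minimal pthreads sketch, assuming a trivial counter-based job queue and a stubbed-out grid_job (the real body would be the gridding loop of slide 5). Each worker accumulates into its own private grid replica, so writes never conflict, and the replicas are reduced at the end.

    #include <pthread.h>
    #include <complex.h>
    #include <stdlib.h>

    #define N_WORKERS 4
    #define N_JOBS    256
    #define GRID_SIZE (1024 * 1024)

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_job = 0;                 /* master-worker queue: just a counter */
    static float complex *grids[N_WORKERS];  /* one private grid replica per worker */

    /* Stand-in for the gridding loop of slide 5, applied to one job's samples. */
    static void grid_job(int job, float complex *grid) {
        grid[job % GRID_SIZE] += 1.0f;
    }

    static void *worker(void *arg) {
        int id = (int)(long)arg;
        for (;;) {
            pthread_mutex_lock(&queue_lock);   /* pull the next job from the queue */
            int job = next_job++;
            pthread_mutex_unlock(&queue_lock);
            if (job >= N_JOBS) break;
            grid_job(job, grids[id]);          /* accumulate into private replica */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[N_WORKERS];
        for (int i = 0; i < N_WORKERS; i++) {
            grids[i] = calloc(GRID_SIZE, sizeof(float complex));
            pthread_create(&t[i], NULL, worker, (void *)(long)i);
        }
        for (int i = 0; i < N_WORKERS; i++)
            pthread_join(t[i], NULL);
        for (int i = 1; i < N_WORKERS; i++)    /* reduce replicas into grids[0] */
            for (size_t k = 0; k < GRID_SIZE; k++)
                grids[0][k] += grids[i][k];
        return 0;
    }

The replicated grid trades memory (one copy per worker) for lock- and atomic-free accumulation on the hot path; the slide's GPU variant rests on the same idea.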

Slide 8: CPU Experiments
- The Core i7 suffers less from irregular accesses
- Still, 3x more locality is needed
- Hyper-Threading shows a lot of benefit

Slide 9: Cell/B.E. Experiments
- Achieves the highest performance
- Could perform much better with more work
- Some optimizations were applied to the computation

Slide 10: GPU Experiments
- Write conflicts in the grid are problematic
- Also requires much more work/locality
- Tesla C1060 results could not be explained

Slide 11: Discussion
- Reached good speedups, but still far below peak
- Required a lot of effort
- Best performance on the Cell/B.E.
  - but the best choice depends on application requirements
- GPUs suit workloads with lots of data parallelism
  - can exploit 2D or 3D spatial locality
- Don't underestimate standard CPUs
  - flexibility, availability, cost, and ease of programming

Slide 12: Guidelines
Good performance requires:
- regular data accesses
- data reuse between independent samples
Or else: suffer, or redesign the algorithm:
- conceptually, resolve irregularity at a higher level
- avoid write conflicts
- stream jobs: overlap/multi-buffering throughout the memory hierarchy (sketch after this list)
- parameterized job size
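
The streaming guideline, sketched as a two-buffer (double-buffered) pipeline that fetches the next job while computing on the current one. fetch_async and fetch_wait are hypothetical stand-ins for whatever the platform offers (asynchronous DMA on the Cell SPEs, a prefetch thread on a CPU); they run synchronously here only so the sketch stays self-contained, and JOB_SIZE is the parameterized job size from the last bullet.

    #include <string.h>

    #define JOB_SIZE 4096                /* parameterized job size */
    #define N_JOBS   64

    static float input[N_JOBS][JOB_SIZE];
    static float buf[2][JOB_SIZE];       /* compute on one buffer, fill the other */

    /* Placeholders for the platform's asynchronous transfer; synchronous here. */
    static void fetch_async(float *dst, const float *src) {
        memcpy(dst, src, sizeof(float) * JOB_SIZE);
    }
    static void fetch_wait(void) { /* would block until the transfer completes */ }

    static void compute(const float *job) { (void)job; /* gridding work goes here */ }

    int main(void) {
        int cur = 0;
        fetch_async(buf[cur], input[0]);             /* prime the pipeline */
        fetch_wait();
        for (int j = 0; j < N_JOBS; j++) {
            int nxt = cur ^ 1;
            if (j + 1 < N_JOBS)
                fetch_async(buf[nxt], input[j + 1]); /* overlap: fetch the next job... */
            compute(buf[cur]);                       /* ...while computing the current one */
            fetch_wait();
            cur = nxt;
        }
        return 0;
    }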

Slide 13: Conclusion
Challenges:
- platform choice
- fitting the application onto the platform
Similar strategies, but different implementations
Provided guidelines focusing on memory and data optimization
- or change the algorithm