Productive Performance Tools for Heterogeneous Parallel Computing
Allen D. Malony
Department of Computer and Information Science, University of Oregon
Shigeo Fukuda, Lunch with a Helmet On

SOS 2010 Panel (March 9, 2010): Productive Performance Tools for Heterogeneous Computing

Heterogeneous Implications for Performance Tools
- Tools should support parallel computation models
- Current status quo is comfortable
- Mostly homogeneous parallel systems and software
- Shared-memory multithreading (OpenMP)
- Distributed-memory message passing (MPI)
- Parallel computational models are relatively stable (simple)
- Corresponding performance models are relatively tractable
- Parallel performance tools are just keeping up
- Heterogeneity creates richer computational potential
- Results in greater performance diversity and complexity
- Performance tools have to support richer computation models and broader (less constrained) performance perspectives

Heterogeneous Performance Perspective
- Want to create performance views that capture heterogeneous concurrency and execution behavior
- Reflect execution logic beyond standard actions
- Capture performance semantics at multiple levels
- Heterogeneous applications have concurrent execution
- Want to capture performance for all execution paths
- Consider the “host” path and “external” paths
- What perspective does the host have of the external entity?
- Determines the semantics of the measurement data
- Existing parallel performance tools are CPU (host)-centric
- Event-based sampling (not appropriate for accelerators)
- Probe-based measurement
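To make the contrast with event-based sampling concrete, here is a minimal sketch of probe-based measurement. The real tools discussed here are C/C++ systems; this Python sketch, the `probe` decorator, and the event name are all invented for illustration. The idea is that probes inserted around a host-side action (a kernel launch, a memory transfer) accumulate inclusive time per named event:

```python
import time
from collections import defaultdict

# Per-event profile: name -> [call count, inclusive seconds].
profile = defaultdict(lambda: [0, 0.0])

def probe(name):
    """Wrap a host-side action with enter/exit probes that
    accumulate inclusive time, probe-based-measurement style."""
    def wrap(fn):
        def probed(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                rec = profile[name]
                rec[0] += 1
                rec[1] += time.perf_counter() - t0
        return probed
    return wrap

@probe("fake_kernel_launch")
def fake_kernel_launch(n):
    # Stand-in for work handed off to an external device.
    return sum(range(n))

for _ in range(3):
    fake_kernel_launch(10000)

for name, (calls, secs) in profile.items():
    print(f"{name}: {calls} calls, {secs:.6f} s inclusive")
```

Unlike sampling, nothing is inferred from interrupts: every event is recorded exactly when the probe fires, which is why this style remains usable where sampling the accelerator is not possible.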

Heterogeneous Performance Complexity Issues
- Asynchronous execution (concurrency)
- Memory transfer and memory sharing
- Interactions between heterogeneous components
- Interference between heterogeneous components
- Different programming languages/libraries/tools
- Availability of performance data
- ...

TAUcuda Performance Measurement
- CUDA performance measurement
- Integrated with the TAU performance system
- Built on an experimental Linux CUDA driver (R190.86)
- Captures CUDA device (cuXXX) events
- Captures CUDA runtime (cudaYYY) events
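One generic way a tool can observe runtime (cudaYYY) events is to interpose a logging wrapper around each runtime entry point; TAUcuda itself builds on experimental driver support, as noted above, so this is only a sketch of the interception idea. The `_fake_cudaMemcpy` stand-in and the `intercept` helper are invented for illustration:

```python
import time

# Trace buffer: (event_name, start, stop) tuples.
event_log = []

def _fake_cudaMemcpy(dst, src):
    # Stand-in for the real runtime call; here it just copies a list.
    dst[:] = src

def intercept(name, fn):
    """Return a wrapper that records an entry/exit event around fn,
    the way an interposition layer wraps runtime entry points."""
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        event_log.append((name, t0, time.perf_counter()))
        return result
    return wrapper

# The application calls the wrapped name; the tool sees every call.
cudaMemcpy = intercept("cudaMemcpy", _fake_cudaMemcpy)

dst = [0] * 4
cudaMemcpy(dst, [1, 2, 3, 4])
print(event_log[0][0], dst)
```

The wrapper forwards the call unchanged, so the application's behavior is preserved while the tool accumulates a timestamped event stream it can later correlate with host-side activity.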

TAUcuda Experimentation Environment
- University of Oregon
  - Linux workstation: dual quad-core Intel Xeon, GTX 280
  - GPU cluster (Mist): four dual quad-core Intel Xeon server nodes, two S1070 Tesla servers (4 Tesla GPUs per S1070)
- Argonne National Laboratory
  - 100 dual quad-core NVIDIA Quadro Plex S4
  - 200 Quadro FX5600 (2 per S4)
- University of Illinois at Urbana-Champaign
  - GPU cluster (AC cluster): 32 nodes with one S1070 (4 GPUs per node)

CUDA SDK OceanFFT kernels

CUDA Linpack (4 processes, trace)
MPI communication (yellow); CUDA memory transfer (white)

NAMD Performance Profile

Call for “Extreme” Performance Engineering
- Strategy to respond to technology changes and disruptions
- Strategy to carry forward performance expertise and knowledge
- Built on a robust, integrated performance measurement infrastructure
- Model-oriented with knowledge-based reasoning
- Community-driven knowledge engineering
- Automated data / decision analytics
- Requires interactions with all SW stack components

Model-Oriented Global Optimization (MOGO)
- Empirical performance data evaluated with respect to performance expectations at various levels of abstraction
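The evaluation step above can be sketched as comparing measured times against an analytical expectation and flagging large deviations. Everything concrete here is an assumption made for illustration: the linear cost model, its constants, and the 1.5x tolerance are not part of MOGO itself.

```python
def expected_time(n, t_per_item=1e-9, overhead=1e-4):
    # Hypothetical analytical model: linear per-item cost plus fixed overhead.
    return overhead + n * t_per_item

def evaluate(measurements, tolerance=1.5):
    """For each (name, problem_size, measured_seconds) record, compute
    the measured/expected ratio and flag ratios above the tolerance."""
    findings = []
    for name, n, measured in measurements:
        ratio = measured / expected_time(n)
        findings.append((name, ratio, ratio > tolerance))
    return findings

data = [
    ("kernel_a", 100000, 0.00012),  # close to the model's expectation
    ("kernel_b", 100000, 0.00090),  # well above expectation
]
for name, ratio, flagged in evaluate(data):
    print(f"{name}: {ratio:.2f}x expected" + ("  <-- investigate" if flagged else ""))
```

In a model-oriented workflow the same comparison would be repeated at multiple levels of abstraction (kernel, node, whole application), with the flagged entries driving further analysis.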