Data Intensive Computing on Heterogeneous Platforms
Norm Rubin, Fellow, AMD Graphics Products Group
HPEC 2009

What might we see in future platforms?
- Multi-core implies reprogramming, so all kinds of new architectures are possible.
- The traditional view of more performance per year, driven by clock-speed increases, is over.
- New approaches will change our view of computing platforms.
- Compute is about to change radically.

The drivers for compute
1. New, more natural user interfaces
2. Search (e.g., finding things in images) - massive data
3. Search over time-changing data (e.g., live video) - massive data
4. Machine learning (starting with simple models) - massive data
5. Standard compute tasks
Data intensive covers 2, 3, and 4.

Massive data and machine learning
- Data-driven models become tractable and usable.
- Less need for analytical models.
- Less need for heuristics.
- Real-time connectivity enables continuous model refinement:
  - A poor model is an acceptable starting point.
  - Classification accuracy improves over time.

Simple models
Google demonstrated the value of applying massive amounts of computation to language translation in the 2005 NIST machine translation competition, winning all four categories (Arabic to English and Chinese to English).
- A purely statistical approach, trained on multilingual United Nations documents comprising over 200 billion words, as well as English-language documents comprising over one trillion words.
- No one in their machine translation group knew either Chinese or Arabic.
"Google tops translation ranking," Nov. 6, 2006.

How technology may soon "read" your mind
- Establish the correspondence between a simple cognitive state (such as the thought of a hammer) and the underlying brain activity.
- Use machine learning techniques to identify the neural pattern of brain activity underlying various thought processes.
- An MRI scanner generates the data; a database holds the responses of individuals.
S.V. Shinkareva, R.A. Mason, V.L. Malave, W. Wang, T.M. Mitchell, and M.A. Just. Using fMRI brain activation to identify cognitive states associated with perception of tools and dwellings. PLoS ONE 3(1): e1394, doi:10.1371/journal.pone.0001394, January 2, 2008.

New ways to program graphics
- A domain expert (the programmer), 3 cameras, silhouette extraction, and a large database of human motions.
- Use 3D movement as a programming language.
L. Ren, G. Shakhnarovich, J.K. Hodgins, H. Pfister, and P. Viola. Learning silhouette features for control of human motion. ACM Transactions on Graphics, 24(4), October 2005.

How will GPU/CPU change to meet big data?
Current machine models are two separate devices:
- GPU: a graphics thing, maybe also a data-parallel accelerator; good for lots of data, small programs.
- CPU: a serial processor, maybe capable of task parallelism; good for limited data, big programs.
CPU strength is single-thread performance (a latency machine).
GPU strength is massive-thread performance (a throughput machine).

GPU compared with CPU
GPU:
- Great float performance
- Great bandwidth
- High performance without a fixed ISA: lots of room for innovation without breaking code
- Different to program
- But: limited scratchpad memory in place of a cache
CPU:
- Great control flow
- Lots of existing code
- Locked into a legacy ISA
- Very good at reuse; real caches
- But: limited ability to scale with cores

Some limitations of today's GPU
- General cross-thread communication
- Task switching and task-parallel problems
- Nested parallelism
- Dynamic parallelism
- Very expensive to move data between GPU and CPU
- Programmer-controlled small scratchpad memory

The 3C view of the future (c-cubed?)
We are in a time of architectural change much like the switch from mainframes to microprocessors: abundant content, connections, and compute.
Slashdot: "NASA Probe Blasts 461 Gigabytes of Moon Data Daily"
"On its current space scouting mission, NASA's Lunar Reconnaissance Orbiter (LRO) is using a pumped up communications device to deliver 461 gigabytes of data and images per day, at a rate of up to 100 Mbps. As the first high data rate K-band transmitter to fly on a NASA spacecraft, the 13-inch-long tube, called a Traveling Wave Tube Amplifier, is making it possible for NASA scientists to receive massive amounts of images and data about the moon's surface and environment..."
"It kills me that the moon has better bandwidth than my house."

Software challenge
- Most of the successful parallel applications seem to have dedicated languages (DirectX® 11, MapReduce, Sawzall) for limited domains.
- Small programs can build interesting applications.
- Programmers are not super-experts in the hardware.
- Programs survive machine generations.
Can we replicate this success in other domains?

Can we get performance with high-level languages?
- CUDA and OpenCL are good languages in the hands of experts, but they are still too low level.
- We need high-level programming models that express computations over data.
- One possible model is MapReduce:
  - The map part selects interesting data.
  - The reduce part combines the interesting data.

MapReduce
- Map: for each input, in parallel: if the input is interesting, output a key and a value (values might be compound structures).
- Sort based on the keys.
- Reduce: in parallel, for each key: combine all the values that share that key.
A minimal sketch of this pipeline appears below.
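To make the three stages concrete, here is a minimal, single-machine sketch of the map/sort/reduce pipeline in Python. The names (run_mapreduce, word_count_map, sum_reduce) are illustrative, not from the talk; a real framework would distribute each stage across many processors.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Minimal MapReduce: map each input, sort by key, reduce each key group."""
    # Map: each input may emit zero or more (key, value) pairs.
    pairs = [kv for item in inputs for kv in map_fn(item)]
    # Sort: bring all values for the same key together.
    pairs.sort(key=itemgetter(0))
    # Reduce: combine the values for each key.
    return {key: reduce_fn([v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}

def word_count_map(line):
    # "Interesting" inputs here are words; emit (word, 1) for each one.
    return [(word, 1) for word in line.split()]

def sum_reduce(values):
    return sum(values)

if __name__ == "__main__":
    lines = ["the gpu is a throughput machine",
             "the cpu is a latency machine"]
    print(run_mapreduce(lines, word_count_map, sum_reduce))
    # {'a': 2, 'cpu': 1, 'gpu': 1, 'is': 2, ...}
```

Note how the developer writes only the two small functions at the bottom; the pipeline itself, where all the parallelism would live, comes from the framework.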

Observations on MapReduce
- Developers only write a small part of the program; the rest of the code comes from libraries.
- No race conditions are possible.
- No error reporting (just keep going).
- Can view the program as serial (per input).
- No developer knows the number of processors; not like pthreads.
- The hard part is the reductions, but it appears that only a few are common, so maybe they can be prebuilt.

Notice the nested parallelism in MapReduce
- For each key (in parallel), do a parallel reduction (also parallel).
- One key may have lots of values while a second key has only a few (work stealing? dynamic parallelism?). The sketch below illustrates this two-level structure.
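The nesting can be sketched with thread pools: an outer parallel loop over keys, where each key's value list is itself combined by a parallel tree reduction. This is an illustrative sketch, not the talk's implementation; the load imbalance between the two keys below is exactly the case that work stealing or dynamic parallelism would address.

```python
from concurrent.futures import ThreadPoolExecutor

def tree_reduce(values, combine):
    """Inner parallelism: pairwise combine until one value remains."""
    with ThreadPoolExecutor() as inner:
        while len(values) > 1:
            pairs = list(zip(values[0::2], values[1::2]))
            leftover = [values[-1]] if len(values) % 2 else []
            values = list(inner.map(lambda ab: combine(*ab), pairs)) + leftover
    return values[0]

def reduce_all_keys(grouped, combine):
    """Outer parallelism: one reduction task per key."""
    with ThreadPoolExecutor() as outer:
        futures = {k: outer.submit(tree_reduce, v, combine)
                   for k, v in grouped.items()}
        return {k: f.result() for k, f in futures.items()}

if __name__ == "__main__":
    grouped = {"big_key": list(range(1000)),  # lots of values
               "small_key": [1, 2, 3]}        # only a few
    print(reduce_all_keys(grouped, lambda a, b: a + b))
    # {'big_key': 499500, 'small_key': 6}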

K-means clustering (a representative algorithm)
Given N objects, each with A attributes, and K possible cluster centers:
- Map: for each object, find the distance to each center; classify the object into a cluster based on the minimum distance. The key is the cluster, the value is the object.
- Reduce: for all objects in the same cluster, find the new cluster center.
- Repeat until no object changes clusters.
A sketch of this iteration appears below.
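Here is a minimal NumPy sketch of k-means cast in map/reduce form. The function names are illustrative, and this is not the CUDA/TBB code whose performance is reported later; the demo uses a scaled-down version of the talk's 4-million-object, 2-attribute, 100-cluster problem.

```python
import numpy as np

def kmeans_map(objects, centers):
    """Map: assign each object to its nearest center (key = cluster id)."""
    # Squared distances via ||x||^2 + ||c||^2 - 2 x.c, avoiding an N x K x A temporary.
    d2 = ((objects ** 2).sum(1)[:, None]
          + (centers ** 2).sum(1)[None, :]
          - 2.0 * objects @ centers.T)
    return d2.argmin(axis=1)

def kmeans_reduce(objects, labels, k):
    """Reduce: average each cluster's members to get the new centers.
    (A real implementation must also handle empty clusters.)"""
    return np.array([objects[labels == c].mean(axis=0) for c in range(k)])

def kmeans(objects, centers, max_iters=50):
    for _ in range(max_iters):
        labels = kmeans_map(objects, centers)
        new_centers = kmeans_reduce(objects, labels, len(centers))
        if np.allclose(new_centers, centers):  # assignments have stabilized
            return new_centers, labels
        centers = new_centers
    return centers, labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    objs = rng.random((100_000, 2))                        # objects x attributes
    init = objs[rng.choice(len(objs), 100, replace=False)]  # 100 initial centers
    centers, labels = kmeans(objs, init)
```

The map stage is embarrassingly parallel over objects (a natural fit for the GPU); the reduce stage is a per-cluster reduction, which is the part the talk later suggests splitting between GPU and CPU.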

K-means demo

K-means performance: experimental setup
- CPU: 4-core Core i7 processor
- GPU: 2 NVIDIA 260 (mid-range) GPUs
- GPU programmed in CUDA; CPU programmed in TBB
- (Joint work with Balaji Dhanasekaran, University of Virginia)
AMD has announced a new higher-performance GPU card (the 58xx series), which supports the industry-standard language OpenCL. Performance numbers using OpenCL on AMD hardware will be posted.

Performance
4 million objects, 2 attributes, 100 clusters, 50 iterations.
MapReduce time (sec):
- 2 GPU + 4 CPU: 1.29
- 1 GPU + 4 CPU: 1.71
Measured performance for other sizes shows similar results, and is better for larger numbers of attributes and objects.

K-means
- Do most of the work on the GPU, but do some of the reductions on the CPU if you have enough idle cores.
- Preliminary numbers suggest the GPU can solve this very well.
- Can we develop software that automatically adjusts for this, or do programmers have to write variant algorithms?

Disclaimer and Attribution

DISCLAIMER: The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION: © 2009 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, CrossFireX, PowerPlay and Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.

Questions?