Clusters of Computational Accelerators

Clusters of Computational Accelerators
Jan Prins, UNC-Chapel Hill

Topics
- Similarity of accelerator architectures
- Proof-of-concept kernels for high-performance applications
- New application areas for accelerator architectures

Accelerator architectures
- Existing commodity accelerators
  - Sony/Toshiba/IBM Cell BE
  - Nvidia G80 GPU with the Compute Unified Device Architecture (CUDA)
  - ATI R600 (almost)
- Related developments
  - Intel demonstrates an 80-core TFLOP chip; multicore projects
  - March of progress: next-generation GPUs
  - Roadrunner to be based on the 2nd-generation Cell

Cell BE and Nvidia G80 (GeForce 8800 GTX)
- Similarities: 8 cores, local store, vector/SIMD execution (width 4 vs. 16), high-speed device memory
- Differences: Cell has an integrated PPC core and the Element Interconnect Bus (EIB); the G80 has local caching and extensive multithreading

Programming for the memory hierarchy
[Diagram: a simple uniprocessor memory hierarchy (UMH) — ALU, registers, cache, local memory — beside a simple parallel memory hierarchy (PMH) with a global address space; edge length indicates latency, thickness indicates bandwidth, aspect ratio indicates the size of transfers]
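
The latency/bandwidth/transfer-size picture above motivates moving data in blocks and maximizing reuse at each level. A minimal, hypothetical pure-Python sketch (the tile size `B` and function name are illustrative, not from the talk) of blocked matrix multiplication shows the idea: each tile is fetched once and reused many times before the next transfer.

```python
# Blocked (tiled) matrix multiply: work proceeds tile by tile, so each
# B x B tile of the operands is fetched once and reused B times --
# fewer, larger transfers (bandwidth) and high reuse per transfer
# (locality), mirroring programming for a memory hierarchy.

def blocked_matmul(A, B_mat, n, B=2):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, B):                  # tile row of C
        for jj in range(0, n, B):              # tile column of C
            for kk in range(0, n, B):          # staged tiles of A and B_mat
                # In an accelerator, the B x B tiles touched here would
                # sit in the local store / shared memory.
                for i in range(ii, min(ii + B, n)):
                    for j in range(jj, min(jj + B, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + B, n)):
                            s += A[i][k] * B_mat[k][j]
                        C[i][j] = s
    return C
```

The same loop nest computes the same result for any tile size; only the transfer pattern changes, which is exactly the knob the hierarchy exposes.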

Accelerator memory hierarchy
[Diagram: device memory at the root, feeding multiple local stores, each feeding vector elements; parallelism increases down the hierarchy — multiprocessing across local stores, multithreading within each, vector/SIMD at the leaves]

Programming accelerators
- Package the parallelism inherent in the problem to provide the concurrency and parallel slack needed at every level of the PMH
- Serialize where needed to reach an appropriate level of reuse
- Programming models with an explicit notion of locality: CUDA, UPC
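
A hedged sketch of what "parallel slack" means in a CUDA-like model: expose far more logical threads (blocks × threads per block) than there are physical processors, so the hardware can swap work in to hide memory latency. The Python below only simulates the index decomposition; the names `launch`, `saxpy`, and `threads_per_block` are illustrative, not an actual API.

```python
# Simulate a CUDA-style decomposition: n logical work items are packaged
# into blocks of threads; the number of blocks (parallel slack) typically
# far exceeds the number of physical multiprocessors.

def launch(n, threads_per_block, kernel, *args):
    """Run kernel(global_index, *args) for every in-range logical thread."""
    grid_dim = (n + threads_per_block - 1) // threads_per_block  # ceil div
    for block_id in range(grid_dim):               # hardware schedules blocks
        for thread_id in range(threads_per_block):
            i = block_id * threads_per_block + thread_id  # global index
            if i < n:                              # guard the ragged last block
                kernel(i, *args)
    return grid_dim

def saxpy(i, a, x, y, out):
    out[i] = a * x[i] + y[i]

x = [1.0] * 10
y = [2.0] * 10
out = [0.0] * 10
blocks = launch(10, 4, saxpy, 2.0, x, y, out)
# 10 items in blocks of 4 -> 3 blocks, the last one partially full
```

The guard on the last block is the standard way to serialize the ragged remainder while keeping every other block fully parallel.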

Clusters of Accelerators

Scale        PMH level               Peak perf   Cost
Rack         Global address space    20 TF       $250K
Node         Local                   400 GF      $4K
CPU          L2/L3; core: L1
Accelerator  Device; SIMD: vector    200 GF      $1K

Proof-of-concept kernels
- Demonstrating performance of accelerator clusters
  - The challenge is toward the bottom of the parallel memory hierarchy
  - Proof-of-concept kernels can establish viability and scaling
- Example: n-body kernels demonstrated to achieve strong performance on Cell and G80
- Consequence: Folding@home clients developed for the PlayStation and for PCs with high-end ATI GPUs; full GROMACS acceleration on Cell; NAMD acceleration on G80 underway
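
As a concrete (hypothetical, pure-Python) illustration of the kind of n-body kernel cited above: an all-pairs gravitational acceleration update with softening. On an accelerator, the inner loop would run over a tile of bodies staged in the local store, which is exactly why this kernel maps so well to Cell and G80. The function name and softening constant are this sketch's choices, not the talk's.

```python
import math

# All-pairs n-body acceleration kernel (O(n^2)): a classic
# proof-of-concept for accelerators, since each body's position is
# reused against every other body once staged in fast local memory.

def accelerations(pos, mass, soft=1e-9):
    n = len(pos)
    acc = []
    for i in range(n):
        ax = ay = az = 0.0
        for j in range(n):
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            dz = pos[j][2] - pos[i][2]
            # Softening keeps the i == j term finite; its dx,dy,dz are
            # zero, so it contributes nothing to the sum.
            r2 = dx * dx + dy * dy + dz * dz + soft
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            ax += mass[j] * dx * inv_r3
            ay += mass[j] * dy * inv_r3
            az += mass[j] * dz * inv_r3
        acc.append([ax, ay, az])
    return acc
```

For two unit masses a unit distance apart, each body is accelerated toward the other with unit magnitude (in G = 1 units), which gives a quick correctness check before tuning for the hierarchy.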

New application domains
- Database and data-mining operations
- Stream mining

Stream mining applications
- Sampling
- Aggregation
- Summarization
- Clustering: dimensionality reduction (PCA, SVD), subspace clustering
- Classification
- Anomaly detection
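
For the sampling step, one standard building block (not specific to this talk) is reservoir sampling, which maintains a uniform fixed-size sample of an unbounded stream in bounded memory; a minimal Python sketch of Algorithm R:

```python
import random

# Reservoir sampling (Algorithm R): keep a uniform random sample of
# size k from a stream of unknown length using O(k) memory -- a natural
# fit for the limited-storage constraint of stream mining.

def reservoir_sample(stream, k, rng=random):
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(1, n)       # uniform in 1..n inclusive
            if j <= k:                  # replace with probability k/n
                reservoir[j - 1] = item
    return reservoir
```

The same one-pass, bounded-memory shape recurs in the aggregation and summarization items above, which is what makes the whole list amenable to accelerator-style streaming execution.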

Challenges
- Continuous data flow
- Limited storage space
- Limited communication bandwidth through the hierarchy
- Detecting and modeling changes
- Visualization

Conclusions
- Techniques for effectively exploiting accelerator clusters are relatively independent of the particular choice of accelerator
- Application demonstrations can follow a spiral development model focused on the implementation of key kernels
- Data mining and stream mining are important application areas that may be well served by accelerator architectures