High Performance Computing On Laptops With Multicores & GPUs


High Performance Computing On Laptops With Multicores & GPUs
Sushil K. Prasad, Computer Science, sprasad@gsu.edu

About me
Research area: parallel and distributed algorithms and systems, over multicores, GPUs, clusters, sensors, handhelds, web services, and more.
Lab: Distributed and Mobile Systems (DiMoS) at the Ga. Tech campus; 5 PhD students, 2 M.S. students.
IEEE TCPP Chair (elected).
Two NSF grants, on Distributed Algorithms and on High Performance Cloud Computing; currently looking for PhD/MS/undergraduate students.

Multicore & GPU Chips Inside a Laptop
100s of processors: big machines and clusters are no longer the only platforms for high-end computing.
For the first time in history, almost anyone can own a parallel computer: your laptop has a dual-core CPU plus a many-core GPU (240 cores in an NVIDIA GTX 280).
Cost: about $500. In 2000, we spent $300K to purchase a 24-CPU SGI machine. For $40K, we just bought a cluster with 10 compute nodes totaling 88 cores, plus four GPUs with 240 cores each: more than 1,000 cores in all!

GPUs vs. Multicores
Combined power exceeds 180 GFLOPS.

Intel Core 2 Duo Multicore
Difficult to parallelize.
The memory hierarchy is a barrier: 1 cycle to the core (about 1/3 ns), 3 cycles to L1 cache, 14 cycles to L2, and 250 cycles to RAM.
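To make those latencies concrete, here is a minimal host-side pointer-chasing sketch. It assumes a typical laptop cache hierarchy; the working-set sizes are guesses, and the slide's cycle counts are the instructor's figures, not something this code reproduces exactly.

```cpp
// Hedged sketch (host-only C++): pointer chasing through a random cycle so
// every load depends on the previous one, exposing the L1 / L2 / RAM gap.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

static double nsPerLoad(size_t slots) {
    std::vector<size_t> next(slots);
    std::iota(next.begin(), next.end(), size_t{0});
    std::mt19937_64 rng{42};
    for (size_t i = slots - 1; i > 0; --i) {              // Sattolo's algorithm:
        std::uniform_int_distribution<size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);              // one cycle through all slots
    }
    const size_t steps = 10000000;
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t s = 0; s < steps; ++s) idx = next[idx];   // serial dependent loads
    auto t1 = std::chrono::steady_clock::now();
    if (idx == slots) std::puts("");                      // defeat dead-code elimination
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    // Working sets intended to land roughly in L1, in L2, and in RAM.
    for (size_t bytes : {size_t{32} << 10, size_t{1} << 20, size_t{256} << 20})
        std::printf("%10zu B working set: %6.1f ns/load\n",
                    bytes, nsPerLoad(bytes / sizeof(size_t)));
}
```

The single chain of dependent loads is the point: with no independent work to overlap, each load pays the full latency of whichever cache level the working set lands in.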

GPU: Graphics Processing Unit
NVIDIA GTX 280: 240 cores.
Extreme memory hierarchy: registers, local memory, shared memory per group of 8 cores, off-chip global memory, and a bottleneck bus to the CPU.
Good research needed; this is a hot area.
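A minimal CUDA sketch of how that hierarchy is used in practice: a block-level sum reduction in which each value sits briefly in a register, partial sums live in on-chip shared memory, and only one load and one store per thread/block touch slow global memory. Sizes are illustrative, and cudaMallocManaged is a modern convenience; the 2008-era GPUs named on these slides would use cudaMalloc plus explicit cudaMemcpy instead.

```cuda
// Hedged sketch: block-level sum reduction. "val" sits in a register,
// "buf" in on-chip shared memory, "in"/"out" in off-chip global memory.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float buf[256];                      // shared by the thread block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    float val = (i < n) ? in[i] : 0.0f;             // one global load per thread
    buf[tid] = val;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];         // one global store per block
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    blockSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();
    float sum = 0.0f;
    for (int b = 0; b < blocks; ++b) sum += out[b];
    std::printf("sum = %.0f (expected %d)\n", sum, n);
    cudaFree(in);
    cudaFree(out);
}
```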

Smith-Waterman Sequence Alignment vs. FASTA and BLAST (NVIDIA 8800 GTX)
Manavski and Valle (2008): Smith-Waterman in CUDA, running on one and two GPUs, vs. BLAST and SSEARCH. Substitution matrix: BLOSUM50. Gap-open penalty: 10. Gap-extension penalty: 2. Database: SwissProt (Dec. 2006; 250,296 proteins and 91,694,534 amino acids).
* Smith-Waterman in CUDA on a single NVIDIA GeForce 8800 GTX.
** Smith-Waterman in CUDA on two NVIDIA GeForce 8800 GTX cards.
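Smith-Waterman maps well to GPUs because of its dynamic-programming recurrence: with a linear gap penalty g, H(i,j) = max(0, H(i-1,j-1) + s(a_i,b_j), H(i-1,j) - g, H(i,j-1) - g), and all cells on one anti-diagonal are mutually independent. The sketch below is a deliberate simplification, with linear gaps and toy match/mismatch scores rather than the BLOSUM50 affine-gap setup benchmarked above, launching one kernel per anti-diagonal.

```cuda
// Hedged sketch: wavefront Smith-Waterman, LINEAR gap penalty, toy scores.
// Diagonal d depends only on diagonals d-1 and d-2, so three buffers rotate.
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define MATCH 2
#define MISMATCH (-1)
#define GAP 1

__global__ void swDiag(const char* a, const char* b, int m, int n, int d,
                       const int* p2, const int* p1, int* cur, int* best) {
    int i = max(1, d - n) + blockIdx.x * blockDim.x + threadIdx.x;
    if (i > min(m, d - 1)) return;
    int j = d - i;                                  // cell (i, j), i + j == d
    int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
    int h = max(0, p2[i - 1] + s);                  // from (i-1, j-1)
    h = max(h, p1[i - 1] - GAP);                    // from (i-1, j)
    h = max(h, p1[i] - GAP);                        // from (i, j-1)
    cur[i] = h;
    atomicMax(best, h);                             // track best local score
}

int main() {
    const char *qa = "MATWL", *qb = "MATPL";        // toy query and subject
    int m = (int)std::strlen(qa), n = (int)std::strlen(qb);
    char *a, *b; int *diag[3], *best;
    cudaMallocManaged(&a, m); cudaMallocManaged(&b, n);
    std::memcpy(a, qa, m); std::memcpy(b, qb, n);
    for (int k = 0; k < 3; ++k) {                   // three rotating diagonals
        cudaMallocManaged(&diag[k], (m + 1) * sizeof(int));
        std::memset(diag[k], 0, (m + 1) * sizeof(int)); // row 0 / col 0 stay 0
    }
    cudaMallocManaged(&best, sizeof(int)); *best = 0;
    for (int d = 2; d <= m + n; ++d) {
        int cells = std::min(m, d - 1) - std::max(1, d - n) + 1;
        swDiag<<<(cells + 255) / 256, 256>>>(a, b, m, n, d,
            diag[d % 3], diag[(d + 1) % 3], diag[(d + 2) % 3], best);
        cudaDeviceSynchronize();
    }
    std::printf("best local alignment score = %d\n", *best);
}
```

Production codes like the one benchmarked above go much further (per-thread query profiles, packed substitution matrices, affine gaps), but the anti-diagonal wavefront is the core source of parallelism.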

Parallel Data Structures: Priority Queues
Applications: large-scale event simulation, immune system simulation, VLSI logic simulation, branch and bound, task scheduling.
Challenge: fine-grained systems.
Students: Dinesh Agarwal, Nick Mancuso
(figure: an example binary heap)

Parallel Priority Queues on Multicore
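For context, a hedged baseline sketch, not the lab's data structure: a binary min-heap behind one coarse lock. Every thread serializes on the mutex for each operation, which is exactly the bottleneck that parallel priority-queue designs aim to remove.

```cpp
// Hedged baseline sketch: coarse-locked binary min-heap. Correct under
// concurrency, but all pushes and pops serialize on a single mutex.
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

class LockedMinHeap {
    std::vector<int> h;                               // h[0] is the minimum
    std::mutex mtx;
public:
    void push(int key) {
        std::lock_guard<std::mutex> g(mtx);
        h.push_back(key);
        for (size_t i = h.size() - 1; i > 0 && h[(i - 1) / 2] > h[i]; i = (i - 1) / 2)
            std::swap(h[(i - 1) / 2], h[i]);          // sift up
    }
    std::optional<int> popMin() {
        std::lock_guard<std::mutex> g(mtx);
        if (h.empty()) return std::nullopt;
        int top = h[0];
        h[0] = h.back();
        h.pop_back();
        for (size_t i = 0; 2 * i + 1 < h.size();) {   // sift down
            size_t c = 2 * i + 1;
            if (c + 1 < h.size() && h[c + 1] < h[c]) ++c;   // smaller child
            if (h[i] <= h[c]) break;
            std::swap(h[i], h[c]);
            i = c;
        }
        return top;
    }
};
```

Parallel heap designs, by contrast, aim to let p processors insert or delete the p smallest items in bulk per step; the coarse lock above is the strawman such designs are measured against.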

Legacy Code to GPUs (Student: Chad Christopher)

Distributed Algorithms for the Lifetime of Wireless Sensor Networks (Student: Akshaye Dhawan)

NP-Hard Distributed Problems in Networks (NSF grant)
Minimum vertex/target cover.
Minimum triangle packing.
Optimal target tracking in mobile sensor networks.
Minimum channel assignment in mobile ad hoc networks.
Students: John Daigle, Thamer Sulaiman
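As a concrete, hedged illustration (the classic textbook method, not the project's own algorithm): minimum vertex cover admits a simple 2-approximation by taking both endpoints of any still-uncovered edge; distributed variants run the same maximal-matching idea over message passing.

```cpp
// Hedged illustration: maximal-matching 2-approximation for Minimum Vertex
// Cover, the first problem listed above. Not the project's algorithm.
#include <cstdio>
#include <set>
#include <utility>
#include <vector>

std::set<int> vertexCover2Approx(const std::vector<std::pair<int, int>>& edges) {
    std::set<int> cover;
    for (const auto& e : edges)
        if (!cover.count(e.first) && !cover.count(e.second)) {
            cover.insert(e.first);                    // an uncovered edge: take
            cover.insert(e.second);                   // BOTH of its endpoints
        }
    return cover;                                     // at most 2x the optimum
}

int main() {
    std::vector<std::pair<int, int>> edges{{0, 1}, {1, 2}, {2, 3}, {3, 0}, {1, 3}};
    std::printf("cover size = %zu\n", vertexCover2Approx(edges).size());
}
```

The 2x bound holds because the chosen edges form a matching, and any cover must include at least one endpoint of each matched edge.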

Middleware for Mobile Ad-hoc Applications
(figure: architecture diagram; each device runs applications over a listener and deviceware that process requests, the mobile support station adds groupware and a directory, and interaction proceeds bottom-up: 1. register, 2. lookup, 3. p2p communication among applications)

BondFlow: Distributed Workflow over Web Services (Student: Janaka Balasooriya)
Modules: a web service interface module, a proxy object generator module, a workflow configuration module, and a workflow execution module.
(figure: architecture; the web service interface module's WS locator and WSDL parser look up Web services, including mobile Web services, in a UDDI registry over SOAP; the proxy object generator feeds the workflow configuration module; the workflow execution module runs on the Web Bond runtime over SOAP/SyD on the JVM)

A Posterior-Uncertainty P2P Search Based on Bayesian Decisions and Value of Information (VOI) (Student: Rasanjalee)
Uncertainty-based peer selection: sending or forwarding a query at each node along the query path is a series of decision-making steps based on incomplete data. Each decision step queries the node that will most reduce the uncertainty of the current belief.
With a priori uncertainty U1 and a posterior uncertainty U2, U1 - U2 = information: the reduction in uncertainty at each decision step.
(figure: experimental results; the current belief evolves over decision steps 1 through n)
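A tiny sketch of the U1 - U2 rule, using Shannon entropy as the uncertainty measure U. The per-peer posterior beliefs here are invented inputs rather than a real P2P model; a deployed system would derive them from its Bayesian model, so this only illustrates how a node would rank its neighbors.

```cpp
// Hedged sketch of the slide's "U1 - U2 = information" rule with Shannon
// entropy as U. Posterior beliefs per candidate peer are made-up inputs.
#include <cmath>
#include <cstdio>
#include <vector>

double entropy(const std::vector<double>& p) {        // U = -sum p log2 p
    double u = 0.0;
    for (double x : p)
        if (x > 0.0) u -= x * std::log2(x);
    return u;
}

int main() {
    std::vector<double> belief{0.25, 0.25, 0.25, 0.25};   // a priori belief, U1
    std::vector<std::vector<double>> posterior{           // belief after querying peer k
        {0.40, 0.30, 0.20, 0.10},
        {0.70, 0.10, 0.10, 0.10},
        {0.25, 0.25, 0.30, 0.20},
    };
    double u1 = entropy(belief);
    size_t bestPeer = 0;
    double bestGain = -1.0;
    for (size_t k = 0; k < posterior.size(); ++k) {
        double gain = u1 - entropy(posterior[k]);     // information = U1 - U2
        std::printf("peer %zu: expected gain %.3f bits\n", k, gain);
        if (gain > bestGain) { bestGain = gain; bestPeer = k; }
    }
    std::printf("query peer %zu next\n", bestPeer);
}
```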

Middleware on Distributed Smart Cameras (Student: Jayampathi Sampat)
Middleware on DSC networks provides a high-level programming interface for applications, simplifies the development of distributed applications on DSC networks, and provides networking functionality as part of the middleware.
(figure: a CMUcam3)
