Seven Minute Madness: Heterogeneous Computing Dr. Jason D. Bakos.

Slides:



Advertisements
Similar presentations
Heterogeneous Computing at USC Dept. of Computer Science and Engineering University of South Carolina Dr. Jason D. Bakos Assistant Professor Heterogeneous.
Advertisements

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
Lecture 6: Multicore Systems
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
SE263 Video Analytics Course Project Initial Report Presented by M. Aravind Krishnan, SERC, IISc X. Mei and H. Ling, ICCV’09.
GPU System Architecture Alan Gray EPCC The University of Edinburgh.
GPGPU Introduction Alan Gray EPCC The University of Edinburgh.
Appendix A — 1 FIGURE A.2.2 Contemporary PCs with Intel and AMD CPUs. See Chapter 6 for an explanation of the components and interconnects in this figure.
Heterogeneous Computing at USC Dept. of Computer Science and Engineering University of South Carolina Dr. Jason D. Bakos Assistant Professor Heterogeneous.
A Sparse Matrix Personality for the Convey HC-1 Dept. of Computer Science and Engineering University of South Carolina Krishna K Nagar, Jason D. Bakos.
Heterogeneous Computing: New Directions for Efficient and Scalable High-Performance Computing Dr. Jason D. Bakos.
Computes the partial dot products for only the diagonal and upper triangle of the input matrix. The vector computed by this architecture is added to the.
Heterogeneous Computing: New Directions for Efficient and Scalable High-Performance Computing CSCE 791 Dr. Jason D. Bakos.
Heterogeneous Computing at USC Dept. of Computer Science and Engineering University of South Carolina Dr. Jason D. Bakos Assistant Professor Heterogeneous.
GPUs. An enlarging peak performance advantage: –Calculation: 1 TFLOPS vs. 100 GFLOPS –Memory Bandwidth: GB/s vs GB/s –GPU in every PC and.
Seven Minute Madness: Special-Purpose Parallel Architectures Dr. Jason D. Bakos.
FPGA vs. GPU for Sparse Matrix Vector Multiply Yan Zhang, Yasser H. Shalabi, Rishabh Jain, Krishna K. Nagar, Jason D. Bakos Dept. of Computer Science and.
Weekly Report Start learning GPU Ph.D. Student: Leo Lee date: Sep. 18, 2009.
FPGA Acceleration of Phylogeny Reconstruction for Whole Genome Data Jason D. Bakos Panormitis E. Elenis Jijun Tang Dept. of Computer Science and Engineering.
The PTX GPU Assembly Simulator and Interpreter N.M. Stiffler Zheming Jin Ibrahim Savran.
Seven Minute Madness: Reconfigurable Computing Dr. Jason D. Bakos.
Seven Minute Madness: Reconfigurable Computing Dr. Jason D. Bakos.
Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram.
Heterogeneous Computing Dr. Jason D. Bakos. Heterogeneous Computing 2 “Traditional” Parallel/Multi-Processing Large-scale parallel platforms: –Individual.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
Presenter MaxAcademy Lecture Series – V1.0, September 2011 Introduction and Motivation.
Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.
Slide 1/8 Performance Debugging for Highly Parallel Accelerator Architectures Saurabh Bagchi ECE & CS, Purdue University Joint work with: Tsungtai Yeh,
Motivation “Every three minutes a woman is diagnosed with Breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006) Explore the use.
GPU Programming with CUDA – Accelerated Architectures Mike Griffiths
1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Dec 31, 2012 Emergence of GPU systems and clusters for general purpose High Performance Computing.
COLLABORATIVE EXECUTION ENVIRONMENT FOR HETEROGENEOUS PARALLEL SYSTEMS Aleksandar Ili´c, Leonel Sousa 2010 IEEE International Symposium on Parallel & Distributed.
Shared memory systems. What is a shared memory system Single memory space accessible to the programmer Processor communicate through the network to the.
BY: ALI AJORIAN ISFAHAN UNIVERSITY OF TECHNOLOGY 2012 GPU Architecture 1.
Heterogeneous Computing: New Directions for Efficient and Scalable High-Performance Computing CSCE 791 Dr. Jason D. Bakos.
Trends in the Infrastructure of Computing: Processing, Storage, Bandwidth CSCE 190: Computing in the Modern World Dr. Jason D. Bakos.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
PERFORMANCE ANALYSIS cont. End-to-End Speedup  Execution time includes communication costs between FPGA and host machine  FPGA consistently outperforms.
Taking the Complexity out of Cluster Computing Vendor Update HPC User Forum Arend Dittmer Director Product Management HPC April,
Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.
Multiprocessing. Going Multi-core Helps Energy Efficiency William Holt, HOT Chips 2005 Adapted from UC Berkeley "The Beauty and Joy of Computing"
SJSU SPRING 2011 PARALLEL COMPUTING Parallel Computing CS 147: Computer Architecture Instructor: Professor Sin-Min Lee Spring 2011 By: Alice Cotti.
NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS Spring 2011.
VTU – IISc Workshop Compiler, Architecture and HPC Research in Heterogeneous Multi-Core Era R. Govindarajan CSA & SERC, IISc
GPU Architecture and Programming
GPUs: Overview of Architecture and Programming Options Lee Barford firstname dot lastname at gmail dot com.
By Dirk Hekhuis Advisors Dr. Greg Wolffe Dr. Christian Trefftz.
Trends in the Infrastructure of Computing
Program Optimizations and Recent Trends in Heterogeneous Parallel Computing Dušan Gajić, University of Niš Program Optimizations and Recent Trends in Heterogeneous.
Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.
Understanding Parallel Computers Parallel Processing EE 613.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
Trends in the Infrastructure of Computing: Processing, Storage, Bandwidth CSCE 190: Computing in the Modern World Dr. Jason D. Bakos.
My Coordinates Office EM G.27 contact time:
Parallel Computers Today Oak Ridge / Cray Jaguar > 1.75 PFLOPS Two Nvidia 8800 GPUs > 1 TFLOPS Intel 80- core chip > 1 TFLOPS  TFLOPS = floating.
1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013 Branching.ppt Control Flow These notes will introduce scheduling control-flow.
Processor Level Parallelism 2. How We Got Here Developments in PC CPUs.
“Processors” issues for LQCD January 2009 André Seznec IRISA/INRIA.
Emergence of GPU systems for general purpose high performance computing ITCS 4145/5145 © Barry Wilkinson GPUIntro.ppt Oct 30, 2014.
Fermi National Accelerator Laboratory & Thomas Jefferson National Accelerator Facility SciDAC LQCD Software The Department of Energy (DOE) Office of Science.
Seven Minute Madness: Heterogeneous Computing Dr. Jason D. Bakos.
Multi-Core CPUs Matt Kuehn. Roadmap ► Intel vs AMD ► Early multi-core processors ► Threads vs Physical Cores ► Multithreading and Multi-core processing.
Heterogeneous Processing KYLE ADAMSKI. Overview What is heterogeneous processing? Why it is necessary Issues with heterogeneity CPU’s vs. GPU’s Heterogeneous.
Parallel Computing Lecture
CSCE 190: Computing in the Modern World Dr. Jason D. Bakos
Embedded Computer Architecture 5SIA0 Overview
Graphics Processing Unit
Multicore and GPU Programming
Multicore and GPU Programming
Presentation transcript:

Seven Minute Madness: Heterogeneous Computing Dr. Jason D. Bakos

Computer Architecture Trends Goal: extract maximum performance from serial code –Instruction re-ordering –Speculative execution –Multiple issue –Branch prediction –Large, multi-level caches Onus is on processor and compiler to extract performance from code Results in large control unit and cache Heterogeneous Computing 2

Computer Architecture Trends Multicore architecture: –Allows programmer to extract performance Heterogeneous Computing 3 Memory CPU

Computer Architecture Trends Special purpose processor –Small control and cache –Devote maximum area to ALUs Shifts onus to programmer to get performance Requires careful programming—only practical for expensive computations Examples: GPU, Cell, FPGA Heterogeneous Computing 4

Heterogeneous Computing 5 Heterogeneous Computing CPU X58 Host Memory Co- processor QPIPCIe On board Memory add-in cardhost General purpose CPU + special-purpose processors in one system ~25 GB/s ~8 GB/s (x16) ????? ~150 GB/s for GeForce 260

Heterogeneous Computing Heterogeneous Computing 6 initialization 0.5% of run time “hot” loop 99% of run time clean up 0.5% of run time 49% of code 1% of code co-processor Kernel speedup Application speedup Execution time hours hours hours hours hours Example: –Application requires a week of CPU time –Offload computation consumes 99% of execution time

NVIDIA GT200 GPU Architecture Heterogeneous Computing 7 30 “streaming multiprocesors” Simple cores: –In-order execution, no: branch prediction, spec. execution, multiple issue, context switches –No cache (just 16K programmer- managed on-chip memory) –One active instruction at a time Can execute 32 threads in parallel per SM if no branch divergence

IBM Cell/B.E. Architecture Heterogeneous Computing 8 1 PowerPC, 8 small processors Programmer must manually manage 256K memory and threads invocation on each SPE Each SPE includes a vector unit like the one on current Intel processors –128 bits wide

Heterogeneous Computing 9 Heterogeneous Computing now Mainstream: IBM Roadrunner Los Alamos, fastest computer in the world 6,480 AMD Opteron (dual core) CPUs 12,960 PowerXCell 8i GPUs Each blade contains 2 Operons and 4 Cells 296 racks

Heterogeneous Computing 10 Heterogeneous Computing with FPGAs

Heterogeneous Computing 11 Programming FPGAs

Our Projects FPGA-based co-processors for computational biology Heterogeneous Computing 12 1.Tiffany M. Mintz, Jason D. Bakos, "A Cluster-on-a-Chip Architecture for High-Throughput Phylogeny Search," IEEE Trans. on Parallel and Distributed Systems, submitted. 2.Stephanie Zierke, Jason D. Bakos, "FPGA Acceleration of Bayesian Phylogenetic Inference," BMC Bioinformatics, submitted. 3.Jason D. Bakos, Panormitis E. Elenis, "A Special-Purpose Architecture for Solving the Breakpoint Median Problem," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 16, No. 12, Dec Jason D. Bakos, Panormitis E. Elenis, Jijun Tang, "FPGA Acceleration of Phylogeny Reconstruction for Whole Genome Data," 7th IEEE International Symposium on Bioinformatics & Bioengineering (BIBE'07), Boston, MA, Oct , Jason D. Bakos, “FPGA Acceleration of Gene Rearrangement Analysis,” 15th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM'07), April 23-25, 2007.

Our Projects FPGA-based co-processors for linear algebra Heterogeneous Computing 13 1.Krishna.K. Nagar, Jason D. Bakos, "A High-Performance Double Precision Accumulator," IEEE International Conference on Field-Programmable Technology (IC-FPT'09), Dec. 9-11, Yan Zhang, Yasser Shalabi, Rishabh Jain, Krishna K. Nagar, Jason D. Bakos, "FPGA vs. GPU for Sparse Matrix Vector Multiply," IEEE International Conference on Field-Programmable Technology (IC-FPT'09), Dec. 9-11, Krishna K. Nagar, Yan Zhang, Jason D. Bakos, "An Integrated Reduction Technique for a Double Precision Accumulator," Proc. Third International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA'09), held in conjunction with Supercomputing 2009 (SC'09), Nov. 15, Jason D. Bakos, Krishna K. Nagar, "Exploiting Matrix Symmetry to Improve FPGA-Accelerated Conjugate Gradient," 17th Annual IEEE International Symposium on Field Programmable Custom Computing Machines (FCCM'09), April 5-8, 2009.

Our Projects Heterogeneous Computing 14 Multi-FPGA System Architectures 1.Jason D. Bakos, Charles L. Cathey, E. Allen Michalski, "Predictive Load Balancing for Interconnected FPGAs," 16th International Conference on Field Programmable Logic and Applications (FPL'06), Madrid, Spain, August 28-30, Charles L. Cathey, Jason D. Bakos, Duncan A. Buell, "A Reconfigurable Distributed Computing Fabric Exploiting Multilevel Parallelism," 14th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM'06), April 24-26, GPU Simulation 1.Patrick A. Moran, Jason D. Bakos, "A PTX Simulator for Performance Tuning CUDA Code," IEEE Trans. on Parallel and Distributed Systems, submitted.

Most Recent Projects Task Partitioning for Heterogeneous Computing Heterogeneous Computing 15 GPU and FPGA Acceleration of Data Mining

Contact Information Jason D. Bakos –Office: 3A52 – – Heterogeneous and Reconfigurable Computing (HeRC) Lab: –Lab: 3D15 – Course –Spring 2010: CSCE 212, section 001 (TTH 11:00) Heterogeneous Computing 16