Heterogeneous Computing: New Directions for Efficient and Scalable High-Performance Computing CSCE 791 Dr. Jason D. Bakos.


Minimum Feature Size

Year  Processor       Speed              Transistors    Process
1982  i286            ? MHz              ~134,000       1.5 µm
1986  i386            16 – 40 MHz        ~270,000       1 µm
1989  i486            ? MHz              ~1 million     0.8 µm
1993  Pentium         ? MHz              ~3 million     0.6 µm
1995  Pentium Pro     ? MHz              ~4 million     0.5 µm
1997  Pentium II      ? MHz              ~5 million     0.35 µm
1999  Pentium III     450 – 1400 MHz     ~10 million    0.25 µm
2000  Pentium 4       1.3 – 3.8 GHz      ~50 million    0.18 µm
2005  Pentium D       2 cores/package    ~200 million   0.09 µm
2006  Core 2          2 cores/die        ~300 million   0.065 µm
2008  Core i7         4 cores/die, 8 threads/die        ~800 million   0.045 µm
2010  "Sandy Bridge"  8 cores/die, 16 threads/die??     ??             0.032 µm

Computer Architecture Trends

Multi-core architecture:
– Individual cores are large and heavyweight, designed to extract performance from generalized code
– Programmer utilizes the multiple cores using OpenMP

[Figure: single-chip CPU with an on-chip L2 cache occupying ~50% of the die, connected to off-chip memory]
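A minimal sketch of the OpenMP usage mentioned above (array names, sizes, and the loop body are illustrative assumptions, not from the slides); one pragma is enough to spread a data-parallel loop across the cores:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static float a[N], b[N], c[N];

    /* Initialize inputs (serial). */
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* One pragma asks the runtime to split the iterations across cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);
    return 0;
}
```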

“Traditional” Parallel/Multi-Processing

Large-scale parallel platforms:
– Individual computers connected with a high-speed interconnect

Upper bound for speedup is n, where n = # processors
– How much parallelism is in the program?
– What are the system and network overheads?
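The "how much parallelism" question is usually made precise with Amdahl's law (stated here for reference; it is not written out on the slide): if a fraction $p$ of the run time can be parallelized over $n$ processors,

$$ S(n) = \frac{1}{(1-p) + p/n} \;\le\; \frac{1}{1-p}, $$

so a program that is 95% parallel can never exceed a 20x speedup, no matter how many processors are added.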

Co-Processors

– Special-purpose (not general-purpose) processor
– Accelerates the CPU

NVIDIA GT200 GPU Architecture

240 on-chip processor cores
Simple cores:
– In-order execution; no branch prediction, speculative execution, or multiple issue
– No support for context switches, OS, activation stack, or dynamic memory
– No read/write cache (just 16 KB of programmer-managed on-chip memory)
– Threads must be comprised of identical code and must all behave the same w.r.t. if-statements and loops
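For concreteness, a minimal CUDA kernel in the style used on GT200-class GPUs (a generic sketch; names and sizes are assumptions, not from the slides). Every thread runs the identical code; the `i < n` test diverges only in the final warp, which is harmless, but branches where threads of one warp disagree are serialized:

```cuda
#include <cstdio>

// Each thread handles one element; all threads execute identical code.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // divergence here is benign (tail only)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));
    // x and y are left uninitialized here; a real program would copy data in.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```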

IBM Cell/B.E. Architecture

1 PPE, 8 SPEs
– Programmer must manually manage the 256 KB local memory and thread invocation on each SPE
– Each SPE includes a vector unit like the one on current Intel processors: 128 bits wide
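Since the slide likens the SPE's vector unit to the 128-bit SIMD units on Intel CPUs, here is a small SSE sketch (plain C, illustrative only) that adds four single-precision floats with one instruction:

```c
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* packs 4 floats */
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);
    __m128 c = _mm_add_ps(a, b);                     /* 4 adds at once */

    float out[4];
    _mm_storeu_ps(out, c);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```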

High-Performance Reconfigurable Computing

Heterogeneous computing with reconfigurable logic, i.e., FPGAs

Field-Programmable Gate Array

Programming FPGAs

HC Execution Model

[Diagram: host side with CPU, X58 chipset, and host memory (~25 GB/s); a QPI/PCIe x16 link (~8 GB/s) connects the host to an add-in card holding the co-processor and its on-board memory; the co-processor-to-memory bandwidth is marked "?????" (~100 GB/s for a GeForce 260)]

Heterogeneous Computing

[Figure: run-time profile of a program, with initialization (0.5% of run time), a "hot" loop (99% of run time), and clean-up (0.5% of run time); code-size labels (49%, 1%) show that the hot loop is a tiny fraction of the code. The hot loop is offloaded to the co-processor. The accompanying table of kernel speedup vs. application speedup vs. execution time in hours was lost in the transcript.]

Example:
– Application requires a week of CPU time
– Offloaded computation consumes 99% of execution time
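A worked version of this example, computed here because the slide's table values were lost (the kernel speedup of 50 is an assumed illustrative figure): with 99% of a 168-hour run offloaded,

$$ \text{application speedup} = \frac{1}{0.01 + 0.99/S_{\text{kernel}}}, \qquad S_{\text{kernel}} = 50 \;\Rightarrow\; \text{speedup} \approx 33.6, \quad \text{run time} \approx \frac{168\ \text{h}}{33.6} \approx 5\ \text{hours}. $$

Even an infinitely fast co-processor cannot do better than $1/0.01 = 100\times$, since 1% of the work stays on the CPU.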

Heterogeneous Computing with FPGAs

– Annapolis Micro Systems WILDSTAR 2 PRO
– GiDEL PROCSTAR III

Heterogeneous Computing with FPGAs

– Convey HC-1

Heterogeneous Computing with GPUs

– NVIDIA Tesla S1070

Heterogeneous Computing now Mainstream: IBM Roadrunner

– Los Alamos; second fastest computer in the world
– 6,480 AMD Opteron (dual-core) CPUs
– 12,960 IBM PowerXCell 8i processors
– Each blade contains 2 Opterons and 4 Cells
– 296 racks
– First ever petaflop machine (2008)
– 1.71 petaflops peak (about 1.7 quadrillion floating-point operations per second)
– 2.35 MW (not including cooling)
  – Lake Murray hydroelectric plant produces ~150 MW (peak)
  – Lake Murray coal plant (McMeekin Station) produces ~300 MW (peak)
  – Catawba Nuclear Station near Rock Hill produces 2,258 MW

Our Group: HeRC

Applications work:
– Computational phylogenetics (FPGA/GPU): GRAPPA and MrBayes
– Sparse linear algebra (FPGA/GPU): matrix-vector multiply, double-precision accumulators
– Data mining (FPGA/GPU)
– Logic minimization (GPU)

System architecture:
– Multi-FPGA interconnects

Tools:
– Automatic partitioning (PATHS)
– Micro-architectural simulation for code tuning

Phylogenies

[Figure: phylogenetic tree of the genus Drosophila]

Custom Accelerators for Phylogenetics

[Figure: three alternative unrooted binary trees over the taxa g1–g6]

Unrooted binary tree:
– n leaf vertices
– n - 2 internal vertices (degree 3)
– Tree configurations = (2n - 5) * (2n - 7) * (2n - 9) * … * 3, roughly 213 trillion trees for 16 leaves
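The tree count is the double factorial (2n - 5)!!; a quick check of the 16-leaf figure (illustrative C, not from the slides):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int n = 16;                 /* number of leaves */
    uint64_t trees = 1;
    for (int k = 2 * n - 5; k >= 3; k -= 2)
        trees *= k;             /* (2n-5) * (2n-7) * ... * 3 */
    printf("%d leaves -> %llu unrooted binary trees\n",
           n, (unsigned long long)trees);   /* 213,458,046,676,875 */
    return 0;
}
```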

Our Projects

FPGA-based co-processors for computational biology:
1. Tiffany M. Mintz, Jason D. Bakos, "A Cluster-on-a-Chip Architecture for High-Throughput Phylogeny Search," IEEE Trans. on Parallel and Distributed Systems, in press.
2. Stephanie Zierke, Jason D. Bakos, "FPGA Acceleration of Bayesian Phylogenetic Inference," BMC Bioinformatics, in press.
3. Jason D. Bakos, Panormitis E. Elenis, "A Special-Purpose Architecture for Solving the Breakpoint Median Problem," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 16, No. 12, Dec. 2008.
4. Jason D. Bakos, Panormitis E. Elenis, Jijun Tang, "FPGA Acceleration of Phylogeny Reconstruction for Whole Genome Data," 7th IEEE International Symposium on Bioinformatics & Bioengineering (BIBE'07), Boston, MA, Oct. 2007.
5. Jason D. Bakos, "FPGA Acceleration of Gene Rearrangement Analysis," 15th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM'07), April 23-25, 2007.

Slide callouts: "…X speedup!" and "10X speedup!"

Double Precision Accumulation

– FPGAs allow data to be "streamed" into a computational pipeline
– Many kernels targeted for acceleration include reduction operations, such as the dot product used in matrix-vector multiply, a kernel for many methods
– For large data sets, values are delivered serially to an accumulator (a reduction operation)

[Figure: a stream of values A, B, C (set 1), D, E, F (set 2), G, H, I (set 3) enters an accumulator Σ, which emits one sum per set: A+B+C, D+E+F, G+H+I]
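A software analogue of the figure (an illustrative sketch; the tagged-stream layout is assumed here, not specified on the slide): values arrive serially, each tagged with a set id, and one sum is emitted per set. In hardware the difficulty, addressed on the following slides, is that the floating-point adder itself is deeply pipelined.

```c
#include <stdio.h>

/* Streaming reduction: emit one sum per contiguous run of equal set ids. */
void accumulate(const double *val, const int *set, int n) {
    int i = 0;
    while (i < n) {
        int cur = set[i];
        double sum = 0.0;
        while (i < n && set[i] == cur)   /* serially accumulate this set */
            sum += val[i++];
        printf("set %d: %g\n", cur, sum);
    }
}

int main(void) {
    double v[] = {1, 2, 3,  4, 5, 6,  7, 8, 9};
    int    s[] = {1, 1, 1,  2, 2, 2,  3, 3, 3};
    accumulate(v, s, 9);   /* set 1: 6, set 2: 15, set 3: 24 */
    return 0;
}
```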

The Reduction Problem

[Figure: the basic accumulator architecture feeds a pipelined adder back on itself through a feedback loop, which forces memory and control logic to track partial sums; the required design wraps a reduction circuit around the adder pipeline]

Approach

Reduction complexity scales with the latency of the core operation
– Reduce the latency of the double-precision add?

IEEE 754 adder pipeline (assume a 4-bit significand):
1. Compare exponents
2. De-normalize the smaller value
3. Add the 53-bit mantissas
4. Round
5. Re-normalize and round
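A small worked instance of that pipeline, with values chosen here for illustration (the slide's own running example did not survive the transcript) and a 4-bit significand:

$$ 1.011_2 \times 2^{5} + 1.101_2 \times 2^{3} \quad (= 44 + 13) $$

– Compare exponents: the difference is 2
– De-normalize the smaller operand: $1.101_2 \times 2^{3} \rightarrow 0.01101_2 \times 2^{5}$
– Add mantissas: $1.01100_2 + 0.01101_2 = 1.11001_2$
– Round to the 4-bit significand and re-normalize: $1.110_2 \times 2^{5} = 56 \approx 57$

The final rounding step is where the limited significand loses precision (56 instead of the exact 57).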

Base Conversion

Previous work in single-precision MAC designs uses base conversion
– Idea: shift both inputs to the left by the amount specified in the low-order bits of their exponents
– Reduces the size of the exponent, but requires a wider adder

Example, base-8 conversion:
– 1.01011101…, exp = 10110 (1.36… x 2^22 => ~5.7 million)
– Shift to the left by 6 bits: 1010111.01…, exp = 10 (87.25 x 2^(8*2) => ~5.7 million)

Exponent Compare vs. Adder Width

[Table: for each base, the exponent width, de-normalize speed (MHz), adder width, and number of DSP48s; the numeric entries were lost in the transcript. Accompanying figure labels: denorm, DSP48, renorm]

Accumulator Design

Accumulator Design

[Figure: datapath with a feedback loop, pre-process and post-process stages; α = 3]

Three-Stage Reduction Architecture

[Animation spanning several slides: inputs B1 through B8 stream into the "adder" pipeline (three stages, α1–α3) while an input buffer and an output buffer hold operands between passes; partial sums such as B2+B3 and B1+B4 accumulate until the per-set totals B2+B3+B6, B1+B4+B7, and B5+B8 are formed, at which point the next set begins with C1]

Minimum Set Size

Four “configurations”
Deterministic control sequence, triggered by a set change:
– D, A, C, B, A, B, B, C, B/D
Minimum set size is 8

Use Case: Sparse Matrix-Vector Multiply

[Figure: example sparse matrix
  A 0 0 0 B 0
  0 0 0 C 0 D
  E 0 0 0 F G
  H 0 I 0 J 0
  0 0 0 K 0 0
with its CSR arrays val = A B C D E F G H I J K, col, and ptr]

Streaming format: group the (val, col) pairs of each row and zero-terminate the row, e.g. (A,0) (B,4) (0,0) (C,3) (D,4) (0,0) …
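For reference, a plain-C CSR sparse matrix-vector multiply (a generic software sketch, not the FPGA datapath on the slides) shows how the val/col/ptr arrays are used:

```c
#include <stdio.h>

/* y = A*x for a matrix in CSR form: ptr has nrows+1 entries,
   val/col hold the nonzeros of each row contiguously. */
void spmv_csr(int nrows, const int *ptr, const int *col,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;                      /* dot product for row i */
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

int main(void) {
    /* Tiny 3x3 example: [[1 0 2],[0 3 0],[4 0 5]] times [1,1,1]. */
    int    ptr[] = {0, 2, 3, 5};
    int    col[] = {0, 2, 1, 0, 2};
    double val[] = {1, 2, 3, 4, 5};
    double x[]   = {1, 1, 1}, y[3];
    spmv_csr(3, ptr, col, val, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);    /* 3 3 9 */
    return 0;
}
```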

New SpMV Architecture

Delete tree, replicate accumulator, schedule matrix data: 400 bits

Performance Figures

[Table comparing GPU and FPGA performance: for each matrix, its order/dimensions, nz, average nz per row, memory bandwidth (GB/s), GPU GFLOPs, and FPGA GFLOPs (at 8.5 GB/s). Matrices: TSOPF_RS_b162_c…, E40r…, Simon/olafu, Garon/garon…, Mallya/lhr11c, Hollinger/mark3jac020sc, Bai/dw…, YCheng/psse…, GHS_indef/ncvxqp…; the numeric entries were lost in the transcript]

Performance Comparison

If FPGA memory bandwidth is scaled, by adding multipliers/accumulators, to match the GPU memory bandwidth for each matrix separately:

[Table: GPU memory bandwidth (GB/s) vs. scaled FPGA memory bandwidth (GB/s) per matrix, with scaling factors x6, x6, x6, x5, x4, x3, x3, x3, x3; the bandwidth values were lost in the transcript]

Our Projects

FPGA-based co-processors for linear algebra:
1. Krishna K. Nagar, Jason D. Bakos, "A High-Performance Double Precision Accumulator," IEEE International Conference on Field-Programmable Technology (IC-FPT'09), Dec. 9-11, 2009.
2. Yan Zhang, Yasser Shalabi, Rishabh Jain, Krishna K. Nagar, Jason D. Bakos, "FPGA vs. GPU for Sparse Matrix Vector Multiply," IEEE International Conference on Field-Programmable Technology (IC-FPT'09), Dec. 9-11, 2009.
3. Krishna K. Nagar, Yan Zhang, Jason D. Bakos, "An Integrated Reduction Technique for a Double Precision Accumulator," Proc. Third International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA'09), held in conjunction with Supercomputing 2009 (SC'09), Nov. 15, 2009.
4. Jason D. Bakos, Krishna K. Nagar, "Exploiting Matrix Symmetry to Improve FPGA-Accelerated Conjugate Gradient," 17th Annual IEEE International Symposium on Field Programmable Custom Computing Machines (FCCM'09), April 5-8, 2009.

Our Projects

Multi-FPGA System Architectures:
1. Jason D. Bakos, Charles L. Cathey, E. Allen Michalski, "Predictive Load Balancing for Interconnected FPGAs," 16th International Conference on Field Programmable Logic and Applications (FPL'06), Madrid, Spain, August 28-30, 2006.
2. Charles L. Cathey, Jason D. Bakos, Duncan A. Buell, "A Reconfigurable Distributed Computing Fabric Exploiting Multilevel Parallelism," 14th Annual IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM'06), April 24-26, 2006.

GPU Simulation:
1. Patrick A. Moran, Jason D. Bakos, "A PTX Simulator for Performance Tuning CUDA Code," IEEE Trans. on Parallel and Distributed Systems, submitted.

Task Partitioning for Heterogeneous Computing

GPU and FPGA Acceleration of Data Mining

Logic Minimization

There are different representations of a Boolean function. Truth table representation: F : B^3 → Y
– ON-Set = {000, 010, 100, 101}
– OFF-Set = {011, 110}
– DC-Set = {111}

[Figure: truth table over inputs a, b, c with output Y; don't-cares shown as *]

Logic Minimization Heuristics

Looking for a cover of the ON-Set. The basic steps of the heuristic algorithm:
1. P ← {}
2. Select an element from the ON-Set, e.g. {000}
3. Expand {000} to find primes: {a'c', b'}
4. Select the biggest: P ← P ∪ {b'}
5. Find another element of the ON-Set that is not yet covered, e.g. {010}, and go to step 2
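A toy software rendering of this greedy loop (a sketch under simplifying assumptions: the candidate implicants are hand-listed rather than produced by a real expand step, and all names are illustrative):

```c
#include <stdio.h>

/* A cube over variables a (bit 2), b (bit 1), c (bit 0): 'care' marks the
   literals present, 'value' their required polarity. */
typedef struct { unsigned value, care; const char *name; } Cube;

static int covers(Cube cb, unsigned m) {
    return (m & cb.care) == (cb.value & cb.care);
}

int main(void) {
    unsigned on_set[] = {0, 2, 4, 5};              /* {000, 010, 100, 101} */
    int n_on = 4, covered[4] = {0}, n_left = 4;

    /* Hand-listed candidate implicants standing in for the expand step.   */
    Cube cand[] = {
        {0, 0x2, "b'"},                            /* b = 0                */
        {0, 0x5, "a'c'"},                          /* a = 0, c = 0         */
    };
    int n_cand = 2, used[2] = {0};

    printf("P = {");
    while (n_left > 0) {
        int best = -1, best_gain = 0;
        for (int j = 0; j < n_cand; j++) {         /* pick biggest gain    */
            if (used[j]) continue;
            int gain = 0;
            for (int i = 0; i < n_on; i++)
                if (!covered[i] && covers(cand[j], on_set[i])) gain++;
            if (gain > best_gain) { best_gain = gain; best = j; }
        }
        if (best < 0) break;                       /* nothing helps: stop  */
        used[best] = 1;
        for (int i = 0; i < n_on; i++)
            if (!covered[i] && covers(cand[best], on_set[i])) {
                covered[i] = 1; n_left--;
            }
        printf(" %s", cand[best].name);
    }
    printf(" }\n");                                /* prints: P = { b' a'c' } */
    return 0;
}
```

Run on the slide's ON-Set, the greedy choice picks b' first (it covers three minterms) and then a'c' to pick up 010, matching the sequence described above.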

Acknowledgement

Heterogeneous and Reconfigurable Computing Group:
– Zheming Jin
– Tiffany Mintz
– Krishna Nagar
– Jason Bakos
– Yan Zhang