Isaac Lyngaas, John Paige. Advised by: Srinath Vadlamani & Doug Nychka. SIParCS, July 31, 2014
Why use HPC with R?
Accelerating mKrig & Krig
Parallel Cholesky
◦ Software Packages
Parallel Eigendecomposition
Conclusions & Future Work
Accelerate the ‘fields’ Krig and mKrig functions
Survey of parallel linear algebra software
◦ Multicore (shared memory)
◦ GPU
◦ Xeon Phi
Many developers & users in the field of statistics
◦ Readily available code base
Problem: R is slow for large problems
Bottleneck is in linear algebra operations
◦ mKrig – Cholesky decomposition
◦ Krig – eigendecomposition
R uses sequential algorithms
Strategy: use C-interoperable libraries to parallelize the linear algebra
◦ C functions callable through the R environment (see the sketch below)
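A minimal sketch of how such a wrapper is called from R. The names "parallel_chol.so" and "plasma_chol_wrap" are hypothetical stand-ins for the actual shared library and C entry point built around the parallel libraries.

# Load the compiled wrapper and expose it through R's .C interface
dyn.load("parallel_chol.so")                 # hypothetical shared library
parallelChol <- function(A) {
  n <- nrow(A)
  out <- .C("plasma_chol_wrap",              # hypothetical C entry point
            a = as.double(A),                # matrix passed as a flat double vector
            n = as.integer(n))
  matrix(out$a, n, n)                        # C routine overwrites 'a' with the factor
}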
Symmetric positive definite -> triangular
◦ A = LL^T
◦ Nice properties for determinant calculation
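The determinant property referred to above, demonstrated with base R's chol() (serial here, but the identity is the same for the parallel factorizations):

# If A = L L^T, then det(A) = prod(diag(L))^2, so
# log det(A) = 2 * sum(log(diag(L))) -- cheap and numerically stable
set.seed(1)
X <- matrix(rnorm(2000), 400, 5)
A <- crossprod(X)                            # symmetric positive definite 5 x 5
L <- t(chol(A))                              # chol() returns the upper factor; transpose to L
logdet_chol <- 2 * sum(log(diag(L)))
logdet_ref  <- as.numeric(determinant(A, logarithm = TRUE)$modulus)
all.equal(logdet_ref, logdet_chol)           # TRUE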
◦ PLASMA (multicore, shared memory)
◦ MAGMA (GPU & Xeon Phi)
◦ CULA (GPU)
Multicore (shared memory)
Block scheduling
◦ Determines which operations are done on which core
Block size optimization
◦ Dependent on cache memory
[Figure: speedup vs. 1 core for PLASMA on 1 node (# of observations = 25,000); x-axis: # of cores, y-axis: speedup, with measured and optimal speedup curves]
[Figure: PLASMA on a dual-socket Sandy Bridge node (# of observations = 15,000, cores = 16); time (sec) vs. block size]
[Figure: PLASMA optimal block size vs. # of observations (cores = 16)]
Utilizes GPUs or Xeon Phis for parallelization
◦ Multiple-GPU & multiple-Xeon Phi implementations available
◦ 1 CPU core drives 1 GPU
Block scheduling
◦ Similar to PLASMA
Block size dependent on accelerator architecture
Proprietary CUDA-based linear algebra package
Capable of performing LAPACK operations on 1 GPU
API written in C
Dense & sparse operations available
1 node of Caldera or Pronghorn
◦ 2 x 8-core Intel Xeon (Sandy Bridge) processors per node
  64 GB RAM (~59 GB available)
  Cache per core: L1 = 32 KB, L2 = 256 KB
  Cache per socket: L3 = 20 MB
◦ 2 x NVIDIA Tesla M2070Q GPUs (Caldera)
  ~5.2 GB RAM per device
  1 core drives 1 GPU
◦ 2 x Xeon Phi 5110P (Pronghorn)
  ~7.4 GB RAM per device
Serial R: ~3 GFLOP/sec
Theoretical peak performance:
◦ 16-core Xeon Sandy Bridge: ~333 GFLOP/sec
◦ 1 NVIDIA Tesla M2070Q: ~512 GFLOP/sec
◦ 1 Xeon Phi 5110P: ~1,011 GFLOP/sec
[Figure: "Accelerated Hardware has Room for Improvement" – achieved GFLOP/sec vs. # of observations for PLASMA (16 cores), MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 MIC, MAGMA 2 MICs, and CULA]
[Figure: "All Parallel Cholesky Implementations are Faster than Serial R" – time (sec) vs. # of observations for serial R, PLASMA (16 cores), CULA, MAGMA 1 GPU, MAGMA 2 GPUs, MAGMA 1 Xeon Phi, MAGMA 2 Xeon Phis]
>100x speedup over serial R when # of observations = 10k
[Figure: "Eigendecomposition also Faster on Accelerated Hardware" – time (sec) vs. # of observations for serial R, CULA, MAGMA 1 GPU, MAGMA 2 GPUs]
~6x speedup over serial R when # of observations = 10k
[Figure: ratio of eigendecomposition time to Cholesky time vs. # of observations; both times measured with MAGMA w/ 2 GPUs]
Can run ~30 Cholesky decompositions in the time of one eigendecomposition
If we want to do 16 Cholesky decompositions, parallel R can run one serial factorization per core; a parallel Cholesky using the whole node is guaranteed to do better once its speedup over one core exceeds 16.
[Figure: "Parallel Cholesky Beats Parallel R for Moderate to Large Matrices" – speedup vs. parallel R as a function of # of observations, for PLASMA and MAGMA 2 GPUs]
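A back-of-the-envelope check of that reasoning (assumed timing model, not measured data): with 16 jobs on 16 cores, parallel R finishes in roughly one serial factorization time, while running the jobs back-to-back with a parallel Cholesky takes 16/speedup of that.

# njobs serial factorizations, one per core, vs. running them sequentially
# with a parallel Cholesky that achieves the given speedup over one core
breakeven <- function(t_serial, speedup, njobs = 16) {
  c(parallel_R    = t_serial,                   # all jobs at once, one per core
    parallel_chol = njobs * t_serial / speedup) # jobs back-to-back on the whole node
}
breakeven(t_serial = 60, speedup = 100)         # parallel Cholesky wins whenever speedup > 16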
Using Caldera
◦ Single Cholesky decomposition
◦ Matrix size < 20k: use PLASMA (16 cores w/ optimal block size)
◦ Matrix size 20k – 35k: use MAGMA w/ 2 GPUs
◦ Matrix size > 35k: use PLASMA (16 cores w/ optimal block size)
Dependent on computing resources available (see the sketch below)
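A hypothetical dispatcher reflecting the Caldera recommendation above; cholPLASMA() and cholMAGMA() are placeholders for the R wrappers around the compiled PLASMA and MAGMA routines, not functions from the original code base.

# Pick the Cholesky backend by matrix size (single factorization, one Caldera node)
cholBest <- function(A) {
  n <- nrow(A)
  if (n < 20000) {
    cholPLASMA(A, cores = 16)        # multicore PLASMA with tuned block size
  } else if (n <= 35000) {
    cholMAGMA(A, ngpu = 2)           # MAGMA with 2 GPUs
  } else {
    cholPLASMA(A, cores = 16)        # back to PLASMA per the benchmark results
  }
}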
Explored implementations on accelerated hardware
◦ Multicore (shared memory)
◦ GPUs
◦ Xeon Phis
Installed third-party linear algebra packages & programmed wrappers that call these packages from R
◦ Installation instructions and programs available through a Bitbucket repo; for access contact Srinath Vadlamani
Future work
◦ Multicore, distributed memory
◦ Single precision
Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data. R package, 2014.
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series, volume 180. IOP Publishing, 2009.
Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. of VECPAR'10, Berkeley, CA, June 22-25, 2010.
Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September 2013.
[Figure: steps of the tile Cholesky factorization, showing the kernel applied to each block at each step (xPOTRF, xTRSM, xSYRK, xGEMM) through to the final factored matrix]