Computación algebraica dispersa con GPUs y su aplicación en tomografía electrónica Non-linear iterative optimization method for locating particles using.

Slides:

Advertisements

Similar presentations

Yafeng Yin, Lei Zhou, Hong Man 07/21/2010

Advertisements

Monte-Carlo method and Parallel computing  An introduction to GPU programming Mr. Fang-An Kuo, Dr. Matthew R. Smith NCHC Applied Scientific Computing.

Motivation Desktop accelerators (like GPUs) form a powerful heterogeneous platform in conjunction with multi-core CPUs. To improve application performance.

Intro to GPU’s for Parallel Computing. Goals for Rest of Course Learn how to program massively parallel processors and achieve – high performance – functionality.

HPCC Mid-Morning Break High Performance Computing on a GPU cluster Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery.

Patch-based Image Deconvolution via Joint Modeling of Sparse Priors Chao Jia and Brian L. Evans The University of Texas at Austin 12 Sep

Sparse LU Factorization for Parallel Circuit Simulation on GPU Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Department of Electronic Engineering,

Eigenvalue and eigenvectors  A x = λ x  Quantum mechanics (Schrödinger equation)  Quantum chemistry  Principal component analysis (in data mining)

OpenFOAM on a GPU-based Heterogeneous Cluster

Fast Circuit Simulation on Graphics Processing Units Kanupriya Gulati † John F. Croix ‡ Sunil P. Khatri † Rahm Shastry ‡ † Texas A&M University, College.

Computes the partial dot products for only the diagonal and upper triangle of the input matrix. The vector computed by this architecture is added to the.

L15: Review for Midterm. Administrative Project proposals due today at 5PM (hard deadline) – handin cs6963 prop March 31, MIDTERM in class L15: Review.

1 Aug 7, 2004 GPU Req GPU Requirements for Large Scale Scientific Applications “Begin with the end in mind…” Dr. Mark Seager Asst DH for Advanced Technology.

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram.

Using the Trigger Test Stand at CDF for Benchmarking CPU (and eventually GPU) Performance Wesley Ketchum (University of Chicago)

ME964 High Performance Computing for Engineering Applications “The real problem is not whether machines think but whether men do.” B. F. Skinner © Dan.

HPCC Mid-Morning Break Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery Introduction to the new GPU (GFX) cluster.

Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.

Motivation “Every three minutes a woman is diagnosed with Breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006) Explore the use.

GPU Programming with CUDA – Accelerated Architectures Mike Griffiths

Approximating the Algebraic Solution of Systems of Interval Linear Equations with Use of Neural Networks Nguyen Hoang Viet Michal Kleiber Institute of.

Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications Published in: Cluster.

An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.

“High-performance computational GPU-stand for teaching undergraduate and graduate students the basics of quantum-mechanical calculations“ “Komsomolsk-on-Amur.

An Effective Dynamic Scheduling Runtime and Tuning System for Heterogeneous Multi and Many-Core Desktop Platforms Authous: Al’ecio P. D. Binotto, Carlos.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Thermoacoustics in random fibrous materials Seminar Carl Jensen Tuesday, March

10/17/97Optical Diffraction Tomography1 A.J. Devaney Department of Electrical Engineering Northeastern University Boston, MA USA

Accelerating Statistical Static Timing Analysis Using Graphics Processing Units Kanupriya Gulati & Sunil P. Khatri Department of ECE Texas A&M University,

By Arun Bhandari Course: HPC Date: 01/28/12. GPU (Graphics Processing Unit) High performance many core processors Only used to accelerate certain parts.

Fast Thermal Analysis on GPU for 3D-ICs with Integrated Microchannel Cooling Zhuo Fen and Peng Li Department of Electrical and Computer Engineering, {Michigan.

Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors

Taking the Complexity out of Cluster Computing Vendor Update HPC User Forum Arend Dittmer Director Product Management HPC April,

YOU LI SUPERVISOR: DR. CHU XIAOWEN CO-SUPERVISOR: PROF. LIU JIMING THURSDAY, MARCH 11, 2010 Speeding up k-Means by GPUs 1.

Accelerating Statistical Static Timing Analysis Using Graphics Processing Units Kanupriya Gulati and Sunil P. Khatri Department of ECE, Texas A&M University,

PDCS 2007 November 20, 2007 Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor Yusaku Yamamoto 1 Takafumi Miyata.

Automatic Performance Tuning of SpMV on GPGPU Xianyi Zhang Lab of Parallel Computing Institute of Software Chinese Academy of Sciences

Fast Support Vector Machine Training and Classification on Graphics Processors Bryan Catanzaro Narayanan Sundaram Kurt Keutzer Parallel Computing Laboratory,

Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD September 7, 2007 PaCT-2007, Pereslavl-Zalessky.

IIIT Hyderabad Scalable Clustering using Multiple GPUs K Wasif Mohiuddin P J Narayanan Center for Visual Information Technology International Institute.

Jie Chen. 30 Multi-Processors each contains 8 cores at 1.4 GHz 4GB GDDR3 memory offers ~100GB/s memory bandwidth.

Adam Wagner Kevin Forbes. Motivation  Take advantage of GPU architecture for highly parallel data-intensive application  Enhance image segmentation.

ICAL GPU 架構中所提供分散式運算之功能與限制. 11/17/09ICAL2 Outline Parallel computing with GPU NVIDIA CUDA SVD matrix computation Conclusion.

1)Leverage raw computational power of GPU  Magnitude performance gains possible.

 Genetic Algorithms  A class of evolutionary algorithms  Efficiently solves optimization tasks  Potential Applications in many fields  Challenges.

GPU Programming Shirley Moore CPS 5401 Fall 2013

GPU Accelerated MRI Reconstruction Professor Kevin Skadron Computer Science, School of Engineering and Applied Science University of Virginia, Charlottesville,

Program Optimizations and Recent Trends in Heterogeneous Parallel Computing Dušan Gajić, University of Niš Program Optimizations and Recent Trends in Heterogeneous.

HAMA: An Efficient Matrix Computation with the MapReduce Framework Sangwon Seo, Edward J. Woon, Jaehong Kim, Seongwook Jin, Jin-soo Kim, Seungryoul Maeng.

CS/EE 217 GPU Architecture and Parallel Programming Midterm Review

Generic Compressed Matrix Insertion P ETER G OTTSCHLING – S MART S OFT /TUD D AG L INDBO – K UNGLIGA T EKNISKA H ÖGSKOLAN SmartSoft – TU Dresden

Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Present by Chunyi Victor W Lee, Changkyu Kim, Jatin Chhugani,

Linear Algebra Operators for GPU Implementation of Numerical Algorithms J. Krüger R. Westermann computer graphics & visualization Technical University.

University of Michigan Electrical Engineering and Computer Science Adaptive Input-aware Compilation for Graphics Engines Mehrzad Samadi 1, Amir Hormati.

Geant4 on GPU prototype Nicholas Henderson (Stanford Univ. / ICME)

Large-scale geophysical electromagnetic imaging and modeling on graphical processing units Michael Commer (LBNL) Filipe R. N. C. Maia (LBNL-NERSC) Gregory.

Fermi National Accelerator Laboratory & Thomas Jefferson National Accelerator Facility SciDAC LQCD Software The Department of Energy (DOE) Office of Science.

GPU Acceleration of Particle-In-Cell Methods B. M. Cowan, J. R. Cary, S. W. Sides Tech-X Corporation.

Analysis of Sparse Convolutional Neural Networks

Stencil-based Discrete Gradient Transform Using

Employing compression solutions under openacc

A survey of Exascale Linear Algebra Libraries for Data Assimilation

Parallel Plasma Equilibrium Reconstruction Using GPU

GPU Computing Jan Just Keijser Nikhef Jamboree, Utrecht

Scientific requirements and dimensioning for the MICADO-SCAO RTC

I. E. Venetis1, N. Nikoloutsakos1, E. Gallopoulos1, John Ekaterinaris2

Bisection and Twisted SVD on GPU

for more information ... Performance Tuning

GPU Implementations for Finite Element Methods

Presentation transcript:

Computación algebraica dispersa con GPUs y su aplicación en tomografía electrónica Non-linear iterative optimization method for locating particles using HPC techniques G. Ortega 1, J. Lobera 2, I. García 3, M. P. Arroyo 2 and E. M. Garzón 1 1 Dpt. Computer Architecture and Electronics. University of Almería 2 Aragón Institute of Engineering Research (I3A). University of Zaragoza 3 Dpt. Computer Architecture. University of Málaga Heteropar 2014

Non-linear iterative optimization method for locating particles using HPC techniques 2 Introduction Non-Linear ODT model for locating particles (NLODT-P) Methodology to develop models GPU computing (Forward procedure) Validation of NLODT-P Evaluation of NLODT-P Conclusions Outline

Non-linear iterative optimization method for locating particles using HPC techniques ODT characteristics: ODT has been recently introduced in several fields. High accuracy with non-damaging radiation and its imaging capability to recover information from the object. 3 Optical Diffraction Tomography (ODT)

Non-linear iterative optimization method for locating particles using HPC techniques 4 Optical Diffraction Tomography (ODT) 2D holograms Laser ODT techniques

Non-linear iterative optimization method for locating particles using HPC techniques 5 2D holograms Laser ODT techniques A priori information INPUTS Set of seeding particle locations OUTPUT Diameter of the particles Refractive index of particles Fluid refractive index Optical Diffraction Tomography (ODT)

Non-linear iterative optimization method for locating particles using HPC techniques NLODT  Preliminary model 2D (proposed by J. Lobera and J.M. Coupland)  High computational cost. The resolution of large and sparse linear systems of equations of complex numbers in double precision was required. 6 Optical Diffraction Tomography (ODT) Linear Optical Diffraction Tomography (LODT) Non Linear Optical Diffraction Tomography (NLODT) Overcome the problems of multiple scattering in LODT High fidelity images Implementation and validation of a 3D NLODT model for the location of particles

Non-linear iterative optimization method for locating particles using HPC techniques Helmholtz equation : 7 Non-Linear ODT model for locating particles (NLODT-P) E r (r)  illuminating field. E s (r)  Scattering field. n (r)  Refractive index of the object to reconstruct. f (r)  Scattering potential. Reconstruction problem  Optimization problem Optimization Method CG Estimation of n(r)

Non-linear iterative optimization method for locating particles using HPC techniques 8 1: Compute PL(1) 2: It = 2 3: while It < iterMax and Value < threshold do 4: Update n(r) 5: for i = 1, 2,... until i < Nh do 6: E i s (r) = Forward (E i r (r), n(r)) 7: E i r,n (r) = E i s (r) + E i r (r) 8: E i c (r) = Filter (E i s (r), k i 0,NA) 9: E i m,n (r) = Forward ((E i c (r) − E i m (r)) ∗, n(r)) 10: E i m,n (r) = E i m,n (r) + (E i c (r) − E i m (r)) ∗ 11: g(r) ∗ = g(r) ∗ + (E i m,n (r), E i r,n (r)) 12: end for 13: gMF (r) = Matched Filtering(g(r), sample) 14: [PL(It), Value] = max(abs(gMF (r))) 15: It = It : end Non-Linear ODT model for locating particles (NLODT-P) Locate next particle Initial Iteration Compute the updated gradient, g(r) Iterative Process Location of the 1st particle Update the refractive index field

Non-linear iterative optimization method for locating particles using HPC techniques 9 1: Compute PL(1) 2: It = 2 3: while It < iterMax and Value < threshold do 4: Update n(r) 5: for i = 1, 2,... until i < Nh do 6: E i s (r) = Forward (E i r (r), n(r)) 7: E i r,n (r) = E i s (r) + E i r (r) 8: E i c (r) = Filter (E i s (r), k i 0,NA) 9: E i m,n (r) = Forward ((E i c (r) − E i m (r)) ∗, n(r)) 10: E i m,n (r) = E i m,n (r) + (E i c (r) − E i m (r)) ∗ 11: g(r) ∗ = g(r) ∗ + (E i m,n (r), E i r,n (r)) 12: end for 13: gMF (r) = Matched Filtering(g(r), sample) 14: [PL(It), Value] = max(abs(gMF (r))) 15: It = It : end Non-Linear ODT model for locating particles (NLODT-P) Initial Iteration Iterative Process Resolution Helmholtz equation 95% Runtime

Non-linear iterative optimization method for locating particles using HPC techniques 10 Methodology to develop models (MATLAB + GPU computing) Multithreaded High programming language level Visualization tools Libraries: GPUmat, Jacket, Mexfiles Accelerate procedures with high computational cost: Forward (Helmholtz equation)

Non-linear iterative optimization method for locating particles using HPC techniques 11 GPU computing (Forward procedure) 11 Green’s Functions Spatial Discretization (based on FEM) Large linear system of equations M is sparse, symmetric and with a regular pattern Linear Eliptic Partial Differential of Equations (PDE) Biconjugate Gradient Method (BCG)

Non-linear iterative optimization method for locating particles using HPC techniques 12 Regular Format dots saxpy SpMV Biconjugate Gradient Method (BCG) GPU computing (Forward procedure)

Non-linear iterative optimization method for locating particles using HPC techniques 13 GPU computing (Forward procedure) Regularities of M 1.Complex symmetric matrix 2.Max 7 nonzeros/row 3.Nonzeros are located by 7 diagonals 4.Same values for lateral diagonals (a, b, c) Mem. Req. (GB) for storing M: VolTP CRS ELLR-T Reg Format

Non-linear iterative optimization method for locating particles using HPC techniques CUDA interface. CUBLAS library for saxpy and dot operations. Optimization techniques: The reading of the sparse matrix and data involved in vector operations are coalesced global memory access, this way the bandwidth of global memory is maximized. Shared memory and registers are used to store any intermediate data. 14 GPU computing (Forward procedure)

Non-linear iterative optimization method for locating particles using HPC techniques 15 Validation of NLODT-P Volume to reconstruct 4 particles of 2µm = 0.633µm NA = 0.55 Vol= N h = 1 A PRIORI INFORMATION: Diameter of the particles Refractive index of particles Fluid refractive index

Non-linear iterative optimization method for locating particles using HPC techniques 16 Validation of NLODT-P 1ª iteration Evolution of the gradient of the cost function, g(r) Evolution of the distribution of the refractive index, n(r) 4ª iteration 2ª iteration 3ª iteration

Non-linear iterative optimization method for locating particles using HPC techniques 17 Evaluation of NLODT-P Tesla M2090 Peak performance (double precision) (GFlops) 665 Peak performance (simple precision) (GFlops) 1331 Device memory (GB)6 Clock rate (GHz)1.3 Memory bandwidth (GBytes/sec ) 177 Multiprocessors16 CUDA cores512 Compute Capability2 Year architecture2011 DRAM TYPEGDDR5 GPU architecture (CUDA programming interface) Multicore: Intel Xeon E5620 (8 cores) (2.4 GHz, 48 GB RAM) under Linux.

Non-linear iterative optimization method for locating particles using HPC techniques 18 Evaluation of NLODT-P 4 particles of 0.5µm = 0.633µm NA = 0.55 Vol= N h = 3 iterMax=4

Non-linear iterative optimization method for locating particles using HPC techniques A physical model of Non-linear ODT for the application in velocimetry techniques has been implemented and evaluated over 3D prototypes of interest (NLODT-P). MATLAB framework  productivity in the generation of codes. Integration of GPU computing and MATLAB  Mex-files. Illustration of a methodology for the development of scientific applications. GPU memory is a limiting factor. Future works: distributed implementation. 19 Conclusions

Non-linear iterative optimization method for locating particles using HPC techniques Distributed version of Forward 20 Hybrid Implementation (BCG-3DH) MultiGPU ImplementationMulticore CPU Implementation + Solve larger problems Runtime is reduced CUDAMessage Passing Interface (MPI) Conclusions

Non-linear iterative optimization method for locating particles using HPC techniques 21 Runtime (s) of 1000 iterations of the BCG-3DH using 8GPUs+8CPUs versus 8GPUs Vol Conclusions

Non-linear iterative optimization method for locating particles using HPC techniques Contraportada 22