Parallel Software for SemiDefinite Programming with Sparse Schur Complement Matrix

Presentation transcript:

1 Parallel Software for SemiDefinite Programming with Sparse Schur Complement Matrix
Makoto Yamashita (Tokyo-Tech), Katsuki Fujisawa (Chuo University), Mituhiro Fukuda (Tokyo-Tech), Yoshiaki Futakata (University of Virginia), Kazuhiro Kobayashi (National Maritime Research Institute), Masakazu Kojima (Tokyo-Tech), Kazuhide Nakata (Tokyo-Tech), Maho Nakata (RIKEN)
ISMP Chicago [2009/08/26]

2 Extremely Large SDPs
 Arising from various fields: Quantum Chemistry, Sensor Network Problems, Polynomial Optimization Problems
 Most of the computation time is spent on the Schur complement matrix (SCM)
 [SDPARA] Parallel computation for the SCM, in particular for sparse SCM

3 Outline
1. SemiDefinite Programming and the Schur complement matrix
2. Parallel Implementation
3. Parallelism for sparse Schur complement matrices
4. Numerical Results
5. Future works

4 Standard form of SDP
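(The slide's formulas were not captured by the transcript. As a hedged reconstruction, the standard form used throughout the SDPA family, with symmetric data matrices $A_1, \dots, A_m, C$ and inner product $U \bullet V = \mathrm{Tr}(U^{\mathsf T} V)$, reads:

\[
\begin{array}{ll}
\text{(P)} & \min\ \textstyle\sum_{k=1}^{m} c_k x_k \quad \text{s.t.}\ \ X = \textstyle\sum_{k=1}^{m} A_k x_k - C,\ \ X \succeq O, \\[4pt]
\text{(D)} & \max\ C \bullet Y \quad \text{s.t.}\ \ A_k \bullet Y = c_k \ (k = 1, \dots, m),\ \ Y \succeq O.
\end{array}
\])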

5 Primal-Dual Interior-Point Methods
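(The slide body was lost; in outline, as a sketch rather than the slide's exact content: a primal-dual interior-point method keeps an interior iterate $(x, X, Y)$, solves a linearization of the perturbed complementarity condition $XY = \mu I$ for a search direction $(dx, dX, dY)$, and updates

\[
(x, X, Y) \leftarrow (x, X, Y) + \alpha\,(dx, dX, dY),
\]

driving $\mu \downarrow 0$ while keeping $X, Y \succ O$.)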

6 Computation for Search Direction
Schur complement matrix ⇒ Cholesky factorization
Exploitation of sparsity in: 1. ELEMENTS 2. CHOLESKY
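(In the notation of slide 4, the direction reduces to the linear system $B\,dx = r$. Assuming the HRVW/KSH/M search direction used by SDPA, the entries of the Schur complement matrix $B$ take the form

\[
B_{pq} = \mathrm{Tr}\!\left(A_p X^{-1} A_q Y\right), \qquad p, q = 1, \dots, m,
\]

so the two bottlenecks are forming all $B_{pq}$ (ELEMENTS) and factorizing $B$ (CHOLESKY).)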

7 Bottlenecks on a Single Processor
Apply parallel computation to the bottlenecks. Times in seconds on an Opteron 246 (2.0GHz).

              LiOH             HF
ELEMENTS      6150 ( 43%)      16719 ( 35%)
CHOLESKY      7744 ( 54%)      20995 ( 44%)
TOTAL        14250 (100%)      47483 (100%)

8 SDPARA
 SDPA parallel version (a generic SDP solver)
 MPI & ScaLAPACK
 Row-wise distribution for ELEMENTS
 Parallel Cholesky factorization for CHOLESKY

9 Row-wise distribution for evaluation of the Schur complement matrix
 Assume 4 CPUs are available
 Each CPU computes only its assigned rows
 No communication between CPUs
 Efficient memory management
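A minimal sketch of this row-wise scheme, assuming a cyclic assignment of rows to MPI processes; compute_element is a hypothetical helper standing in for SDPA's element formulas:

```cpp
#include <mpi.h>
#include <vector>

double compute_element(int p, int q);  // hypothetical: one SCM entry B_pq

void evaluate_scm_rows(int m, std::vector<std::vector<double>>& B) {
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    // Cyclic distribution: process 'rank' owns rows rank, rank+nprocs, ...
    // Each row is evaluated entirely locally, so no communication is needed
    // and each process only has to allocate memory for its own rows.
    for (int p = rank; p < m; p += nprocs) {
        for (int q = p; q < m; ++q) {        // SCM is symmetric: upper triangle
            B[p][q] = compute_element(p, q);
        }
    }
}
```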

10 Parallel Cholesky factorization
 We adopt ScaLAPACK for the Cholesky factorization of the Schur complement matrix
 We redistribute the matrix from the row-wise to a two-dimensional block-cyclic distribution
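For reference, a small sketch of the target mapping: under a two-dimensional block-cyclic distribution with block size nb on a Pr x Pc process grid, entry (i, j) is owned by the grid coordinate computed below. This is the layout expected by ScaLAPACK's parallel Cholesky routine (pdpotrf); the function name is illustrative.

```cpp
// Which process in a Pr x Pc grid owns entry (i, j) of the SCM
// under a 2D block-cyclic distribution with block size nb.
struct GridCoord { int prow, pcol; };

GridCoord owner2dBlockCyclic(int i, int j, int nb, int Pr, int Pc) {
    GridCoord g;
    g.prow = (i / nb) % Pr;  // block-row index, wrapped over process rows
    g.pcol = (j / nb) % Pc;  // block-column index, wrapped over process columns
    return g;
}
```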

11 Computation time on an SDP from Quantum Chemistry [LiOH]
AIST super cluster: Opteron 246 (2.0GHz), 6GB memory/node

12 Scalability on an SDP from Quantum Chemistry [NF]
Speed-ups: Total 29x, ELEMENTS 63x, CHOLESKY 39x.
The parallelization of ELEMENTS is very effective.

13 Sparse Schur complement matrix
 The Schur complement matrix becomes very sparse for some applications ⇒ the simple row-wise distribution loses its efficiency
Examples: SCM from Control Theory (density 100%); SCM from a Sensor Network problem (density 2.12%)

14 Sparseness of the Schur complement matrix
 Many applications have a diagonal block structure

15 Exploitation of Sparsity in SDPA
 We choose the evaluation formula row by row: SDPA provides three formulas (F1, F2, F3), selected for each row according to the sparsity of its constraint matrix (see the sketch below)
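A hedged sketch of such a row-wise selection; the thresholds and cost model here are illustrative only, not SDPA's actual ones (real implementations compare estimated flop counts of the three formulas):

```cpp
// Illustrative per-row choice among SDPA's three element formulas.
// F1: dense computation; F2: intermediate; F3: fully exploits sparsity.
enum Formula { F1, F2, F3 };

// nnzP: nonzeros of the constraint matrix A_p of row p; n: matrix dimension.
Formula chooseFormula(long nnzP, long n) {
    if (nnzP > n * n / 4) return F1;  // nearly dense row: dense formula is cheapest
    if (nnzP > n)         return F2;  // moderately sparse row
    return F3;                        // very sparse row: work only on nonzeros
}
```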

16 ELEMENTS for a Sparse Schur complement
Load on each CPU: CPU1: 190, CPU2: 185, CPU3: 188

17 CHOLESKY for a Sparse Schur complement
 Parallel sparse Cholesky factorization implemented in MUMPS
 MUMPS adopts the multifrontal method: memory storage on each processor should be consecutive, and the distribution used for ELEMENTS matches this method

18 Computation time for SDPs from Polynomial Optimization Problems
tsubasa: Xeon E5440 (2.83GHz), 8GB memory/node
Parallel sparse Cholesky achieves mild scalability; ELEMENTS attains a 24x speed-up on 32 CPUs.

19 ELEMENTS load balance on 32 CPUs
 Only the first processor has slightly heavier computation.

20 Automatic selection of sparse / dense SCM
 Dense parallel Cholesky achieves higher scalability than sparse parallel Cholesky ⇒ dense becomes better for many processors
 We estimate both computation times from computational cost and scalability
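A minimal sketch of such an estimate-based switch; all names and the efficiency model are assumptions for illustration, not SDPARA's actual code:

```cpp
// Choose dense vs. sparse parallel CHOLESKY from estimated costs.
// costDense, costSparse: estimated sequential flop counts of each factorization;
// effDense, effSparse: assumed parallel efficiencies on p processes (0..1].
bool useDenseCholesky(double costDense, double costSparse,
                      double effDense, double effSparse, int p) {
    double tDense  = costDense  / (p * effDense);   // estimated parallel time
    double tSparse = costSparse / (p * effSparse);
    return tDense <= tSparse;  // dense tends to win as p grows (better scalability)
}
```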

21 Sparse/Dense CHOLESKY for a small SDP from POP
tsubasa: Xeon E5440 (2.83GHz), 8GB memory/node
The automatic selection failed only on 4 CPUs (the scalability of the sparse Cholesky is unstable on 4 CPUs).

22 Numerical Results
 Comparison with PCSDP on Sensor Network Problems generated by SFSDP
 Multi-threading on Quantum Chemistry SDPs

23 SDPs from Sensor Network problems (time unit: second)
#sensors = 1,000 (m = 16,450; density 1.23%): times for SDPARA and PCSDP by #CPU
#sensors = 35,000 (m = 527,096): times for SDPARA by #CPU; PCSDP: memory overflow
PCSDP runs out of memory if #sensors >= 4,000
(The per-CPU timing cells of the slide's tables were not preserved in the transcript.)

24 MPI + Multi-Threading for Quantum Chemistry
Instance N.4P.DZ.pqgt11t2p (m = 7,230); times in seconds.
64x speed-up on [16 nodes x 8 threads]
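A hedged sketch of the hybrid scheme: MPI distributes SCM rows across nodes as before, and OpenMP threads share the rows owned by each process. The function and helper names are illustrative, not SDPARA's actual code.

```cpp
#include <mpi.h>
#include <omp.h>
#include <vector>

double compute_element(int p, int q);  // hypothetical: one SCM entry B_pq

void evaluate_scm_rows_hybrid(int m, std::vector<std::vector<double>>& B) {
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    // Collect the rows owned by this process (cyclic distribution) ...
    std::vector<int> rows;
    for (int p = rank; p < m; p += nprocs) rows.push_back(p);
    // ... and let this node's threads share them; dynamic scheduling
    // smooths out rows of different cost.
    #pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < static_cast<int>(rows.size()); ++k) {
        int p = rows[k];
        for (int q = p; q < m; ++q) B[p][q] = compute_element(p, q);
    }
}
```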

25 Concluding Remarks & Future Works
1. New parallel schemes for the sparse Schur complement matrix
2. Reasonable scalability
3. Extremely large-scale SDPs with sparse Schur complement matrices
 Future work: improvement of multi-threading for the sparse Schur complement matrix