High Performance Solvers for Semidefinite Programs

Presentation transcript:

High Performance Solvers for Semidefinite Programs
This talk is supported by Ewha University.
Makoto Yamashita @ Tokyo Tech, Katsuki Fujisawa @ Chuo Univ, Mituhiro Fukuda @ Tokyo Tech, Kazuhiro Kobayashi @ NMRI, Kazuhide Nakata @ Tokyo Tech, Maho Nakata @ RIKEN
KSIAM Annual Meeting @ Jeju, 2011/11/25 (meeting dates: 2011/11/25-2011/11/26)

Our interests & SDPA Family
How fast can we solve SDPs? How large an SDP can we solve? How accurately can we solve SDPs?
SDPA family: SDPA (base solver), SDPARA (parallel), SDPA-M (Matlab), SDPA-C (structural sparsity), SDPARA-C (parallel, structural sparsity), SDPA-GMP (multiple precision)
SDPA Homepage: http://sdpa.sf.net/
KSIAM 2011 @ Jeju

SDPA Online Solver
http://sdpa.sf.net/ ⇒ Online Solver
1. Log in to the online solver
2. Upload your problem
3. Push the 'Execute' button
4. Receive the result via Web/Mail

Outline
1. SDP Applications
2. Primal-Dual Interior-Point Methods
3. Inside of SDPARA (Large & Fast)
4. Inside of SDPA-GMP (Accurate)
5. Conclusion

SDP Applications
1. Control Theory
2. Quantum Chemistry
3. Sensor Network Localization Problem
4. Polynomial Optimization

SDP Applications 1. Control Theory
We want to keep the system stable against swing.
Stability Condition ⇒ Lyapunov Condition ⇒ SDP
INFORMS 2011 @ Charlotte
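As a sketch of how a Lyapunov condition turns into an SDP, assume (the slide does not say so) that the system is linear, with state x and system matrix A. The symbols A and P are illustrative and not taken from the talk:

```latex
\dot{x}(t) = A\,x(t) \ \text{is asymptotically stable}
\iff
\exists\, P \succ 0 \ \text{such that} \ A^{\top} P + P A \prec 0 .
```

Both conditions are linear matrix inequalities in the unknown symmetric matrix P, so searching for a feasible P is a semidefinite feasibility problem.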

SDP Applications 2. Quantum Chemistry
Ground state energy: locate electrons
Schrödinger Equation ⇒ Reduced Density Matrix ⇒ SDP

SDP Applications 3. Sensor Network Localization
Distance Information ⇒ Sensor Locations
Also: Protein Structure

SDP Applications 4. Polynomial Optimization
For example, [example problem not captured in the transcript]
NP-hard in general
Very good lower bound by the SDP relaxation method

How Large & How Fast & How Accurate
SDP Applications: Control Theory, Quantum Chemistry, Polynomial Optimization, Sensor Network Localization Problem
Many Applications ⇒ How Large & How Fast & How Accurate

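Standard form
[Primal and dual formulas not captured in the transcript]
The variables are [not captured].
The inner product is [not captured].
The size is roughly determined by [not captured]; the slide contrasts the size handled by an ordinary solver with our target size.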
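The formulas on this slide did not survive extraction. As a hedged reconstruction, the block below assumes the talk uses the standard form given in the SDPA documentation; the symbols c_k, F_k, x, X, Y, m and n follow that documentation rather than anything visible in this transcript.

```latex
\begin{aligned}
\text{(P)}\quad & \min_{x,\,X} \ \sum_{k=1}^{m} c_k x_k
  \quad \text{s.t.} \quad X = \sum_{k=1}^{m} F_k x_k - F_0, \quad X \succeq O, \\
\text{(D)}\quad & \max_{Y} \ F_0 \bullet Y
  \quad \text{s.t.} \quad F_k \bullet Y = c_k \ (k = 1, \dots, m), \quad Y \succeq O, \\
& A \bullet B = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} B_{ij}.
\end{aligned}
```

Under this assumption the variables are the vector x with m entries and the n-by-n symmetric matrices X and Y, and m is the dimension of the Schur complement matrix, which matches the way m is quoted in the numerical results of this talk.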

Primal-Dual Interior-Point Methods
[Figure: the central path inside the feasible region; labels: Central Path, Target, Optimal, Feasible region]

Schur Complement Matrix
Schur Complement Equation: [formula not captured in the transcript]
Schur Complement Matrix (SCM): [formula not captured in the transcript]
1. ELEMENTS (Evaluation of SCM)
2. CHOLESKY (Cholesky factorization of SCM)
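The ELEMENTS step can be illustrated with a small pure-Python sketch. It assumes the element formula B_ij = (X^{-1} F_i Y) • F_j that appears in the SDPA papers; the formula itself is missing from this transcript, and the function names and the tiny 2-by-2 data are hypothetical.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inner(a, b):
    """Matrix inner product: sum over all entries of a_ij * b_ij."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def schur_complement_matrix(fs, x_inv, y):
    """Evaluate B_ij = (X^{-1} F_i Y) . F_j (assumed formula, see lead-in)."""
    m = len(fs)
    b = [[0.0] * m for _ in range(m)]
    for i in range(m):
        # X^{-1} F_i Y is shared by every element of row i.
        left = matmul(matmul(x_inv, fs[i]), y)
        for j in range(m):
            b[i][j] = inner(left, fs[j])
    return b


# Hypothetical data: two 2x2 constraint matrices, X = I, Y positive definite.
F = [[[1, 0], [0, 0]], [[0, 1], [1, 0]]]
X_INV = [[1, 0], [0, 1]]
Y = [[2, 1], [1, 2]]
B = schur_complement_matrix(F, X_INV, Y)
print(B)
```

Each row of B needs only the shared product for its own index i, which is why the rows can be evaluated independently of each other.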

Computation time on single processor
Time unit is second, SDPA 7, Xeon 5460 (3.16GHz)

            Control    POP
ELEMENTS      22228    668
CHOLESKY       1593   1992
Total         23986   2713

ELEMENTS: row-wise distribution
CHOLESKY: two-dimensional block-cyclic distribution
SDPARA replaces these bottlenecks by parallel computation.

Row-wise distribution
All rows are independent.
Assign processors to rows in a cyclic manner.
Simple idea ⇒ Very EFFICIENT
High scalability
[Figure: example assignment of rows to Processor1, Processor2, Processor3, Processor4]
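The cyclic row assignment can be sketched in a few lines. The function name and the 0-based numbering of rows and processors are assumptions made for illustration; SDPARA's own implementation is not shown in the slides.

```python
def rows_for_processor(rank, num_rows, num_procs):
    """Rows of the Schur complement matrix owned by `rank` under a cyclic
    row-wise distribution (0-based rows and ranks, both assumptions)."""
    return list(range(rank, num_rows, num_procs))


# Hypothetical example: 10 rows shared by 4 processors.
assignment = {p: rows_for_processor(p, 10, 4) for p in range(4)}
for p, rows in assignment.items():
    print(f"Processor{p + 1}: rows {rows}")
```

Because every row is evaluated independently, no communication is needed during the ELEMENTS step itself; each processor simply loops over its own rows.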

Block Algorithm for Cholesky factorization
Triangular factorization (U: upper triangular matrix) [formula not captured in the transcript]
Small Cholesky factorization
Block updates ⇒ Parallel Computing
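A minimal sketch of a blocked Cholesky factorization B = U^T U follows, written in pure Python for readability. It is a generic textbook variant (small factorization of the diagonal block, triangular solve for the panel, update of the trailing block), not the ScaLAPACK routine that SDPARA calls; the function names and the block size are hypothetical.

```python
import math


def cholesky_upper(a):
    """Unblocked Cholesky: upper-triangular U with a = U^T U."""
    n = len(a)
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        u[i][i] = math.sqrt(a[i][i] - sum(u[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            u[i][j] = (a[i][j] - sum(u[k][i] * u[k][j] for k in range(i))) / u[i][i]
    return u


def blocked_cholesky_upper(a, nb):
    """Blocked Cholesky with block size nb; returns U with a = U^T U."""
    n = len(a)
    a = [row[:] for row in a]  # working copy, updated in place
    u = [[0.0] * n for _ in range(n)]
    for s in range(0, n, nb):
        e = min(s + nb, n)
        # 1. Small Cholesky factorization of the diagonal block.
        u11 = cholesky_upper([row[s:e] for row in a[s:e]])
        for i in range(s, e):
            for j in range(s, e):
                u[i][j] = u11[i - s][j - s]
        # 2. Panel: solve U11^T U12 = A12 by forward substitution.
        for j in range(e, n):
            for i in range(s, e):
                acc = a[i][j] - sum(u[k][i] * u[k][j] for k in range(s, i))
                u[i][j] = acc / u[i][i]
        # 3. Block update of the trailing matrix: A22 -= U12^T U12.
        #    Each entry is independent, so this step parallelizes well.
        for i in range(e, n):
            for j in range(i, n):
                a[i][j] -= sum(u[k][i] * u[k][j] for k in range(s, e))
                a[j][i] = a[i][j]
    return u
```

Most of the arithmetic sits in step 3, which is a matrix-matrix update and is the part that is spread over processors.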

Two-dimensional block-cyclic distribution (TDBCD)
Used by the ScaLAPACK library
Redistributing from the row-wise distribution to TDBCD requires network communication.
Cholesky factorization on TDBCD is much faster than on the row-wise distribution.
[Figure: example assignment of matrix blocks to Processor1, Processor2, Processor3, Processor4]
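The ownership rule of a two-dimensional block-cyclic distribution can be written down directly. The sketch assumes 0-based indices, a square block size `nb`, and a process grid of `grid_rows` by `grid_cols` with the first block on process (0, 0); these names and conventions are illustrative, not taken from the slides.

```python
def block_cyclic_owner(i, j, nb, grid_rows, grid_cols):
    """Grid coordinates of the process owning matrix entry (i, j)."""
    return ((i // nb) % grid_rows, (j // nb) % grid_cols)


# Hypothetical example: 8x8 matrix, 2x2 blocks, 2x2 process grid.
layout = [[block_cyclic_owner(i, j, 2, 2, 2) for j in range(8)] for i in range(8)]
for row in layout:
    # Print the owner as a process number 1..4, row-major over the grid.
    print(" ".join(str(pr * 2 + pc + 1) for pr, pc in row))
```

Because blocks are dealt out cyclically in both directions, every process keeps a share of the trailing matrix as the factorization advances, which balances the block updates.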

Numerical Results of SDPARA
Quantum Chemistry (m=7230, SCM=100%), middle size
SDPARA 7.3.1, Xeon X5460, 3.16GHz x2, 48GB memory
ELEMENTS: 15x speedup
CHOLESKY: 12x speedup
Total: 13x speedup
Very FAST!!

Acceleration by Multiple Threading
Modern processors have multiple cores.
Multiple threading is becoming common.
Two-level parallel computing: 2 processors x 2 threads on each processor (Processor1:Thread1, Processor1:Thread2, Processor2:Thread1, Processor2:Thread2)

Comparison with PCSDP (developed by Ivanov & de Klerk)
SDP: B.2P Quantum Chemistry (m = 7230, SCM = 100%)
Xeon X5460, 3.16GHz x2 (8 cores), 48GB memory
Time unit is second

Servers        1        2        4       8      16
PCSDP     53,768   27,854   14,273   7,995   4,050
SDPARA     5,983    3,002    1,680     901     565

SDPARA is 8x faster by MPI & Multi-Threading (two-level parallelization).
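The "8x faster" figure can be checked against the timings in the table. The short script below only redoes that arithmetic; the variable names are illustrative.

```python
servers = [1, 2, 4, 8, 16]
pcsdp = [53768, 27854, 14273, 7995, 4050]   # seconds, from the table
sdpara = [5983, 3002, 1680, 901, 565]       # seconds, from the table

# How many times faster SDPARA is than PCSDP at each server count.
ratio = [p / s for p, s in zip(pcsdp, sdpara)]

# SDPARA's own speedup relative to one server, and the parallel efficiency.
scaling = [sdpara[0] / t for t in sdpara]
efficiency = [sp / n for sp, n in zip(scaling, servers)]

for n, r, sp, eff in zip(servers, ratio, scaling, efficiency):
    print(f"{n:2d} servers: SDPARA/PCSDP ratio {r:.2f}x, "
          f"SDPARA speedup {sp:.2f}x, efficiency {eff:.0%}")
```

The ratio stays between roughly 7x and 9x across all server counts, so "8x" is a fair summary of the table.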

Extremely Large-Scale SDPs
Other solvers can handle only [bound not captured in the transcript]

Problem                m     SCM   time
Esc32_b (QAP)    198,432    100%   129,186 seconds (1.5 days)

16 Servers [Xeon X5670 (2.93GHz), 128GB Memory]
The LARGEST solved SDP in the world

Numerical Accuracy
One weak point of PDIPM: PDIPM requires [formula not captured in the transcript].
Eventually this leads to numerical trouble (often, the Cholesky factorization fails), for example [example not captured in the transcript].

Numerical Precision
Ordinary double precision in C or C++: 64 bits = 1 bit (sign) + 11 bits (exponent) + 52 bits (stored fraction; 53 bits of precision with the implicit leading bit); accuracy = [value not captured in the transcript]
Arbitrary precision in the GMP library: we can set the number of bits in the fraction part arbitrarily (for example, 200 bits = [value not captured in the transcript]).
SDPA-GMP: replace BLAS (Basic Linear Algebra Subprograms) by MPLAPACK (Multiple precision LAPACK).
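The relation between fraction bits and decimal digits is simple arithmetic: each bit contributes log10(2) digits. The sketch below computes it and then shows the effect of extra precision using Python's standard `decimal` module, which is only a stand-in here for GMP (SDPA-GMP itself uses the GMP library, not `decimal`).

```python
import math
from decimal import Decimal, getcontext


def decimal_digits(fraction_bits):
    """Approximate number of decimal digits carried by a binary fraction."""
    return fraction_bits * math.log10(2)


print(f"53 bits  -> about {decimal_digits(53):.1f} decimal digits")
print(f"200 bits -> about {decimal_digits(200):.1f} decimal digits")

# In double precision a perturbation of 1e-20 is lost when added to 1.
double_sum = 1.0 + 1e-20
print(double_sum == 1.0)

# With 60 significant digits the same perturbation is kept.
getcontext().prec = 60
wide_sum = Decimal(1) + Decimal("1e-20")
print(wide_sum)
```

This is the mechanism behind the accuracy gain: quantities that fall below the double-precision rounding level near the optimum remain representable when the fraction is widened.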

Numerically Hard Problem
PDIPM is stable if Slater's condition holds.
Test problem: a Graph Partition Problem, which has no interior, with a perturbation parameter [formula not captured in the transcript]
Small parameter ⇒ Numerically Hard

Numerical Results of SDPA-GMP
Small parameter ⇒ Numerically Hard

Parameter         Solver      Accuracy    Time (second)
1.0e-1            SDPA        1.08e-8          2.03
                  SDPA-GMP    4.80e-48     77760.19
1.0e-15           SDPA        1.63e-7          2.26
                  SDPA-GMP    2.97e-48     82115.52
[not captured]    SDPA        5.26e-9          2.36
                  SDPA-GMP    7.29e-24    105325.74

24 digits even for the no-interior case
SDPA-GMP uses 300 digits.

Conclusion
SDPARA ⇒ How Fast & How Large (100 times & [remainder not captured in the transcript])
SDPA-GMP ⇒ How Accurate
http://sdpa.sf.net/ & Online solver
Thank you very much for your attention.