Scalable Stochastic Programming Cosmin G. Petra Mathematics and Computer Science Division Argonne National Laboratory Joint work with.

Slides:



Advertisements
Similar presentations
Dense Linear Algebra (Data Distributions) Sathish Vadhiyar.
Advertisements

Load Balancing Parallel Applications on Heterogeneous Platforms.
Accelerated, Parallel and PROXimal coordinate descent IPAM February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv: )
A NOVEL APPROACH TO SOLVING LARGE-SCALE LINEAR SYSTEMS Ken Habgood, Itamar Arel Department of Electrical Engineering & Computer Science GABRIEL CRAMER.
Electrical and Computer Engineering Mississippi State University
A Discrete Adjoint-Based Approach for Optimization Problems on 3D Unstructured Meshes Dimitri J. Mavriplis Department of Mechanical Engineering University.
Eigenvalue and eigenvectors  A x = λ x  Quantum mechanics (Schrödinger equation)  Quantum chemistry  Principal component analysis (in data mining)
Graph Laplacian Regularization for Large-Scale Semidefinite Programming Kilian Weinberger et al. NIPS 2006 presented by Aggeliki Tsoli.
Scalable Stochastic Programming Cosmin Petra and Mihai Anitescu Mathematics and Computer Science Division Argonne National Laboratory Informs Computing.
Numerical Parallel Algorithms for Large-Scale Nanoelectronics Simulations using NESSIE Eric Polizzi, Ahmed Sameh Department of Computer Sciences, Purdue.
Numerical Algorithms Matrix multiplication
Applications of Stochastic Programming in the Energy Industry Chonawee Supatgiat Research Group Enron Corp. INFORMS Houston Chapter Meeting August 2, 2001.
OpenFOAM on a GPU-based Heterogeneous Cluster
Stochastic Optimization of Complex Energy Systems on High-Performance Computers Cosmin G. Petra Mathematics and Computer Science Division Argonne National.
Multilevel Incomplete Factorizations for Non-Linear FE problems in Geomechanics DMMMSA – University of Padova Department of Mathematical Methods and Models.
Scalable Multi-Stage Stochastic Programming Cosmin Petra and Mihai Anitescu Mathematics and Computer Science Division Argonne National Laboratory DOE Applied.
ISPDC 2007, Hagenberg, Austria, 5-8 July On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors Alexey Lastovetsky School of.
Distributed Combinatorial Optimization
Thomas algorithm to solve tridiagonal matrices
Monica Garika Chandana Guduru. METHODS TO SOLVE LINEAR SYSTEMS Direct methods Gaussian elimination method LU method for factorization Simplex method of.
Higher-order Confidence Intervals for Stochastic Programming using Bootstrapping Cosmin G. Petra Joint work with Mihai Anitescu Mathematics and Computer.
ME964 High Performance Computing for Engineering Applications “The real problem is not whether machines think but whether men do.” B. F. Skinner © Dan.
On parallelizing dual decomposition in stochastic integer programming
1 Capacity Planning under Uncertainty Capacity Planning under Uncertainty CHE: 5480 Economic Decision Making in the Process Industry Prof. Miguel Bagajewicz.
Exercise problems for students taking the Programming Parallel Computers course. Janusz Kowalik Piotr Arlukowicz Tadeusz Puzniakowski Informatics Institute.
MUMPS A Multifrontal Massively Parallel Solver IMPLEMENTATION Distributed multifrontal.
Parallel Performance of Hierarchical Multipole Algorithms for Inductance Extraction Ananth Grama, Purdue University Vivek Sarin, Texas A&M University Hemant.
Optimization Under Uncertainty: Structure-Exploiting Algorithms Victor M. Zavala Assistant Computational Mathematician Mathematics and Computer Science.
An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.
Calculating Discrete Logarithms John Hawley Nicolette Nicolosi Ryan Rivard.
Scalable Stochastic Programming Cosmin G. Petra Mathematics and Computer Science Division Argonne National Laboratory Joint work with.
Solving Scalar Linear Systems Iterative approach Lecture 15 MA/CS 471 Fall 2003.
Using LU Decomposition to Optimize the modconcen.m Routine Matt Tornowske April 1, 2002.
Are their more appropriate domain-specific performance metrics for science and engineering HPC applications available then the canonical “percent of peak”
Scalabilities Issues in Sparse Factorization and Triangular Solution Sherry Li Lawrence Berkeley National Laboratory Sparse Days, CERFACS, June 23-24,
Optimization for Operation of Power Systems with Performance Guarantee
Scalable Multi-Stage Stochastic Programming
ParCFD Parallel computation of pollutant dispersion in industrial sites Julien Montagnier Marc Buffat David Guibert.
Statistical Sampling-Based Parametric Analysis of Power Grids Dr. Peng Li Presented by Xueqian Zhao EE5970 Seminar.
Scalable Symbolic Model Order Reduction Yiyu Shi*, Lei He* and C. J. Richard Shi + *Electrical Engineering Department, UCLA + Electrical Engineering Department,
Stochastic optimization of energy systems Cosmin Petra Argonne National Laboratory.
1 Incorporating Iterative Refinement with Sparse Cholesky April 2007 Doron Pearl.
Javad Lavaei Department of Electrical Engineering Columbia University Joint work with Somayeh Sojoudi and Ramtin Madani An Efficient Computational Method.
1 Efficient Parallel Software for Large-Scale Semidefinite Programs Makoto Tokyo-Tech Katsuki Chuo University MSC Yokohama.
INFOMRS Charlotte1 Parallel Computation for SDPs Focusing on the Sparsity of Schur Complements Matrices Makoto Tokyo Tech Katsuki Fujisawa.
Computing Simulation in Orders Based Transparent Parallelizing Pavlenko Vitaliy Danilovich, Odessa National Polytechnic University Burdeinyi Viktor Viktorovych,
On implicit-factorization block preconditioners Sue Dollar 1,2, Nick Gould 3, Wil Schilders 2,4 and Andy Wathen 1 1 Oxford University Computing Laboratory,
Case Study in Computational Science & Engineering - Lecture 5 1 Iterative Solution of Linear Systems Jacobi Method while not converged do { }
October 2008 Integrated Predictive Simulation System for Earthquake and Tsunami Disaster CREST/Japan Science and Technology Agency (JST)
Chapter 9 Gauss Elimination The Islamic University of Gaza
Dense Linear Algebra Sathish Vadhiyar. Gaussian Elimination - Review Version 1 for each column i zero it out below the diagonal by adding multiples of.
Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA
ECE 530 – Analysis Techniques for Large-Scale Electrical Systems Prof. Hao Zhu Dept. of Electrical and Computer Engineering University of Illinois at Urbana-Champaign.
Outline Introduction Research Project Findings / Results
A Local Relaxation Approach for the Siting of Electrical Substations Walter Murray and Uday Shanbhag Systems Optimization Laboratory Department of Management.
1 Parallel Software for SemiDefinite Programming with Sparse Schur Complement Matrix Makoto Tokyo-Tech Katsuki Chuo University Mituhiro.
Report from LBNL TOPS Meeting TOPS/ – 2Investigators  Staff Members:  Parry Husbands  Sherry Li  Osni Marques  Esmond G. Ng 
Monte Carlo Linear Algebra Techniques and Their Parallelization Ashok Srinivasan Computer Science Florida State University
ANSYS, Inc. Proprietary © 2004 ANSYS, Inc. Chapter 5 Distributed Memory Parallel Computing v9.0.
Divide and Conquer Algorithms Sathish Vadhiyar. Introduction  One of the important parallel algorithm models  The idea is to decompose the problem into.
Conjugate gradient iteration One matrix-vector multiplication per iteration Two vector dot products per iteration Four n-vectors of working storage x 0.
LLNL-PRES This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
A Scalable Parallel Preconditioned Sparse Linear System Solver Murat ManguoğluMiddle East Technical University, Turkey Joint work with: Ahmed Sameh Purdue.
Parallel Direct Methods for Sparse Linear Systems
Hui Liu University of Calgary
A computational loop k k Integration Newton Iteration
P A R A L L E L C O M P U T I N G L A B O R A T O R Y
Feifei Li, Ching Chang, George Kollios, Azer Bestavros
Dense Linear Algebra (Data Distributions)
A computational loop k k Integration Newton Iteration
Presentation transcript:

Scalable Stochastic Programming Cosmin G. Petra Mathematics and Computer Science Division Argonne National Laboratory Joint work with Mihai Anitescu and Miles Lubin

Motivation  Sources of uncertainty in complex energy systems –Weather –Consumer Demand –Market prices  – Anitescu, Constantinescu, Zavala –Stochastic Unit Commitment with Wind Power Generation –Energy management of Co-generation –Economic Optimization of a Building Energy System 2

Stochastic Unit Commitment with Wind Power  Wind Forecast – WRF(Weather Research and Forecasting) Model –Real-time grid-nested 24h simulation –30 samples require 1h on 500 CPUs 3 Slide courtesy of V. Zavala & E. Constantinescu Wind farm Thermal generator

Economic Optimization of a Building Energy System 4  Proactive management - temperature forecasting & electricity prices  Minimize daily energy costs Slide courtesy of V. Zavala

Optimization under Uncertainty  Two-stage stochastic programming with recourse (“here-and-now”)  5 subj. to. continuous discrete Sampling Inference Analysis M samples Sample average approximation (SAA) subj. to.

Linear Algebra of Primal-Dual Interior-Point Methods 6 subj. to. Min Convex quadratic problem IPM Linear System Two-stage SP arrow-shaped linear system (via a permutation) Multi-stage SP nested

7 The Direct Schur Complement Method (DSC)  Uses the arrow shape of H 1.Implicit factorization 2. Solving Hz=r 2.1. Back substitution 2.2. Diagonal Solve 2.3. Forward substitution

Parallelizing DSC – 1. Factorization phase 8 2. Backsolve Process 1 Process 2 Process p Process 1 Factorization of the 1 st stage Schur complement matrix = BOTTLENECK Sparse linear algebra Dense linear algebra

Parallelizing DSC – 2. Backsolve 9 Process 1 Process 2 Process p Process 1 Process 2 Process p 1 st stage backsolve = BOTTLENECK 1.Factorization Sparse linear algebra Dense linear algebra

Scalability of DSC 10 Unit commitment 76.7% efficiency but not always the case Large number of 1 st stage variables: 38.6% efficiency on Argonne

BOTTLENECK SOLUTION 1: STOCHASTIC PRECONDITIONER 11

Preconditioned Schur Complement (PSC) 12 (separate process) REMOVES the factorization bottleneck Slightly larger backsolve bottleneck

The Stochastic Preconditioner  The exact structure of C is  IID subset of n scenarios:  The stochastic preconditioner (Petra & Anitescu, 2010)  For C use the constraint preconditioner (Keller et. al., 2000) 13

The “Ugly” Unit Commitment Problem 14  DSC on P processes vs PSC on P+1 process Optimal use of PSC – linear scaling Factorization of the preconditioner can not be hidden anymore. 120 scenarios

Quality of the Stochastic Preconditioner  “Exponentially” better preconditioning (Petra & Anitescu 2010)  Proof: Hoeffding inequality  Assumptions on the problem’s random data 1.Boundedness 2.Uniform full rank of and 15 not restrictive

Quality of the Constraint Preconditioner  has an eigenvalue 1 with order of multiplicity.  The rest of the eigenvalues satisfy  Proof: based on Bergamaschi et. al.,

The Krylov Methods Used for  BiCGStab using constraint preconditioner M  Preconditioned Projected CG (PPCG) (Gould et. al., 2001) –Preconditioned projection onto the –Does not compute the basis for Instead, – 17

Performance of the preconditioner  Eigenvalues clustering & Krylov iterations  Affected by the well-known ill-conditioning of IPMs. 18

SOLUTION 2: PARALELLIZATION OF STAGE 1 LINEAR ALGEBRA 19

Parallelizing the 1 st stage linear algebra  We distribute the 1 st stage Schur complement system.  C is treated as dense.  Alternative to PSC for problems with large number of 1 st stage variables.  Removes the memory bottleneck of PSC and DSC.  We investigated ScaLapack, Elemental (successor of PLAPACK) –None have a solver for symmetric indefinite matrices (Bunch-Kaufman); –LU or Cholesky only. –So we had to think of modifying either. 20 dense symm. pos. def., sparse full rank.

ScaLapack (ORNL) 21  Classical block distribution of the matrix  Blocked “down-looking” Cholesky - algorithmic blocks  Size of algorithmic block = size of distribution block!  For cache-performance - large algorithmic blocks  For good load balancing - small distribution blocks  Must trade off cache-performance for load balancing  Communication: basic MPI calls  Inflexible in working with sub-blocks

Elemental (UT Austin) 22  Unconventional “elemental” distribution: blocks of size 1.  Size of algorithmic block size of distribution block  Both cache-performance (large alg. blocks) and load balancing (distrib. blocks of size 1)  Communication  More sophisticated MPI calls  Overhead O(log(sqrt(p))), p is the number of processors.  Sub-blocks friendly  Better performance in a hybrid approach, MPI+SMP, than ScaLapack

Cholesky-based -like factorization   Can be viewed as an “implicit” normal equations approach.  In-place implementation inside Elemental: no extra memory needed.  Idea: modify the Cholesky factorization, by changing the sign after processing p columns.  It is much easier to do in Elemental, since this distributes elements, not blocks.  Twice as fast as LU  Works for more general saddle-point linear systems, i.e., pos. semi-def. (2,2) block. 23

Distributing the 1 st stage Schur complement matrix  All processors contribute to all of the elements of the (1,1) dense block  A large amount of inter-process communication occurs.  Possibly more costly than the factorization itself.  Solution: use buffer to reduce the number of messages when doing a Reduce_scatter.  approach also reduces the communication by half – only need to send lower triangle. 24

Reduce operations 25  Streamlined copying procedure - Lubin and Petra (2010)  Loop over continuous memory and copy elements in send buffer  Avoids divisions and modulus ops needed to compute the positions  “Symmetric” reduce for  Only lower triangle is reduced  Fixed buffer size  A variable number of columns reduced.  Effectively halves the communication (both data & # of MPI calls).

Large-scale performance 26  First-stage linear algebra: ScaLapack (LU), Elemental(LU), and  Strong scaling of PIPS with and  90.1% from 64 to 1024 cores  75.4% from 64 to 2048 cores  > 4,000 scenarios SAA problem: 1 st stage variables: 82,000 Total #: 189 million Thermal units: 1,000 Wind farms: 1,200

Towards real-life models – UC with transmission constraints  Current status: ISOs (Independent system operator) uses UC with –deterministic wind profiles, market prices and demand –network (transmission) constraints –Outer 1-h timestep 24 horizon simulation –Inner 5-min timestep 1h horizon corrections  Stochastic UC with transmission constraints (V. Zavala 2010) –Stochastic wind profiles & transmission constraints –Deterministic market prices and demand –24 horizon with 1h timestep –Kirchoff’s laws are part of the constraints –The problem is huge: KKT systems are 1.8 Bil x 1.8 Bil 27 Generator Load node

Solving UC with transmission constraints  32k wind scenarios (k=1024)  32k nodes (131,072 cores) on Intrepid BG/P  Hybrid programming model: SMP inside MPI –Sparse 2 nd -stage linear algebra: WSMP (IBM) –Dense 1 st -stage linear algebra: Elemental in SMP mode  Time to solution: 20h = 4h for loading + 16h for solving  Flop count is aprox. 1% of the peak  For a 4h Horizon problem very good strong scaling 28

Real-time solution 29  Exploiting the structure  Time-coupling in the second-stage -> block tridiagonal linear systems.  Sparse backsolves must be used.  Speedup in the backsolve phase > 20x is obtained.  Preliminary tests for 24h horizon: time to solution < 1h (after loading).  Flop count does not necessarily increase.  Strong scaling does not change by much in the case of the 24h horizon problem. The structure of the KKT system for a general 2-stage SP problem The second-stage KKT systems has a block tridiagonal structure given by the time-coupling

Concluding remarks  PIPS – parallel interior-point solver for stochastic SAA problems  Specialized linear algebra layer –Small-sized 1 st -stage subproblems  DSC –Medium-sized 1 st -stage  PSC –Large-sized 1 st -stage  Distributed SC  Scenario parallelization in a hybrid programming model MPI+SMP 30

Thank you for your attention! Questions? 31