Scalable Stochastic Programming
Cosmin G. Petra
Mathematics and Computer Science Division, Argonne National Laboratory
Joint work with Mihai Anitescu, Miles Lubin, and Victor Zavala

Motivation
 Sources of uncertainty in complex energy systems
– Weather
– Consumer demand
– Market prices
 Applications (Anitescu, Constantinescu, Zavala)
– Stochastic unit commitment with wind power generation
– Energy management of co-generation
– Economic optimization of a building energy system

Stochastic Unit Commitment with Wind Power
 Wind forecast – WRF (Weather Research and Forecasting) model
– Real-time grid-nested 24h simulation
– 30 samples require 1h on 500 CPUs
 Thermal unit schedule?
– Minimize cost
– Satisfy demand / adopt wind power
– Maintain a reserve
– Respect technological constraints
(Figure: wind farm and thermal generators; slide courtesy of V. Zavala & E. Constantinescu)

Optimization under Uncertainty
 Two-stage stochastic programming with recourse ("here-and-now"): the first-stage decision x is taken before the uncertainty ξ is revealed, and a second-stage (recourse) decision y corrects it for each realization. In standard form,
   min_x  c^T x + E_ξ[ Q(x, ξ) ]   subj. to  Ax = b, x ≥ 0,
   where Q(x, ξ) = min_y  q(ξ)^T y   subj. to  W(ξ) y = h(ξ) − T(ξ) x, y ≥ 0.
 The continuous distribution of ξ is approximated by a discrete one through sampling (statistical inference, M batches).
 Sample average approximation (SAA) with N sampled scenarios:
   min_x  c^T x + (1/N) Σ_{i=1}^{N} Q(x, ξ_i)   subj. to  Ax = b, x ≥ 0.
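To make the SAA construction concrete, here is a minimal Python sketch (using scipy.optimize.linprog) that evaluates the SAA objective of a tiny two-stage LP for a fixed first-stage decision. The problem data, shapes, and names (recourse_value, saa_objective) are illustrative assumptions, not the energy models from the slides.

```python
# Minimal SAA sketch for a toy two-stage LP with a random right-hand side.
import numpy as np
from scipy.optimize import linprog

def recourse_value(x, h_i, q, W, T):
    """Solve the second-stage LP Q(x, xi_i) for one sampled scenario."""
    res = linprog(q, A_eq=W, b_eq=h_i - T @ x, bounds=(0, None), method="highs")
    return res.fun

def saa_objective(x, c, q, W, T, h_samples):
    """Sample average approximation of c^T x + E[Q(x, xi)]."""
    recourse = [recourse_value(x, h_i, q, W, T) for h_i in h_samples]
    return c @ x + np.mean(recourse)

# Tiny illustrative data: N = 1000 sampled right-hand sides.
rng = np.random.default_rng(0)
c, q = np.array([1.0, 2.0]), np.array([1.5, 1.0])
W, T = np.array([[1.0, 1.0]]), np.array([[1.0, 0.5]])
h_samples = 4.0 + rng.normal(scale=0.5, size=(1000, 1))
x = np.array([1.0, 1.0])          # a candidate first-stage decision
print(saa_objective(x, c, q, W, T, h_samples))
```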

Solving the SAA problem – the PIPS solver
 Interior-point methods (IPMs)
– Polynomial iteration complexity in theory
– Infeasible primal-dual path-following IPMs perform even better in practice: only a modest number of iterations is observed for n below 10 million, and we can confirm this remains true for n a hundred times larger
– Two linear systems are solved at each iteration
– Direct solvers need to be used because the IPM linear systems are ill-conditioned and must be solved accurately
 We solve the SAA problems with a standard IPM (Mehrotra's predictor-corrector) and specialized linear algebra – the PIPS solver

Linear Algebra of Primal-Dual Interior-Point Methods
 The SAA problem is a convex quadratic program: min (1/2) x^T Q x + c^T x subj. to Ax = b, x ≥ 0.
 At each IPM iteration, linear systems with the (symmetric indefinite) KKT matrix H are solved.
 For two-stage SP with S scenarios, the system is arrow-shaped (modulo a permutation):

   H = [ H_1                      B_1 ]
       [      H_2                 B_2 ]
       [           ...            ... ]
       [                H_S       B_S ]
       [ B_1^T B_2^T ... B_S^T    H_0 ]

   with one diagonal block H_i per scenario, a first-stage block H_0, and coupling blocks B_i.
 For multi-stage SP the structure is nested.

The Direct Schur Complement Method (DSC)
 Uses the arrow shape of H.
 1. Implicit factorization: factor each scenario block H_i and form the first-stage Schur complement
    C = H_0 − Σ_{i=1}^{S} B_i^T H_i^{-1} B_i.
 2. Solving Hz = r:
 2.1. Backward substitution: r̃_0 = r_0 − Σ_{i=1}^{S} B_i^T H_i^{-1} r_i.
 2.2. Diagonal solve: C z_0 = r̃_0.
 2.3. Forward substitution: z_i = H_i^{-1} (r_i − B_i z_0), i = 1, …, S.
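A dense, single-process sketch of the four DSC steps above, assuming the arrow blocks H_i, B_i, H_0 are available as NumPy arrays. PIPS uses a sparse symmetric indefinite solver (MA57) for the scenario blocks; this illustration uses plain LU from SciPy and random data.

```python
# Dense sketch of the direct Schur complement (DSC) solve of the arrow system.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def dsc_solve(H0, H_blocks, B_blocks, r0, r_blocks):
    # 1. Implicit factorization: factor each scenario block and accumulate
    #    the first-stage Schur complement C = H0 - sum_i B_i^T H_i^{-1} B_i.
    facts = [lu_factor(Hi) for Hi in H_blocks]
    C = H0 - sum(Bi.T @ lu_solve(fact, Bi) for fact, Bi in zip(facts, B_blocks))

    # 2.1 Backward substitution: form the first-stage right-hand side.
    rhs0 = r0 - sum(Bi.T @ lu_solve(fact, ri)
                    for fact, Bi, ri in zip(facts, B_blocks, r_blocks))
    # 2.2 Diagonal solve: dense first-stage system C z0 = rhs0.
    z0 = np.linalg.solve(C, rhs0)
    # 2.3 Forward substitution: recover the scenario unknowns.
    z_blocks = [lu_solve(fact, ri - Bi @ z0)
                for fact, Bi, ri in zip(facts, B_blocks, r_blocks)]
    return z0, z_blocks

# Tiny random usage example with 2 scenarios.
rng = np.random.default_rng(0)
n0, ni = 3, 5
H0 = 10.0 * np.eye(n0)
H_blocks = [rng.normal(size=(ni, ni)) + ni * np.eye(ni) for _ in range(2)]
B_blocks = [rng.normal(size=(ni, n0)) for _ in range(2)]
r0, r_blocks = rng.normal(size=n0), [rng.normal(size=ni) for _ in range(2)]
z0, z_blocks = dsc_solve(H0, H_blocks, B_blocks, r0, r_blocks)
```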

Parallelizing DSC – 1. Factorization phase
 Scenarios are distributed across processes 1..p; each process factorizes its scenario blocks and forms its contributions to the Schur complement (sparse linear algebra, MA57).
 Factorization of the 1st-stage Schur complement matrix (dense linear algebra, LAPACK) = BOTTLENECK.
(Diagram: per-process breakdown of the factorization phase.)

Parallelizing DSC – 2. Triangular solves
 Each process performs the backsolves for its scenario blocks (sparse linear algebra).
 1st-stage backsolve (dense linear algebra) = BOTTLENECK.
(Diagram: per-process breakdown of the triangular-solve phase.)

Implementation of DSC
 Factorization phase, on each of procs 1..p: scenario factorizations and backsolves, communication (MPI_Allreduce of the Schur complement contributions), then a dense factorization of the first-stage matrix.
 Triangular solves: scenario backsolves, communication, dense first-stage solve, forward substitution.
 The 1st-stage (dense) computations are replicated on each process.
(Diagram: per-process timeline of the factorization and triangular-solve phases on procs 1..p.)
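A rough mpi4py sketch of the replicated-first-stage pattern described above. Dense NumPy/SciPy stands in for MA57/LAPACK, and the synthetic data, the scenario_blocks helper, and the round-robin scenario ownership are assumptions of this sketch, not PIPS itself.

```python
# Sketch of the parallel DSC factorization phase: local work + MPI_Allreduce,
# followed by a replicated dense first-stage factorization (the bottleneck).
from mpi4py import MPI
import numpy as np
from scipy.linalg import lu_factor, lu_solve

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

n0, ni, S = 8, 20, 32                      # 1st-stage size, scenario size, #scenarios
H0 = 50.0 * np.eye(n0)                     # synthetic 1st-stage block

def scenario_blocks(s):
    """Synthetic stand-in for reading scenario s's KKT blocks (H_s, B_s)."""
    g = np.random.default_rng(s)
    H = g.normal(size=(ni, ni)); H = H @ H.T + ni * np.eye(ni)
    return H, g.normal(size=(ni, n0))

# Each process factorizes its scenarios and accumulates its share of
# sum_i B_i^T H_i^{-1} B_i.
local = np.zeros((n0, n0))
for s in range(rank, S, nprocs):           # round-robin scenario ownership
    H_s, B_s = scenario_blocks(s)
    local += B_s.T @ lu_solve(lu_factor(H_s), B_s)

acc = np.empty((n0, n0))
comm.Allreduce(local, acc, op=MPI.SUM)     # every process receives the full sum
C = H0 - acc

# Bottleneck: the dense 1st-stage factorization is replicated on every process.
C_fact = lu_factor(C)
```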

Scalability of DSC
 Unit commitment problem: 76.7% parallel efficiency – but this is not always the case.
 With a large number of 1st-stage variables: only 38.6% efficiency on an Argonne cluster.
(Plots: strong scaling for the two cases.)

BOTTLENECK SOLUTION 1: STOCHASTIC PRECONDITIONER

The Stochastic Preconditioner
 The exact structure of C: the first-stage (saddle-point) blocks minus one contribution per scenario.
 Take an IID subset of n << S scenarios and form the same quantity from that subset only.
 The result is the stochastic preconditioner (Petra & Anitescu, in COAP 2011).
 For C, use the constraint preconditioner (Keller et al., 2000).
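A minimal sketch of the sampling idea, assuming the scenario terms enter the Schur complement through a sum that the preconditioner replaces by a rescaled sum over the IID subset (the natural unbiased estimate). The exact blocks and scaling used in PIPS follow Petra & Anitescu (COAP 2011) and may differ; schur_term below is a synthetic stand-in.

```python
# Sketch: approximate the scenario sum in the Schur complement by a rescaled
# sum over a small IID subset of scenarios.
import numpy as np

def schur_term(s, n0=6, ni=12):
    """Synthetic stand-in for scenario s's contribution B_s^T H_s^{-1} B_s."""
    g = np.random.default_rng(s)
    H = g.normal(size=(ni, ni)); H = H @ H.T + ni * np.eye(ni)
    B = g.normal(size=(ni, n0))
    return B.T @ np.linalg.solve(H, B)

def exact_schur(H0, S):
    return H0 - sum(schur_term(s) for s in range(S))

def stochastic_preconditioner(H0, S, n, rng):
    subset = rng.choice(S, size=n, replace=False)            # IID subset, n << S
    return H0 - (S / n) * sum(schur_term(s) for s in subset) # rescaled subset sum

rng = np.random.default_rng(0)
H0 = 200.0 * np.eye(6)
C = exact_schur(H0, S=512)
C_hat = stochastic_preconditioner(H0, S=512, n=32, rng=rng)
print(np.linalg.norm(C - C_hat) / np.linalg.norm(C))          # modest relative error
```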

Implementation of PSC
 An extra process (proc p+1) builds the preconditioner: the scenario Schur-complement contributions are reduced to it (MPI_Reduce) and it performs the dense factorization of the preconditioner while procs 1..p continue their scenario factorizations and backsolves.
 Triangular solves: scenario backsolves, MPI_Reduce to proc 1, Krylov solve on the first stage with preconditioned triangular solves (communication with proc p+1), forward substitution, MPI_Bcast.
 REMOVES the factorization bottleneck; the solve bottleneck becomes slightly larger.
(Diagram: per-process timelines for procs 1..p and the preconditioner process p+1, which is otherwise idle.)

The “Ugly” Unit Commitment Problem
 DSC on P processes vs. PSC on P+1 processes (120 scenarios).
 Optimal use of PSC gives linear scaling.
 Beyond that point, the factorization of the preconditioner cannot be hidden anymore.
(Plot: scaling comparison.)

Quality of the Stochastic Preconditioner
 “Exponentially” better preconditioning (Petra & Anitescu, 2011).
 Proof: Hoeffding’s inequality.
 Assumptions on the problem’s random data (not restrictive):
 1. Boundedness
 2. Uniform full rank of the scenario constraint matrices

Quality of the Constraint Preconditioner
 The preconditioned matrix has an eigenvalue 1 with large order of multiplicity.
 The remaining eigenvalues satisfy explicit bounds.
 Proof: based on Bergamaschi et al.

The Krylov Methods Used for the First-Stage Solve
 BiCGStab using the constraint preconditioner M.
 Preconditioned Projected CG (PPCG) (Gould et al., 2001)
– Preconditioned projection onto the null space of the constraints
– Does not compute a basis for the null space; the projection is applied implicitly through the preconditioner.
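The sketch below shows the general pattern of a preconditioned BiCGStab solve with SciPy. C stands in for the first-stage matrix and C_hat for its (stochastic/constraint) preconditioner; both are synthetic here, and PPCG is not sketched.

```python
# Preconditioned BiCGStab: the preconditioner is factored once and applied
# through a LinearOperator inside the Krylov iteration.
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, bicgstab

n0 = 50
rng = np.random.default_rng(1)
C = rng.normal(size=(n0, n0)); C = C + C.T + 2.0 * n0 * np.eye(n0)  # stand-in system matrix
C_hat = C + 0.05 * rng.normal(size=(n0, n0))                        # stand-in preconditioner
r = rng.normal(size=n0)

prec = lu_factor(C_hat)                                # factor the preconditioner once
M = LinearOperator((n0, n0), matvec=lambda v: lu_solve(prec, v))

iters = [0]
def count(xk):                                         # count Krylov iterations
    iters[0] += 1

z, info = bicgstab(C, r, M=M, callback=count)
print(info, iters[0], np.linalg.norm(C @ z - r))       # info == 0 means convergence
```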

Performance of the preconditioner
 Eigenvalue clustering & Krylov iterations.
 Affected by the well-known ill-conditioning of IPMs.

SOLUTION 2: PARALLELIZATION OF STAGE 1 LINEAR ALGEBRA

Parallelizing the 1st-stage linear algebra
 We distribute the 1st-stage Schur complement system; C is treated as dense (its (1,1) block is dense symmetric positive definite and the constraint block is sparse with full rank).
 Alternative to PSC for problems with a large number of 1st-stage variables.
 Removes the memory bottleneck of PSC and DSC.
 We investigated ScaLAPACK and Elemental (the successor of PLAPACK):
– Neither has a solver for symmetric indefinite matrices (Bunch-Kaufman); LU or Cholesky only.
– So we had to modify one of them.

Cholesky-based LDL^T-like factorization
 Can be viewed as an “implicit” normal equations approach.
 In-place implementation inside Elemental: no extra memory needed.
 Idea: modify the Cholesky factorization by changing the sign of the pivots after processing the first p columns, giving K = LDL^T with D = diag(+I_p, −I).
 Much easier to do in Elemental, since it distributes elements, not blocks.
 Twice as fast as LU.
 Works for more general saddle-point linear systems, i.e., with a positive semidefinite (2,2) block.
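A small dense sketch of the sign-flipping idea, under the assumptions on the slide (positive definite (1,1) block of size p, full-rank constraint block). The actual PIPS implementation works in place on Elemental's distributed matrix, not column by column in NumPy.

```python
# "Signed Cholesky": Cholesky-like elimination with the pivot sign flipped
# after the first p columns, giving K = L D L^T with D = diag(+I_p, -I_{n-p}).
import numpy as np

def signed_cholesky(K, p):
    """Factor K = L @ np.diag(d) @ L.T with d = (+1,...,+1,-1,...,-1)."""
    n = K.shape[0]
    L = np.zeros_like(K, dtype=float)
    d = np.where(np.arange(n) < p, 1.0, -1.0)
    for j in range(n):
        t = K[j, j] - np.sum(d[:j] * L[j, :j] ** 2)
        L[j, j] = np.sqrt(d[j] * t)        # d[j]*t > 0 under the stated assumptions
        L[j+1:, j] = (K[j+1:, j] - L[j+1:, :j] @ (d[:j] * L[j, :j])) / (d[j] * L[j, j])
    return L, d

# Usage on a small saddle-point matrix [[Q, A^T], [A, 0]].
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 5)); Q = Q @ Q.T + 5 * np.eye(5)   # pos. def. (1,1) block
A = rng.normal(size=(2, 5))                                # full-rank constraints
K = np.block([[Q, A.T], [A, np.zeros((2, 2))]])
L, d = signed_cholesky(K, p=5)
print(np.allclose(L @ np.diag(d) @ L.T, K))                # True
```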

Distributing the 1st-stage Schur complement matrix
 All processes contribute to all of the elements of the (1,1) dense block.
 A large amount of inter-process communication occurs.
 Each term is too big to fit in a node’s memory.
 The communication can be more costly than the factorization itself.
 Solution: collective MPI_Reduce_scatter calls
– Reduce (sum) the terms, then partition the result and send each piece to its destination (scatter).
– Elements need to be reordered (packed) to match the matrix distribution.
– Columns of the Schur complement matrix are distributed as they are calculated.

DSC with distributed first stage
 The Schur complement matrix is computed and reduced block-wise (B blocks of columns): for each b = 1..B, procs 1..p perform the scenario factorizations/backsolves for that block and combine the result with MPI_Reduce_scatter.
 The distributed dense first-stage factorization and solves are handled by Elemental; the triangular-solve phase consists of scenario backsolves, MPI_Allreduce, and forward substitution.
(Diagram: per-process timeline including the Elemental phases.)

Reduce operations
 Streamlined copying procedure – Lubin and Petra (2010)
– Loop over contiguous memory and copy elements into the send buffer
– Avoids the division and modulus operations needed to compute the positions
 “Symmetric” reduce
– Only the lower triangle is reduced
– Fixed buffer size, with a variable number of columns reduced per call
– Effectively halves the communication (both data volume and number of MPI calls)
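A toy mpi4py sketch of the packed, "symmetric" reduce-scatter: each rank packs only the lower triangle of the columns owned by each destination, and Reduce_scatter sums the contributions and delivers each rank its piece. The 1-D block column ownership and the absence of the fixed-size-buffer pipelining are simplifications relative to the slides (Elemental uses a 2-D element-cyclic distribution).

```python
# Packed lower-triangle Reduce_scatter of one block of Schur complement columns.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

n0 = 8 * nprocs                                    # first-stage dimension (illustrative)
col_owner = np.array_split(np.arange(n0), nprocs)  # 1-D block of columns per rank

# This rank's partial sum of the block being reduced in this round
# (stand-in for its accumulated -B_s^T H_s^{-1} B_s terms).
local = (rank + 1.0) * np.ones((n0, n0))

def pack_lower(mat, cols):
    """Pack only the lower-triangular part of the given columns (halves the traffic)."""
    return np.concatenate([mat[j:, j] for j in cols])

sendbuf = np.concatenate([pack_lower(local, col_owner[d]) for d in range(nprocs)])
counts = [int(sum(n0 - j for j in col_owner[d])) for d in range(nprocs)]
recvbuf = np.empty(counts[rank])

# Sum everyone's contribution and scatter each rank its owned (packed) columns.
comm.Reduce_scatter(sendbuf, recvbuf, recvcounts=counts, op=MPI.SUM)
```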

Large-scale performance
 First-stage linear algebra: ScaLAPACK (LU), Elemental (LU), and the Cholesky-based LDL^T-like factorization.
 Strong scaling of PIPS with the distributed first-stage linear algebra:
– 90.1% efficiency from 64 to 1024 cores
– 75.4% efficiency from 64 to 2048 cores
– more than 4,000 scenarios, run on Fusion
 SAA problem: 82,000 1st-stage variables; 189 million variables in total; 1,000 thermal units; 1,200 wind farms.
 Lubin, Petra, Anitescu, in OMS 2011.

Towards real-life models – Economic dispatch with transmission constraints
 Current status: ISOs (independent system operators) use
– deterministic wind profiles, market prices, and demand
– network (transmission) constraints
– an outer simulation over a 24h horizon with a 1h timestep
– inner corrections over a 1h horizon with a 5-min timestep
 Stochastic ED with transmission constraints (V. Zavala et al., 2010)
– Stochastic wind profiles & transmission constraints
– Deterministic market prices and demand
– 24h horizon with a 1h timestep
– Kirchhoff’s laws are part of the constraints
– The problem is huge: the KKT systems are 1.8 billion x 1.8 billion
(Figure: transmission network with generators and load nodes/buses.)

Solving ED with transmission constraints on Intrepid BG/P
 32k wind scenarios (k = 1024)
 32k nodes (131,072 cores) on Intrepid BG/P
 Hybrid programming model: SMP inside MPI
– Sparse 2nd-stage linear algebra: WSMP (IBM)
– Dense 1st-stage linear algebra: Elemental with SMP BLAS, plus OpenMP for packing/unpacking buffers
 Very good strong scaling for a 4h-horizon problem
 Lubin, Petra, Anitescu, Zavala – in the proceedings of SC

Stochastic programming – a scalable computation pattern
 Scenario parallelization in a hybrid MPI+SMP programming model
– DSC, PSC (1st stage < 10,000 variables)
 Hybrid MPI/SMP running on Blue Gene/P
– 131k cores (96% strong scaling) for the Illinois ED problem with grid constraints: 2 billion variables, possibly the largest such problem ever solved
 Close to real-time solutions (24h horizon in 1h wallclock)
 Further development needed, since users aim for
– more uncertainty, more detail (x 10)
– faster dynamics → shorter decision window (x 10)
– longer horizons (California == 72 hours) (x 3)

Thank you for your attention! Questions?