Scalable Stochastic Programming
Cosmin G. Petra
Mathematics and Computer Science Division, Argonne National Laboratory
petra@mcs.anl.gov
Joint work with Mihai Anitescu and Miles Lubin
Motivation
Sources of uncertainty in complex energy systems:
– Weather
– Consumer demand
– Market prices
Applications @ Argonne – Anitescu, Constantinescu, Zavala:
– Stochastic unit commitment with wind power generation
– Energy management of co-generation
– Economic optimization of a building energy system
Stochastic Unit Commitment with Wind Power
Wind forecast – WRF (Weather Research and Forecasting) model:
– Real-time grid-nested 24h simulation
– 30 samples require 1h on 500 CPUs (Jazz@Argonne)
[Figure: wind farms and thermal generators on the grid]
Slide courtesy of V. Zavala & E. Constantinescu
Economic Optimization of a Building Energy System
– Proactive management: temperature forecasting & electricity prices
– Minimize daily energy costs
Slide courtesy of V. Zavala
Optimization under Uncertainty
Two-stage stochastic programming with recourse ("here-and-now"): a first-stage decision is made before the uncertainty is revealed, and second-stage (recourse) decisions react to each realization; the variables may be continuous or discrete.
Sampling and inference analysis: the expectation is replaced by an average over M sampled scenarios – the sample average approximation (SAA).
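The formulas on this slide did not survive extraction. For reference, a standard way to write the two-stage recourse problem and its SAA, in generic notation (c, A, b, q, T, W, h are assumptions here, not symbols taken from the slide):

```latex
\begin{aligned}
\min_{x}\;& c^{T}x + \mathbb{E}_{\xi}\left[Q(x,\xi)\right]
  &&\text{subj. to } Ax=b,\; x\ge 0,\\
Q(x,\xi) = \min_{y}\;& q(\xi)^{T}y
  &&\text{subj. to } T(\xi)x + W(\xi)y = h(\xi),\; y\ge 0.
\end{aligned}
```

The SAA problem replaces the expectation by the average over M sampled scenarios:

```latex
\min_{x}\; c^{T}x + \frac{1}{M}\sum_{i=1}^{M} Q(x,\xi_i)
\quad \text{subj. to } Ax=b,\; x\ge 0.
```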
Linear Algebra of Primal-Dual Interior-Point Methods
At each IPM iteration applied to the convex quadratic SAA problem, a large structured linear system has to be solved.
– Two-stage SP: the system is arrow-shaped (via a permutation) – a block-diagonal part from the scenarios bordered by the first-stage blocks.
– Multi-stage SP: the structure is nested.
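The matrix picture on this slide was lost in extraction. A sketch of the arrow (bordered block-diagonal) shape it refers to, in generic notation assumed here (H_i the scenario blocks, B_i the border blocks, H_0 the first-stage block):

```latex
\begin{bmatrix}
H_1 &        &     & B_1^{T}\\
    & \ddots &     & \vdots \\
    &        & H_N & B_N^{T}\\
B_1 & \cdots & B_N & H_0
\end{bmatrix}
\begin{bmatrix} z_1\\ \vdots\\ z_N\\ z_0 \end{bmatrix}
=
\begin{bmatrix} r_1\\ \vdots\\ r_N\\ r_0 \end{bmatrix}
```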
The Direct Schur Complement Method (DSC)
Uses the arrow shape of H:
1. Implicit factorization
2. Solving Hz = r
   2.1. Back substitution
   2.2. Diagonal solve
   2.3. Forward substitution
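A minimal dense NumPy sketch of these steps in the generic notation above; the production solver uses sparse factorizations per scenario and a dense factorization of the Schur complement, whereas this sketch simply calls np.linalg.solve:

```python
import numpy as np

def dsc_solve(H_blocks, B_blocks, H0, r_blocks, r0):
    """Sketch of the DSC steps for the arrow-shaped system above."""
    # 1. Implicit factorization: form the first-stage Schur complement
    C = H0.copy()
    for Hi, Bi in zip(H_blocks, B_blocks):
        C -= Bi @ np.linalg.solve(Hi, Bi.T)

    # 2.1. Back substitution: eliminate the scenario parts of the rhs
    r0_tilde = r0.copy()
    for Hi, Bi, ri in zip(H_blocks, B_blocks, r_blocks):
        r0_tilde -= Bi @ np.linalg.solve(Hi, ri)

    # 2.2. "Diagonal" solve with the dense first-stage Schur complement
    z0 = np.linalg.solve(C, r0_tilde)

    # 2.3. Forward substitution: recover the scenario unknowns
    z_blocks = [np.linalg.solve(Hi, ri - Bi.T @ z0)
                for Hi, Bi, ri in zip(H_blocks, B_blocks, r_blocks)]
    return z0, z_blocks
```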
Parallelizing DSC – 1. Factorization phase
Scenario factorizations (sparse linear algebra) are distributed across processes 1, 2, ..., p.
The factorization of the 1st-stage Schur complement matrix (dense linear algebra) runs on process 1 and is the BOTTLENECK.
Parallelizing DSC – 2. Backsolve
Scenario backsolves (sparse linear algebra) run in parallel on processes 1, 2, ..., p.
The 1st-stage backsolve (dense linear algebra) runs on process 1 and is the BOTTLENECK.
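A hypothetical mpi4py sketch of this parallel pattern with toy random data (sizes, names, and the dense NumPy solves are assumptions; the real solver uses sparse per-scenario factorizations and the specialized reduce described later in the talk):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n0, n_i, scen_per_rank = 8, 20, 4             # toy sizes, made up
rng = np.random.default_rng(rank)

# Each process handles its own scenarios and accumulates B_i H_i^{-1} B_i^T
S_local = np.zeros((n0, n0))
for _ in range(scen_per_rank):
    G = rng.standard_normal((n_i, n_i))
    Hi = G @ G.T + n_i * np.eye(n_i)          # stand-in scenario block (sparse in practice)
    Bi = rng.standard_normal((n0, n_i))       # border block coupling to stage 1
    S_local += Bi @ np.linalg.solve(Hi, Bi.T)

# Contributions are summed onto rank 0 ("process 1" on the slide)
S_total = np.zeros_like(S_local) if rank == 0 else None
comm.Reduce(S_local, S_total, op=MPI.SUM, root=0)

if rank == 0:
    H0 = (n0 + 1) * np.eye(n0)                # stand-in first-stage block
    C = H0 - S_total
    z0 = np.linalg.solve(C, rng.standard_normal(n0))  # stage-1 work: the serial bottleneck
```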
Scalability of DSC
Unit commitment: 76.7% parallel efficiency – but this is not always the case.
With a large number of 1st-stage variables: only 38.6% efficiency on Fusion @ Argonne.
BOTTLENECK SOLUTION 1: STOCHASTIC PRECONDITIONER
Preconditioned Schur Complement Method (PSC)
The first-stage Schur complement system is solved iteratively, with a preconditioner built and factored on a separate process.
REMOVES the factorization bottleneck.
Slightly larger backsolve bottleneck.
The Stochastic Preconditioner
– The exact structure of C is a sum over all scenarios (formula lost in extraction).
– Take an IID subset of n scenarios.
– The stochastic preconditioner (Petra & Anitescu, 2010) is built from this subset.
– For C, use the constraint preconditioner (Keller et al., 2000).
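The preconditioner formulas did not survive extraction. As an assumption consistent with the description (not a transcription of the slide), in the generic DSC notation used earlier the exact Schur complement sums over all N scenarios, while the preconditioner uses only the n sampled scenarios, rescaled by N/n:

```latex
C = H_0 - \sum_{i=1}^{N} B_i H_i^{-1} B_i^{T},
\qquad
M = H_0 - \frac{N}{n} \sum_{i \in S} B_i H_i^{-1} B_i^{T},
\qquad |S| = n .
```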
The “Ugly” Unit Commitment Problem (120 scenarios)
– DSC on P processes vs. PSC on P+1 processes.
– With optimal use of PSC: linear scaling.
– Eventually the factorization of the preconditioner cannot be hidden anymore.
Quality of the Stochastic Preconditioner
“Exponentially” better preconditioning (Petra & Anitescu, 2010).
Proof: Hoeffding inequality.
Assumptions on the problem’s random data (not restrictive):
1. Boundedness
2. Uniform full rank of [...] and [...] (matrix symbols lost in extraction)
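For reference, the classical scalar Hoeffding inequality underlying the proof (the matrix-sampling form actually used by Petra & Anitescu is not reproduced on the slide): for independent X_1, ..., X_n with a_i <= X_i <= b_i and S_n = X_1 + ... + X_n,

```latex
\Pr\left(\left|S_n - \mathbb{E}[S_n]\right| \ge t\right)
\le 2\exp\!\left(-\frac{2t^{2}}{\sum_{i=1}^{n}(b_i-a_i)^{2}}\right).
```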
Quality of the Constraint Preconditioner
The preconditioned matrix has the eigenvalue 1 with a known order of multiplicity; the remaining eigenvalues satisfy a cluster bound (the expressions were lost in extraction).
Proof: based on Bergamaschi et al., 2004.
The Krylov Methods Used
– BiCGStab with the constraint preconditioner M.
– Preconditioned Projected CG (PPCG) (Gould et al., 2001):
  – preconditioned projection onto the null space of the constraints;
  – does not compute a basis for the null space; instead, the projection is applied implicitly through solves with the constraint preconditioner.
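A small SciPy sketch of the usage pattern: the exact Schur complement C is only applied as an operator inside BiCGStab, while the preconditioner matrix is factored once and applied at every iteration. All matrices here are random stand-ins, not the problem data:

```python
import numpy as np
import scipy.sparse.linalg as spla
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 50
C = np.eye(n) + 0.1 * rng.standard_normal((n, n))     # stand-in exact Schur complement
C = 0.5 * (C + C.T)
C_precond = C + 0.01 * rng.standard_normal((n, n))    # stand-in sampled approximation
C_precond = 0.5 * (C_precond + C_precond.T)
r = rng.standard_normal(n)

C_op = spla.LinearOperator((n, n), matvec=lambda v: C @ v)   # C is never factored
lu, piv = lu_factor(C_precond)                               # preconditioner factored once
M_op = spla.LinearOperator((n, n), matvec=lambda v: lu_solve((lu, piv), v))

z0, info = spla.bicgstab(C_op, r, M=M_op)
print("info =", info, " residual =", np.linalg.norm(C @ z0 - r))
```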
Performance of the Preconditioner
Eigenvalue clustering translates into few Krylov iterations.
Performance is affected by the well-known ill-conditioning of IPMs.
SOLUTION 2: PARALLELIZATION OF STAGE 1 LINEAR ALGEBRA
Parallelizing the 1st-stage linear algebra
– We distribute the 1st-stage Schur complement system; C is treated as dense.
– Alternative to PSC for problems with a large number of 1st-stage variables.
– Removes the memory bottleneck of PSC and DSC.
– We investigated ScaLAPACK and Elemental (the successor of PLAPACK):
  – neither has a solver for symmetric indefinite matrices (Bunch-Kaufman);
  – LU or Cholesky only;
  – so we had to think of modifying one of them.
(Annotations from the slide: dense, symmetric positive definite; sparse, full rank.)
ScaLAPACK (ORNL)
– Classical block distribution of the matrix.
– Blocked “down-looking” Cholesky – algorithmic blocks.
– Size of algorithmic block = size of distribution block!
  – For cache performance: large algorithmic blocks.
  – For good load balancing: small distribution blocks.
  – Must trade off cache performance for load balancing.
– Communication: basic MPI calls.
– Inflexible in working with sub-blocks.
Elemental (UT Austin)
– Unconventional “elemental” distribution: blocks of size 1.
– Size of algorithmic block ≠ size of distribution block.
– Both cache performance (large algorithmic blocks) and load balancing (distribution blocks of size 1).
– Communication:
  – more sophisticated MPI calls;
  – overhead O(log(sqrt(p))), where p is the number of processors.
– Sub-block friendly.
– Better performance than ScaLAPACK in a hybrid MPI+SMP approach.
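A toy sketch of the difference between the two data distributions on a pr x pc process grid: ScaLAPACK-style 2D block-cyclic with distribution block nb versus Elemental's element-wise distribution (effectively nb = 1). The mapping formulas are the standard ones; the example indices and grid sizes are made up:

```python
def block_cyclic_owner(i, j, pr, pc, nb):
    """Process-grid coordinates owning global entry (i, j) under a 2D
    block-cyclic distribution with square distribution blocks of size nb."""
    return (i // nb) % pr, (j // nb) % pc

def elemental_owner(i, j, pr, pc):
    """Elemental-style distribution: single elements dealt round-robin."""
    return i % pr, j % pc

if __name__ == "__main__":
    pr, pc, nb = 2, 3, 4
    print(block_cyclic_owner(5, 9, pr, pc, nb))   # -> (1, 2)
    print(elemental_owner(5, 9, pr, pc))          # -> (1, 0): a different owner
```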
Cholesky-based LDL^T-like factorization
– Can be viewed as an “implicit” normal-equations approach.
– In-place implementation inside Elemental: no extra memory needed.
– Idea: modify the Cholesky factorization by changing the sign after processing p columns.
– Much easier to do in Elemental, since it distributes elements, not blocks.
– Twice as fast as LU.
– Works for more general saddle-point linear systems, i.e., with a positive semi-definite (2,2) block.
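A hypothetical dense NumPy sketch of the sign-flipping idea: an LDL^T with D = diag(+I, -I), which is what the description suggests. The sizes, data, and the assumption of a quasi-definite saddle-point test matrix are mine, not the slide's:

```python
import numpy as np

def signed_cholesky(K, n1):
    """Cholesky-style elimination that flips the pivot sign after the first
    n1 columns, producing K = L @ D @ L.T with D = diag(+I_n1, -I_n2)."""
    n = K.shape[0]
    L = np.zeros_like(K, dtype=float)
    d = np.ones(n); d[n1:] = -1.0                  # sign flips after column n1
    A = K.astype(float).copy()
    for j in range(n):
        L[j, j] = np.sqrt(d[j] * A[j, j])          # requires d[j] * A[j, j] > 0
        L[j+1:, j] = A[j+1:, j] / (d[j] * L[j, j])
        # rank-1 update of the trailing submatrix, as in ordinary Cholesky
        A[j+1:, j+1:] -= d[j] * np.outer(L[j+1:, j], L[j+1:, j])
    return L, np.diag(d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n1, n2 = 5, 3
    G = rng.standard_normal((n1, n1)); Q = G @ G.T + n1 * np.eye(n1)  # SPD (1,1) block
    A = rng.standard_normal((n2, n1))                                  # full-rank (2,1) block
    K = np.block([[Q, A.T], [A, np.zeros((n2, n2))]])                  # saddle-point matrix
    L, D = signed_cholesky(K, n1)
    print(np.allclose(L @ D @ L.T, K))             # True
```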
Distributing the 1st-stage Schur complement matrix
– All processes contribute to all of the elements of the (1,1) dense block.
– A large amount of inter-process communication occurs – possibly more costly than the factorization itself.
– Solution: use a buffer to reduce the number of messages when doing a Reduce_scatter.
– The symmetric (LDL^T-like) approach also reduces the communication by half – only the lower triangle needs to be sent.
Reduce operations
– Streamlined copying procedure – Lubin and Petra (2010):
  – loop over contiguous memory and copy elements into the send buffer;
  – avoids the division and modulus operations needed to compute the positions.
– “Symmetric” reduce for the Schur complement matrix:
  – only the lower triangle is reduced;
  – fixed buffer size, a variable number of columns reduced per message;
  – effectively halves the communication (both data and number of MPI calls).
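A hypothetical mpi4py sketch of the lower-triangle packing idea. It uses a plain Reduce to rank 0 for simplicity, whereas the slide describes a buffered Reduce_scatter; the buffer size, matrix size, and data are made up:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 64                                       # toy first-stage dimension
BUF_LEN = 1024                               # fixed buffer size (made up)
assert n <= BUF_LEN                          # each column must fit in the buffer

S_local = np.full((n, n), float(rank + 1))   # stand-in local Schur contribution
C_lower = np.zeros((n, n)) if rank == 0 else None

col = 0
while col < n:
    # pack as many whole lower-triangle columns as fit in the fixed buffer
    send, cols, used = [], [], 0
    while col < n and used + (n - col) <= BUF_LEN:
        send.append(S_local[col:, col])      # column col, from the diagonal down
        cols.append(col)
        used += n - col
        col += 1
    sendbuf = np.ascontiguousarray(np.concatenate(send))
    recvbuf = np.empty_like(sendbuf) if rank == 0 else None
    comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0)
    if rank == 0:                            # unpack the reduced columns
        off = 0
        for c in cols:
            C_lower[c:, c] = recvbuf[off:off + n - c]
            off += n - c
```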
Large-scale performance
First-stage linear algebra: ScaLAPACK (LU), Elemental (LU), and the Cholesky-based LDL^T-like factorization.
Strong scaling of PIPS:
– 90.1% efficiency from 64 to 1024 cores
– 75.4% efficiency from 64 to 2048 cores
SAA problem with > 4,000 scenarios:
– 1st-stage variables: 82,000
– Total number of variables: 189 million
– Thermal units: 1,000
– Wind farms: 1,200
Towards real-life models – UC with transmission constraints
Current status: ISOs (independent system operators) use UC with
– deterministic wind profiles, market prices, and demand;
– network (transmission) constraints;
– outer 1-h timestep, 24-h horizon simulation;
– inner 5-min timestep, 1-h horizon corrections.
Stochastic UC with transmission constraints (V. Zavala, 2010):
– stochastic wind profiles & transmission constraints;
– deterministic market prices and demand;
– 24-h horizon with 1-h timestep;
– Kirchhoff’s laws are part of the constraints;
– the problem is huge: KKT systems are 1.8 billion x 1.8 billion.
[Figure: network with generator and load nodes]
Solving UC with transmission constraints
– 32k wind scenarios (k = 1024).
– 32k nodes (131,072 cores) on the Intrepid BG/P.
– Hybrid programming model: SMP inside MPI.
  – Sparse 2nd-stage linear algebra: WSMP (IBM).
  – Dense 1st-stage linear algebra: Elemental in SMP mode.
– Time to solution: 20h = 4h for loading + 16h for solving.
– Flop count is approx. 1% of peak.
– Very good strong scaling for a 4-h horizon problem.
Real-time solution
Exploiting the structure:
– Time coupling in the second stage -> block tridiagonal linear systems; sparse backsolves must be used.
– A speedup of > 20x is obtained in the backsolve phase.
– Preliminary tests for the 24-h horizon: time to solution < 1h (after loading).
– The flop count does not necessarily increase.
– Strong scaling does not change by much for the 24-h horizon problem.
[Figures: the structure of the KKT system for a general two-stage SP problem; the second-stage KKT systems have a block tridiagonal structure given by the time coupling]
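A small NumPy sketch of the block tridiagonal solve being exploited (block Thomas forward elimination and back substitution); block sizes and data are toy stand-ins, and the real code works with sparse factorizations rather than explicit inverses:

```python
import numpy as np

def block_tridiag_solve(D, L, U, r):
    """Solve a block tridiagonal system with diagonal blocks D[0..T-1],
    sub-diagonal blocks L[0..T-2], super-diagonal blocks U[0..T-2] and
    right-hand-side blocks r[0..T-1]."""
    T = len(D)
    Dh, rh = [D[0].copy()], [r[0].copy()]
    # forward elimination of the sub-diagonal blocks
    for t in range(1, T):
        W = L[t - 1] @ np.linalg.inv(Dh[t - 1])   # factorized solves in real code
        Dh.append(D[t] - W @ U[t - 1])
        rh.append(r[t] - W @ rh[t - 1])
    # back substitution
    x = [None] * T
    x[T - 1] = np.linalg.solve(Dh[T - 1], rh[T - 1])
    for t in range(T - 2, -1, -1):
        x[t] = np.linalg.solve(Dh[t], rh[t] - U[t] @ x[t + 1])
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, nb = 4, 3                                  # toy horizon and block size
    D = [4 * np.eye(nb) + 0.1 * rng.standard_normal((nb, nb)) for _ in range(T)]
    L = [0.1 * rng.standard_normal((nb, nb)) for _ in range(T - 1)]
    U = [0.1 * rng.standard_normal((nb, nb)) for _ in range(T - 1)]
    r = [rng.standard_normal(nb) for _ in range(T)]
    x = block_tridiag_solve(D, L, U, r)
    print(np.allclose(D[0] @ x[0] + U[0] @ x[1], r[0]))   # True
```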
Concluding remarks
– PIPS: a parallel interior-point solver for stochastic SAA problems.
– Specialized linear algebra layer:
  – small-sized 1st stage: DSC;
  – medium-sized 1st stage: PSC;
  – large-sized 1st stage: distributed Schur complement.
– Scenario parallelization in a hybrid MPI+SMP programming model.
Thank you for your attention! Questions?