1 Stochastic Optimization of Complex Energy Systems on High-Performance Computers
Cosmin G. Petra
Mathematics and Computer Science Division, Argonne National Laboratory
SIAM CSE 2013
Joint work with Olaf Schenk (USI Lugano), Miles Lubin (MIT), Klaus Gaertner (WIAS Berlin)

2 Outline
 Application of HPC to power-grid optimization under uncertainty
 Parallel interior-point solver (PIPS-IPM)
–Structure exploiting
 Revisiting linear algebra
 Experiments on BG/P with the new features

3 Stochastic unit commitment with wind power
 Wind forecast: WRF (Weather Research and Forecasting) model
–Real-time grid-nested 24h simulation
–30 samples require 1h on 500 CPUs (Jazz@Argonne)
Slide courtesy of V. Zavala & E. Constantinescu
Figure labels: Wind farm, Thermal generator

4 Stochastic Formulation
 Discrete distribution leads to block-angular LP

5 Large-scale (dual) block-angular LPs
Extensive form
In the terminology of stochastic LPs:
First-stage variables (decision now): x_0
Second-stage variables (recourse decision): x_1, …, x_N
Each diagonal block is a realization of a random variable (scenario)
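
The extensive form named on this slide can be written out as follows. This is a sketch in commonly used two-stage notation: $x_0$, $x_1,\dots,x_N$, $N$, and $W_i$ appear on the slides, while $c_i$, $b_i$, $A$, and $T_i$ are assumed names, and scenario probabilities are taken to be folded into the costs $c_i$.

```latex
\begin{aligned}
\min_{x_0,\dots,x_N}\quad & c_0^T x_0 + \sum_{i=1}^{N} c_i^T x_i \\
\text{subj. to}\quad & A x_0 = b_0, \\
& T_i x_0 + W_i x_i = b_i, \quad i = 1,\dots,N, \\
& x_0 \ge 0,\quad x_i \ge 0, \quad i = 1,\dots,N.
\end{aligned}
\qquad
\text{constraint matrix: }
\begin{bmatrix}
A      &     &        &     \\
T_1    & W_1 &        &     \\
\vdots &     & \ddots &     \\
T_N    &     &        & W_N
\end{bmatrix}
```

The $W_i$ are the diagonal blocks, one per scenario, and the first block column couples every scenario to the first-stage variables, which gives the dual block-angular shape.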

6 Computational challenges and difficulties
 May require many scenarios (100s, 1,000s, 10,000s, …) to accurately model uncertainty
 “Large” scenarios (W_i up to 250,000 x 250,000)
 “Large” 1st stage (1,000s, 10,000s of variables)
 Easy to build a practical instance that requires 100+ GB of RAM to solve
 Requires distributed memory
 Real-time solution needed in our applications

7 Linear algebra of primal-dual interior-point methods (IPM)
Convex quadratic problem (minimize subject to constraints) and its IPM linear system
Two-stage SP: arrow-shaped linear system (modulo a permutation)
Multi-stage SP: nested arrow-shaped linear system
N is the number of scenarios
2 solves per IPM iteration
- predictor directions
- corrector directions
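
A standard formulation consistent with the labels on this slide is sketched below. The symbols $Q$, $c$, $A$, $b$, the right-hand sides $r_1$, $r_2$, and the diagonal matrix $\Lambda$ are assumptions for illustration, and sign conventions differ between references.

```latex
\min_{x}\ \tfrac{1}{2}\, x^T Q x + c^T x
\quad \text{subj. to} \quad A x = b,\ x \ge 0
\qquad\Longrightarrow\qquad
\begin{bmatrix} Q + \Lambda & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x \\ -\Delta y \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \end{bmatrix}
```

Here $\Lambda$ is a positive diagonal matrix that changes at every IPM iteration. The predictor and corrector directions share the same matrix and differ only in the right-hand side, which is why each iteration needs two solves.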

8 Special Structure of KKT System (Arrow-shaped)
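
One common way to write the arrow-shaped KKT matrix is shown below. The block names $K_i$ (scenario blocks), $B_i$ (border blocks coupling scenario $i$ to the first stage), and $K_0$ (first-stage block) are assumed for illustration; $N$ is the number of scenarios.

```latex
K =
\begin{bmatrix}
K_1   &        &       & B_1    \\
      & \ddots &       & \vdots \\
      &        & K_N   & B_N    \\
B_1^T & \cdots & B_N^T & K_0
\end{bmatrix}
```

Only the last block row and column touch more than one scenario, so eliminating the $K_i$ blocks can proceed independently per scenario.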

9 Parallel Solution Procedure for KKT System
 Steps 1 and 5 are trivially parallel
–“Scenario-based decomposition”
 Steps 1, 2, and 3 account for >95% of total execution time.
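
A minimal dense sketch of a five-step Schur complement solve is given below. The step numbering is inferred from how the later slides refer to Steps 1 to 5, and the names `K_blocks`, `B_blocks`, `K0`, and `solve_arrow_system` are hypothetical. It runs serially with NumPy dense solves, whereas the slides describe sparse factorizations distributed over MPI processes.

```python
import numpy as np


def solve_arrow_system(K_blocks, B_blocks, K0, r_blocks, r0):
    """Solve an arrow-shaped system by Schur complement elimination.

    K_blocks[i] : (n_i, n_i) nonsingular scenario block (assumed name K_i)
    B_blocks[i] : (n_i, n_0) border block coupling scenario i to the first stage
    K0          : (n_0, n_0) first-stage block
    r_blocks[i], r0 : matching pieces of the right-hand side
    """
    scenarios = list(zip(K_blocks, B_blocks, r_blocks))

    # Step 1 (independent per scenario): contribution B_i^T K_i^{-1} B_i.
    contributions = [B.T @ np.linalg.solve(K, B) for K, B, _ in scenarios]

    # Step 2: accumulate the global Schur complement C (a reduction in parallel).
    C = K0 - sum(contributions)

    # Steps 3 and 4: factorize C and solve for the first-stage unknowns.
    reduced_rhs = r0 - sum(B.T @ np.linalg.solve(K, r) for K, B, r in scenarios)
    z0 = np.linalg.solve(C, reduced_rhs)

    # Step 5 (independent per scenario): recover the second-stage unknowns.
    z_blocks = [np.linalg.solve(K, r - B @ z0) for K, B, r in scenarios]
    return z_blocks, z0
```

For brevity the sketch calls `np.linalg.solve` repeatedly. A production solver would factorize each scenario block once and reuse the factors in Steps 1 and 5.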

10 Components of Execution Time
 Notice the break in the y-axis scale

11 Scenario Calculations – Steps 1 and 5
 Each scenario is assigned to an MPI process, which locally performs steps 1 and 5.
 Matrices are sparse and symmetric indefinite (symmetric with positive and negative eigenvalues).
 Computing each scenario's Schur complement contribution is very expensive: it requires solving with the factors of the scenario matrix against the non-zero columns of the block that couples the scenario to the first stage, and then multiplying from the left by the transpose of that block.
 4 hours 10 minutes wall time to solve a 4h-horizon problem with 8k scenarios on 8k nodes.
 Need to run under strict time requirements
–For example, solve a 24h-horizon problem in less than 1h

12 Revisiting scenario computations for shared-memory
 Multiple sparse right-hand sides
 Triangular solves phase is hard to parallelize in shared memory (multi-core)
 Factorization phase speeds up very well and achieves a considerable fraction of peak performance
 Our approach: incomplete factorization of an augmented (2x2 block) scenario matrix
 Stop the factorization after the elimination of the (1,1) block
 The scenario's Schur complement contribution will sit in the (2,2) block (Schur complement)
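
The idea of stopping an elimination early can be illustrated with a small dense sketch. It assumes the augmented matrix has the form [[K, B], [B^T, 0]], where `K` is a scenario block and `B` its coupling block (both names are assumptions), and it uses plain unpivoted Gaussian elimination. A sparse solver such as PARDISO-SC would use pivoting and sparse data structures instead; the function name is hypothetical.

```python
import numpy as np


def schur_by_partial_elimination(K, B):
    """Eliminate only the (1,1) block of [[K, B], [B^T, 0]].

    Returns the trailing (2,2) block after the first n pivots, which
    equals -B^T K^{-1} B when all n pivots are non-zero.
    """
    n, m = B.shape
    M = np.block([[K, B], [B.T, np.zeros((m, m))]]).astype(float)
    for k in range(n):  # stop after the n pivots of the (1,1) block
        pivot = M[k, k]
        # Right-looking rank-1 update of the trailing submatrix.
        M[k + 1:, k + 1:] -= np.outer(M[k + 1:, k], M[k, k + 1:]) / pivot
    return M[n:, n:]
```

The Schur complement contribution is obtained as a by-product of the factorization itself, so no triangular solves against the many sparse columns of `B` are needed. Without pivoting the sketch is only safe for matrices such as symmetric positive definite `K`.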

13 Implementation
 Requires modification of the linear solver
 PARDISO (Schenk) -> PARDISO-SC
 Pivot perturbations during factorization are needed to maintain numerical stability
 Errors due to perturbations are normally absorbed by iterative refinement
 This would be extremely expensive in our case (many right-hand sides)
 We let the errors propagate into the “global” Schur complement C (Step 2)
 Factorize the perturbed C (Step 3)
 After Steps 1, 2, and 3, we have the factorization of an approximation of the KKT matrix

14 Pivot error absorption by preconditioned BiCGStab
 We still have to solve with the exact KKT matrix K
 “Absorb errors” by solving Kz=r using preconditioned BiCGStab
–Numerical experiments showed it is more robust than iterative refinement.
 Preconditioner is the factorized approximation of K
 Each BiCGStab iteration requires
–2 mat-vecs: Kz
–2 applications of the preconditioner
 One application of the preconditioner amounts to performing “solve” steps 4 and 5 with the approximate factorization
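
A minimal sketch of preconditioned BiCGStab for Kz=r follows, written with NumPy only. The callables `matvec` and `precond` stand in for the mat-vec with K and for the solve with the perturbed factorization; the function name, the choice of the preconditioned right-hand side as the initial guess, and the counting of half iterations as 0.5 are assumptions of this sketch, not details taken from PIPS-IPM.

```python
import numpy as np


def preconditioned_bicgstab(matvec, precond, r, tol=1e-10, maxiter=50):
    """Solve K z = r, where matvec(v) returns K v and precond(v)
    applies an approximate inverse of K.

    Returns (z, iterations), counting a half iteration as 0.5.
    """
    z = precond(r)  # initial guess from the approximate solve
    res = r - matvec(z)
    target = tol * np.linalg.norm(r)
    if np.linalg.norm(res) <= target:
        return z, 0.0
    shadow = res.copy()
    rho = alpha = omega = 1.0
    v = np.zeros_like(res)
    p = np.zeros_like(res)
    for it in range(1, maxiter + 1):
        rho_new = shadow @ res
        beta = (rho_new / rho) * (alpha / omega)
        p = res + beta * (p - omega * v)
        y = precond(p)  # preconditioner application 1
        v = matvec(y)   # mat-vec 1
        alpha = rho_new / (shadow @ v)
        s = res - alpha * v
        z = z + alpha * y
        if np.linalg.norm(s) <= target:
            return z, it - 0.5
        q = precond(s)  # preconditioner application 2
        t = matvec(q)   # mat-vec 2
        omega = (t @ s) / (t @ t)
        z = z + omega * q
        res = s - omega * t
        if np.linalg.norm(res) <= target:
            return z, float(it)
        rho = rho_new
    return z, float(maxiter)
```

As a usage example, `precond` could be `lambda v: np.linalg.solve(K_tilde, v)` with `K_tilde` a slightly perturbed copy of `K`. When the perturbation is small, the initial guess is already close and very few iterations are needed.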

15 Summary of the new approach

16 Test architecture
 “Intrepid” Blue Gene/P supercomputer
–40,960 nodes
–Custom interconnect
–Each node has a quad-core 850 MHz PowerPC processor and 2 GB RAM
 DOE INCITE Award 2012-2013: 24 million core hours

17 Numerical experiments
 4h (UC4), 12h (UC12), 24h (UC24) horizon problems
 1 scenario per node (4 cores per scenario)
 Large-scale: 12h horizon, up to 32k scenarios and 128k cores (k=1,024)
–16k scenarios: 2.08 billion variables, 1.81 billion constraints, KKT system size = 3.89 billion
 LAPACK + SMP ESSL BLAS for first-stage linear systems
 PARDISO-SC for second-stage linear systems
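
The figures on this slide can be cross-checked with a few lines of arithmetic. The variable names are illustrative, and the reading of the KKT size as variables plus constraints is an interpretation that happens to match the reported numbers.

```python
K = 1024  # the slide's convention: k = 1,024

# Largest run: one scenario per node, four cores per node.
scenarios = 32 * K
cores = scenarios * 4
cores_in_k = cores // K  # expressed in units of k

# 16k-scenario instance, sizes in billions: an augmented KKT system
# has one row per variable and one per constraint.
variables_bn = 2.08
constraints_bn = 1.81
kkt_size_bn = round(variables_bn + constraints_bn, 2)
```

With these inputs `cores_in_k` evaluates to 128 and `kkt_size_bn` to 3.89, in line with the 128k cores and 3.89 billion quoted above.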

18 Compute SC Times

19 Time per IPM iteration
 UC12, 32k scenarios, 32k nodes (128k cores)
 BiCGStab iteration count ranges from 0 to 1.5
 Cost of absorbing factorization perturbation errors is between 10% and 30% of total iteration cost

20 Solve to completion – UC12
Nodes/scens   Wall time (sec)   IPM iterations   Time per IPM iteration (sec)
 4096         3548.5            103              33.57
 8192         3883.7            112              34.67
16384         4208.8            123              34.80
32768         4781.7            133              35.95
 Before: 4 hours 10 minutes wall time to solve UC4 problem with 8k scenarios on 8k nodes
 Now: UC12
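
Because the number of scenarios grows with the number of nodes, the reported times per IPM iteration can be read as a weak-scaling measurement. The sketch below computes efficiency relative to the 4096-node run from the numbers on this slide; the efficiency definition (baseline time divided by time at scale) and the function name are choices made for this illustration.

```python
# (nodes, wall_time_sec, ipm_iterations, time_per_iteration_sec) as reported
rows = [
    (4096, 3548.5, 103, 33.57),
    (8192, 3883.7, 112, 34.67),
    (16384, 4208.8, 123, 34.80),
    (32768, 4781.7, 133, 35.95),
]


def weak_scaling_efficiency(rows):
    """Map node count to (baseline time per iteration) / (time per iteration)."""
    baseline = rows[0][3]
    return {nodes: baseline / t_iter for nodes, _, _, t_iter in rows}


efficiency = weak_scaling_efficiency(rows)
```

On this reading, the time per iteration at 32768 nodes corresponds to an efficiency of about 93% relative to 4096 nodes, while most of the growth in wall time comes from the higher IPM iteration count.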

21 Weak scaling

22 Strong scaling

23 Conclusions and Future Considerations
 Multicore-friendly reformulation of sparse linear algebra computations leads to one order of magnitude faster execution times.
 Fast factorization-based computation of SC
 Robust and cheap pivot error absorption via Krylov iterative methods
 Parallel efficiency of PIPS remains good.
 Performance evaluation on today’s supercomputers
–IBM BG/Q
–Cray XK7, XC30

