Computational Intelligence Winter Term 2012/13 Prof. Dr. Günter Rudolph Lehrstuhl für Algorithm Engineering (LS 11) Fakultät für Informatik TU Dortmund.

Evolutionary Algorithms: State of the art in 1970

Main arguments against EAs in $\mathbb{R}^n$:
1. Evolutionary algorithms have been developed heuristically.
2. No proofs of convergence have been derived for them.
3. Sometimes the rate of convergence can be very slow.

What can be done? ⇒ Disable the arguments!

ad 1) Not really an argument against EAs … EAs purposely use principles of biological evolution as a pool of inspiration
- to overcome traditional lines of thought,
- to obtain new classes of optimization algorithms.
⇒ The new ideas may be bad or good … ⇒ hence the necessity to analyze them!

On the notion of convergence (I)

Empirical convergence vs. stochastic convergence.

Frequent observation: N runs on some test problem, averaging, comparison ⇒ this proves nothing!
- no guarantee that the behavior is stable in the limit,
- N lucky runs are possible,
- etc.

On the notion of convergence (II)

A formal approach is necessary. Notation: $P(t)$ = population at time step $t \ge 0$, $f_b(P(t)) = \min\{ f(x) : x \in P(t) \}$.

$D_k = |f(X_k) - f^*| \ge 0$ is a random variable; we shall consider the stochastic sequence $D_0, D_1, D_2, \ldots$

Does the stochastic sequence $(D_k)_{k \ge 0}$ converge to 0? If so, then evidently we have convergence to the optimum! But there are many modes of stochastic convergence; therefore only the most frequently used ones are covered here.

On the notion of convergence (III)

Definition: Let $D_t = |f_b(P(t)) - f^*| \ge 0$. We say: the EA
(a) converges completely to the optimum, if $\forall \varepsilon > 0: \sum_{t=0}^{\infty} P\{ D_t > \varepsilon \} < \infty$,
(b) converges almost surely (a.s.) or with probability 1 (w.p. 1) to the optimum, if $P\{ \lim_{t \to \infty} D_t = 0 \} = 1$,
(c) converges in probability to the optimum, if $\forall \varepsilon > 0: P\{ D_t > \varepsilon \} \to 0$ as $t \to \infty$,
(d) converges in mean to the optimum, if $E[D_t] \to 0$ as $t \to \infty$.

Relationships between modes of convergence

Lemma:
- (a) ⇒ (b) ⇒ (c), and (d) ⇒ (c).
- If $\exists K < \infty: \forall t \ge 0: D_t \le K$, then (c) ⇔ (d).
- If $(D_t)_{t \ge 0}$ is a stochastically independent sequence, then (a) ⇔ (b).

Typical modus operandi:
1. Show convergence in probability (c). Easy (in most cases)!
2. Show that the convergence is fast enough (a). This also implies (b).
3. Is the sequence bounded from above? Then (c) also implies (d).

Examples (I)

Let $(X_k)_{k \ge 1}$ be a sequence of independent random variables with the distribution given on the slide.
1. ⇒ convergence in probability (c).
2. ⇒ convergence too slow! Consequently, no complete convergence.
3. Note: the sequence is bounded with $K = 1$; since we have convergence in probability (c) and boundedness, ⇒ convergence in mean (d).
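The distribution itself is not reproduced in the transcript. The following Monte-Carlo sketch uses the classic textbook example $P\{X_k = 1\} = 1/k$, $P\{X_k = 0\} = 1 - 1/k$ as an assumed stand-in (not necessarily the distribution on the slide), because it shows exactly the three listed properties: convergence in probability, no complete convergence (the probabilities sum to infinity), and convergence in mean with bound $K = 1$.

```python
import numpy as np

# Assumed illustrative distribution (NOT necessarily the one on the slide):
# P{X_k = 1} = 1/k, P{X_k = 0} = 1 - 1/k, independent over k.
rng = np.random.default_rng(0)
K = 10_000
k = np.arange(1, K + 1)

# (c) convergence in probability: P{X_k > eps} = 1/k -> 0
print("P{X_k > 0.5} at k = 10, 100, 1000:", 1 / np.array([10, 100, 1000]))

# (a) complete convergence fails: sum_k P{X_k > eps} = sum_k 1/k diverges
print("partial sums of 1/k at k = 100, 1000, 10000:", np.cumsum(1 / k)[[99, 999, 9999]])

# (d) convergence in mean: E[X_k] = 1/k -> 0, and the sequence is bounded by K = 1
x = (rng.random(K) < 1 / k).astype(float)   # one sampled trajectory
print("sample mean of X_k for k in [5000, 10000]:", x[5000:].mean())
```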

Examples (II)

Let $(X_k)_{k \ge 1}$ be a sequence of independent random variables. The slide lists six example distributions (not reproduced here) and marks for each whether complete convergence (a), convergence in probability (c), and convergence in mean (d) hold:

distribution 1: (a) –, (c) +, (d) +
distribution 2: (a) +, (c) +, (d) +
distribution 3: (a) –, (c) +, (d) –
distribution 4: (a) +, (c) +, (d) +
distribution 5: (a) –, (c) +, (d) –
distribution 6: (a) +, (c) +, (d) –

ad 2) "No convergence proofs!"

Timeline of theoretical work on convergence:
1971 – 1975  Rechenberg / Schwefel: convergence rates for simple problems
1976 – 1980  Born: convergence proof for EAs with genetic load
1981 – 1985  Rappl: convergence proof for the (1+1)-EA in $\mathbb{R}^n$
1986 – 1989  Beyer: convergence rates for simple problems

All publications in German and for EAs in $\mathbb{R}^n$ ⇒ results only known to German-speaking EA nerds!

ad 2) "No convergence proofs!" (continued)

Timeline of theoretical work on convergence:
1989  Eiben: a.s. convergence for the elitist GA
1992  Nix / Vose: Markov chain model of the simple GA
1993  Fogel: a.s. convergence of EP (Markov chain based)
1994  Rudolph: a.s. convergence of the elitist GA, non-convergence of the simple GA (MC based)
1994  Rudolph: a.s. convergence of non-elitist ES (based on supermartingales)
1996  Rudolph: conditions for convergence

⇒ convergence proofs are no longer an issue!

A simple proof of convergence (I)

Theorem: Let $D_k = |f(X_k) - f^*|$ with $k \ge 0$ be generated by a (1+1)-EA, let $S^* = \{ x^* \in S : f(x^*) = f^* \}$ be the set of optimal solutions, and let $P_m(x, S^*)$ be the probability of getting from $x \in S$ to $S^*$ by a single mutation operation. If there is a $\delta > 0$ such that $P_m(x, S^*) \ge \delta$ for each $x \in S \setminus S^*$, then $D_k \to 0$ completely and in mean.

Remark: The proofs have become simpler and simpler: Born's proof (1978) took about 10 pages, Eiben's proof (1989) about 2 pages, Rudolph's proof (1996) takes about 1 slide …

A simple proof of convergence (II)

Proof: For the (1+1)-EA, $P(x, S^*) = 1$ holds for $x \in S^*$ due to elitist selection. Thus it is sufficient to show that the EA reaches $S^*$ with probability 1:
- success in the 1st iteration: $\ge \delta$, where $\delta \le P_m(x, S^*)$;
- no success in the 1st iteration: $\le 1 - \delta$;
- no success in the first $k$ iterations: $\le (1 - \delta)^k$;
- at least one success within $k$ iterations: $\ge 1 - (1 - \delta)^k \to 1$ as $k \to \infty$.

Since $P\{ D_k > \varepsilon \} \le (1 - \delta)^k \to 0$ we have convergence in probability, and since $\sum_{k \ge 0} (1 - \delta)^k = 1/\delta < \infty$ we actually have complete convergence. Moreover, $\forall k \ge 0: 0 \le D_k \le D_0 < \infty$, which together with (c) implies convergence in mean. ∎
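A minimal sketch of a (1+1)-EA matching the setting of the proof: elitist selection never loses the best point, and the mutation operator reaches the optimal set from any point with positive probability. The problem (OneMax) and the bit-flip mutation are illustrative choices, not taken from the slides.

```python
import random

def one_plus_one_ea(n=30, max_iters=10_000, seed=1):
    """(1+1)-EA on OneMax: maximize the number of ones (f* = n)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = sum(x)
    for k in range(max_iters):
        # mutation: flip each bit independently with probability 1/n,
        # so P_m(x, S*) > 0 for every x, as required by the theorem
        y = [b ^ (rng.random() < 1 / n) for b in x]
        fy = sum(y)
        if fy >= fx:            # elitist selection: never accept a worse point
            x, fx = y, fy
        if fx == n:             # optimum reached; elitism keeps D_k = 0 from now on
            return k
    return max_iters

print("iterations until optimum:", one_plus_one_ea())
```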

ad 3) Speed of convergence

Observation: Sometimes EAs have been very slow …
Questions: Why is this the case? Can we do something about it?
⇒ no speculation; instead: formal analysis!

First hint in Schwefel's master's thesis (1965): he observed that step size adaptation in $\mathbb{R}^2$ is useful!

ad 3) Speed of convergence

Convergence speed without step size adaptation (pure random search):

$f(x) = \|x\|^2 = x'x \to \min!$ where $x \in S_n(r) = \{ x \in \mathbb{R}^n : \|x\| \le r \}$

$Z_k$ is uniformly distributed in $S_n(r)$; $X_{k+1} = Z_k$ if $f(Z_k) < f(X_k)$, else $X_{k+1} = X_k$.

$V_k = \min\{ f(Z_1), f(Z_2), \ldots, f(Z_k) \}$ = best objective function value until iteration $k$.

$P\{ \|Z\| \le z \} = P\{ Z \in S_n(z) \} = \mathrm{Vol}(S_n(z)) / \mathrm{Vol}(S_n(r)) = (z/r)^n$ for $0 \le z \le r$
$P\{ \|Z\|^2 \le z \} = P\{ \|Z\| \le z^{1/2} \} = z^{n/2} / r^n$ for $0 \le z \le r^2$
$P\{ V_k \le v \} = 1 - (1 - P\{ \|Z\|^2 \le v \})^k = 1 - (1 - v^{n/2}/r^n)^k$
$E[V_k] \approx r^2\, \Gamma(1 + 2/n)\, k^{-2/n}$ for large $k$

⇒ without adaptation: $D_k = \Theta(k^{-2/n})$
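A small simulation sketch of pure random search on the sphere function, illustrating the predicted decay $E[V_k] \approx r^2\,\Gamma(1+2/n)\,k^{-2/n}$; the dimension, number of runs, and iteration counts are arbitrary illustrative choices.

```python
import numpy as np
from math import gamma

def best_so_far_sphere(n=5, r=1.0, iters=10_000, runs=200, seed=0):
    """Pure random search on f(x) = ||x||^2 over the ball S_n(r); returns averaged V_k."""
    rng = np.random.default_rng(seed)
    vals = np.empty((runs, iters))
    for run in range(runs):
        # For Z uniform in the ball, ||Z|| is distributed as r * U^(1/n) with U ~ Uniform(0,1);
        # since f depends only on ||Z||, sampling the radius suffices.
        radii = r * rng.random(iters) ** (1.0 / n)
        vals[run] = np.minimum.accumulate(radii ** 2)   # V_k = min of f(Z_1), ..., f(Z_k)
    return vals.mean(axis=0)

n, r = 5, 1.0
v = best_so_far_sphere(n, r)
for k in (100, 1000, 10_000):
    pred = r**2 * gamma(1 + 2.0 / n) * k ** (-2.0 / n)   # E[V_k] ≈ r^2 Γ(1+2/n) k^(-2/n)
    print(f"k={k:6d}  simulated ≈ {v[k-1]:.4f}   theory ≈ {pred:.4f}")
```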

ad 3) Speed of convergence

Convergence speed without step size adaptation (locally uniform mutations):

$f(x) = \|x\|^2 = x'x \to \min!$ where $x \in S_n(r) = \{ x \in \mathbb{R}^n : \|x\| \le r \}$

$Z_k$ uniformly distributed in $[-r, r]$, $n = 1$; $X_{k+1} = X_k + Z_k$ if $f(X_k + Z_k) < f(X_k)$, else $X_{k+1} = X_k$.

From some point on this resembles pure random search! ⇒ without adaptation: $D_k = \Theta(k^{-2/n})$

ad 3) Speed of convergence

Convergence speed with step size adaptation (mutation directions uniformly distributed on the unit sphere $S_n(1)$, i.e. $\|U_k\| = 1$):

(1,λ)-EA with $f(x) = \|x\|^2$:

$\|Y_k\|^2 = \|X_k + r_k U_k\|^2 = (X_k + r_k U_k)'(X_k + r_k U_k) = X_k'X_k + 2 r_k X_k'U_k + r_k^2\, U_k'U_k = \|X_k\|^2 + 2 r_k X_k'U_k + r_k^2$, since $\|U_k\|^2 = 1$.

Note: the random scalar product $x'U$ has the same distribution as $\|x\| \cdot B$, where the r.v. $B$ is beta-distributed with parameters $(n-1)/2$ on $[-1, 1]$.

It follows that $\|Y_k\|^2 = \|X_k\|^2 + 2 r_k \|X_k\| B + r_k^2$. Since the (1,λ)-EA selects the best value out of λ trials in total, we obtain $\|X_{k+1}\|^2 = \|X_k\|^2 + 2 r_k \|X_k\| B_{1:\lambda} + r_k^2$.

ad 3) Speed of convergence

$\|X_{k+1}\|^2 = \|X_k\|^2 + 2 r_k \|X_k\| B_{1:\lambda} + r_k^2$

Taking the conditional expectation on both sides:
$E[\|X_{k+1}\|^2 \mid X_k] = \|X_k\|^2 + 2 r_k \|X_k\|\, E[B_{1:\lambda}] + r_k^2$

Assume $r_k = \sigma \|X_k\|$:
$E[\|X_{k+1}\|^2 \mid X_k] = \|X_k\|^2 + 2 \sigma \|X_k\|^2\, E[B_{1:\lambda}] + \sigma^2 \|X_k\|^2$

Symmetry of $B$ implies $E[B_{1:\lambda}] = -E[B_{\lambda:\lambda}] < 0$, hence
$E[\|X_{k+1}\|^2 \mid X_k] = \|X_k\|^2 - 2 \sigma \|X_k\|^2\, E[B_{\lambda:\lambda}] + \sigma^2 \|X_k\|^2 = \|X_k\|^2\, (1 - 2 \sigma E[B_{\lambda:\lambda}] + \sigma^2)$

The right-hand side is minimal at $\sigma^* = E[B_{\lambda:\lambda}]$, thus
$E[\|X_{k+1}\|^2 \mid X_k] = \|X_k\|^2\, (1 - E[B_{\lambda:\lambda}]^2)$

⇒ with adaptation: $D_k \sim O(c^k)$ with $c \in (0, 1)$
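A small simulation sketch of the quantity $E[B_{\lambda:\lambda}]$ appearing above: $B$ is the first coordinate of a uniformly distributed unit vector, and the per-step contraction factor $1 - E[B_{\lambda:\lambda}]^2$ can be estimated by Monte Carlo. The dimension, λ, and sample size are arbitrary example values.

```python
import numpy as np

def estimate_contraction(n=10, lam=10, samples=50_000, seed=6):
    """Estimate E[B_{lambda:lambda}] and the factor 1 - E[B]^2 for the (1,lambda)-EA."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((samples, lam, n))
    u /= np.linalg.norm(u, axis=2, keepdims=True)     # uniform directions on the unit sphere
    b = u[..., 0]                                     # B = first coordinate, symmetric around 0
    b_max = b.max(axis=1)                             # B_{lambda:lambda} = best of lambda trials
    e = b_max.mean()
    return e, 1 - e**2

e, factor = estimate_contraction()
print(f"E[B_max] ≈ {e:.3f}, expected per-step factor for E||X||^2 ≈ {factor:.3f}")
```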

ad 3) Speed of convergence

Problem in practice: how do we get $\|X_k\|$ for $r_k = \|X_k\| \cdot E[B_{\lambda:\lambda}]$?

We know from the analysis: $E[\|X_{k+1}\|^2] = \|X_k\|^2\, (1 - E[B_{\lambda:\lambda}]^2)$.

Assume $r_k$ was optimally adjusted ⇒ $r_{k+1} = \|X_{k+1}\|\, E[B_{\lambda:\lambda}] \approx \|X_k\|\, (1 - E[B_{\lambda:\lambda}]^2)^{1/2}\, E[B_{\lambda:\lambda}]$. Since $E[B_{\lambda:\lambda}]$ is constant, ⇒ multiply $r_k$ by a constant: $r_{k+1} = c \cdot r_k$.

But: how do we get $r_0$ or $\|X_0\|$? (Plot on the slide: progress curves for $r_0$ too large, $r_0$ almost optimal, $r_0$ too small.)

ad 3) Speed of convergence

(1+1)-EA with step-size adaptation (1/5 success rule, Rechenberg 1973)

Idea: If there are many successful mutations, the step size is too small. If there are few successful mutations, the step size is too large.

Approach: count the successful mutations in a certain time interval; if the fraction is larger than some threshold (e.g. 1/5), then increase the step size by a factor > 1, else decrease it by a factor < 1. (For an infinitesimally small radius the success rate is 1/2.)
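A minimal sketch of a (1+1)-ES with the 1/5 success rule on the sphere function; the observation window and the increase/decrease factor are common illustrative choices, not values prescribed by the slides.

```python
import numpy as np

def one_plus_one_es(n=10, sigma0=1.0, iters=2000, window=20, seed=2):
    """(1+1)-ES with 1/5 success rule on f(x) = ||x||^2 (illustrative constants)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    fx = x @ x
    sigma, successes = sigma0, 0
    for k in range(1, iters + 1):
        y = x + sigma * rng.standard_normal(n)      # isotropic Gaussian mutation
        fy = y @ y
        if fy < fx:                                  # successful mutation
            x, fx = y, fy
            successes += 1
        if k % window == 0:                          # adapt after each observation window
            if successes / window > 1 / 5:
                sigma *= 1.22                        # many successes -> step size too small
            else:
                sigma /= 1.22                        # few successes -> step size too large
            successes = 0
    return fx, sigma

print("final f and step size:", one_plus_one_es())
```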

ad 3) Speed of convergence

Empirically known since 1973: step size adaptation increases the convergence speed dramatically!
Around 1993, EP adopted multiplicative step size adaptation (it had been additive).
But: no proof of convergence!

1999  Rudolph: no a.s. convergence for all continuous functions
2003  Jägersküpper: shows a.s. convergence for convex problems and linear convergence speed

⇒ same order of local convergence speed as the gradient method!

Towards CMA-ES

Mutation: $Y = X + Z$ with $Z \sim N(0, C)$, the multinormal distribution: the maximum entropy distribution on support $\mathbb{R}^n$ for a given expectation vector and covariance matrix.

How should we choose the covariance matrix $C$? As long as we have not learned anything about the problem during the search ⇒ don't prefer any direction! ⇒ covariance matrix $C = I_n$ (unit matrix).

(Figures on the slide: isolines of the mutation distribution for $C = I_n$, $C = \mathrm{diag}(s_1, \ldots, s_n)$, and $C$ orthogonal.)
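A short sketch of one standard way to sample such a mutation in practice: draw a standard normal vector and transform it with a Cholesky factor of $C$; the concrete covariance matrix here is an arbitrary hand-picked positive definite example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])          # some symmetric positive definite covariance
L = np.linalg.cholesky(C)                # C = L L'
x = np.zeros(n)                          # current parent
z = L @ rng.standard_normal(n)           # Z ~ N(0, C)
y = x + z                                # mutation: Y = X + Z
print("offspring:", y)
print("empirical covariance of many Z:\n",
      np.cov(L @ rng.standard_normal((n, 100_000)), bias=True).round(2))
```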

Towards CMA-ES

Claim: mutations should be aligned to the isolines of the problem (Schwefel 1981). If true, then the covariance matrix should be the inverse of the Hessian matrix!

⇒ assume $f(x) \approx \tfrac{1}{2} x'Ax + b'x + c$ ⇒ $H = A$

$Z \sim N(0, C)$ with density $f_Z(z) = \dfrac{\exp(-\tfrac{1}{2} z' C^{-1} z)}{(2\pi)^{n/2} (\det C)^{1/2}}$

Since then, many proposals for how to adapt the covariance matrix. ⇒ Extreme case: use at least $(n+1)(n+2)/2$ pairs $(x, f(x))$, apply multiple linear regression to obtain estimators for $A$, $b$, $c$, and invert the estimated matrix $A$! OK, but: $O(n^6)$! (Rudolph 1992)

Towards CMA-ES

Doubts: are isoline-aligned mutations really optimal?

Most (effective) algorithms behave like this: they run roughly in the negative gradient direction; sooner or later they approach the longest main principal axis of the Hessian; now the negative gradient direction coincides with the direction to the optimum, which is parallel to the longest main principal axis of the Hessian, which is parallel to the longest main principal axis of the inverse covariance matrix (so Schwefel's choice is OK in this situation).

⇒ The principal axis should point in the negative gradient direction! (proof on the next slide)

Towards CMA-ES

$Z = r\,Qu$, $A = B'B$, $B = Q^{-1}$

If $Qu$ were deterministic … ⇒ set $Qu = -\nabla f(x)$ (direction of steepest descent).

Towards CMA-ES

Apart from (inefficient) regression, how can we get the matrix elements of $Q$?

⇒ iteratively: $C^{(k+1)} = \mathrm{update}(C^{(k)}, \mathrm{Population}^{(k)})$

Basic constraint: $C^{(k)}$ must be symmetric and positive definite (p.d.) for all $k \ge 0$, otherwise the Cholesky decomposition $C = Q'Q$ is impossible.

Lemma: Let $A$ and $B$ be square matrices and $\alpha, \beta > 0$.
a) $A$, $B$ symmetric ⇒ $\alpha A + \beta B$ symmetric.
b) $A$ positive definite and $B$ positive semidefinite ⇒ $\alpha A + \beta B$ positive definite.

Proof:
ad a) $C = \alpha A + \beta B$ is symmetric, since $c_{ij} = \alpha a_{ij} + \beta b_{ij} = \alpha a_{ji} + \beta b_{ji} = c_{ji}$.
ad b) $\forall x \in \mathbb{R}^n \setminus \{0\}$: $x'(\alpha A + \beta B) x = \alpha\, x'Ax + \beta\, x'Bx > 0$, since $\alpha\, x'Ax > 0$ and $\beta\, x'Bx \ge 0$. ∎

Towards CMA-ES

Theorem: A square matrix $C^{(k)}$ is symmetric and positive definite for all $k \ge 0$ if it is built via the iterative formula
$C^{(k+1)} = \alpha_k C^{(k)} + \beta_k v_k v_k'$
where $C^{(0)} = I_n$, $v_k \neq 0$, $\liminf \alpha_k > 0$ and $\liminf \beta_k > 0$.

Proof: If $v \neq 0$, then the matrix $V = vv'$ is symmetric and positive semidefinite, since by definition of the dyadic product $v_{ij} = v_i \cdot v_j = v_j \cdot v_i = v_{ji}$ for all $i, j$, and for all $x \in \mathbb{R}^n$: $x'(vv')x = (x'v)\cdot(v'x) = (x'v)^2 \ge 0$. Thus each matrix $v_k v_k'$ is symmetric and p.s.d. for $k \ge 0$. Owing to the previous lemma, $C^{(k+1)}$ is symmetric and p.d. if $C^{(k)}$ is symmetric and p.d. and $v_k v_k'$ is symmetric and p.s.d. Since $C^{(0)} = I_n$ is symmetric and p.d., it follows that $C^{(1)}$ is symmetric and p.d. Repeating this argument yields the statement of the theorem. ∎
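A quick numerical check of the theorem: starting from the identity, repeated updates $\alpha_k C^{(k)} + \beta_k v_k v_k'$ with positive weights keep the matrix symmetric and positive definite (the Cholesky factorization never fails). The concrete $\alpha_k$, $\beta_k$, $v_k$ below are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
C = np.eye(n)                                    # C^(0) = I_n, symmetric and p.d.
for k in range(1000):
    v = rng.standard_normal(n)                   # v_k != 0 (almost surely)
    alpha, beta = 0.9, 0.1                       # liminf alpha_k > 0, liminf beta_k > 0
    C = alpha * C + beta * np.outer(v, v)        # C^(k+1) = alpha_k C^(k) + beta_k v_k v_k'
    np.linalg.cholesky(C)                        # raises LinAlgError if C were not p.d.
print("still symmetric:", np.allclose(C, C.T),
      " min eigenvalue:", np.linalg.eigvalsh(C).min())
```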

CMA-ES

Covariance Matrix Adaptation Evolutionary Algorithm (CMA-EA)

Idea: Don't estimate the matrix $C$ in each iteration! Instead, approximate it iteratively! (Hansen, Ostermeier et al. 1996ff.)

Set the initial covariance matrix to $C^{(0)} = I_n$ and update

$C^{(t+1)} = (1 - \eta)\, C^{(t)} + \eta \sum_{i=1}^{\mu} w_i\, d_i d_i'$

with learning rate $\eta \in (0, 1)$, $d_i = (x_{i:\lambda} - m)/\sigma$, sorting $f(x_{1:\lambda}) \le f(x_{2:\lambda}) \le \ldots \le f(x_{\lambda:\lambda})$, and $m$ = mean of all $\mu$ selected parents. Owing to the dyadic products $d_i d_i'$, the weighted sum is a positive semidefinite dispersion matrix.

Complexity: $O(\mu n^2 + n^3)$.
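A sketch of one generation of the rank-μ covariance update as reconstructed above; the learning rate, weights, and population sizes are illustrative values, and this is deliberately not a full CMA-ES implementation.

```python
import numpy as np

def rank_mu_update(C, m, sigma, f, lam=12, mu=6, eta=0.2, rng=None):
    """One generation: sample lambda offspring, select mu best, update C (rank-mu)."""
    rng = rng or np.random.default_rng(5)
    n = len(m)
    L = np.linalg.cholesky(C)
    X = m + sigma * (rng.standard_normal((lam, n)) @ L.T)     # x_i ~ N(m, sigma^2 C)
    idx = np.argsort([f(x) for x in X])[:mu]                  # sorting by objective value
    w = np.full(mu, 1.0 / mu)                                 # equal weights summing to 1
    D = (X[idx] - m) / sigma                                  # d_i = (x_{i:lambda} - m) / sigma
    C_new = (1 - eta) * C + eta * sum(wi * np.outer(d, d) for wi, d in zip(w, D))
    m_new = X[idx].mean(axis=0)                               # mean of the selected parents
    return C_new, m_new

C, m = np.eye(3), np.array([2.0, -1.0, 0.5])
C, m = rank_mu_update(C, m, sigma=0.3, f=lambda x: x @ x)
print(np.round(C, 3), m)
```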

CMA-ES

Variant: evolution path (rank-one update), with $m$ = mean of all selected parents, $p^{(0)} = 0$, $C^{(0)} = I_n$:

$p^{(t+1)} = (1 - c)\, p^{(t)} + (c\,(2 - c)\, \mu_{\mathrm{eff}})^{1/2}\, (m^{(t)} - m^{(t-1)}) / \sigma^{(t)}$ with $c \in (0, 1)$

$C^{(t+1)} = (1 - \eta)\, C^{(t)} + \eta\, p^{(t+1)} (p^{(t+1)})'$

Complexity: $O(n^2)$; Cholesky decomposition: $O(n^3)$ for $C^{(t)}$.
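A sketch of the evolution-path (rank-one) update as reconstructed above; the smoothing constant $c$, learning rate $\eta$, and $\mu_{\mathrm{eff}}$ are illustrative values.

```python
import numpy as np

def evolution_path_update(p, C, m_new, m_old, sigma, c=0.3, eta=0.2, mu_eff=3.0):
    """Rank-one covariance update driven by the evolution path p."""
    # p^(t+1) = (1-c) p^(t) + sqrt(c (2-c) mu_eff) * (m^(t) - m^(t-1)) / sigma^(t)
    p_new = (1 - c) * p + np.sqrt(c * (2 - c) * mu_eff) * (m_new - m_old) / sigma
    # C^(t+1) = (1-eta) C^(t) + eta * p^(t+1) p^(t+1)'
    C_new = (1 - eta) * C + eta * np.outer(p_new, p_new)
    return p_new, C_new

n = 3
p, C = np.zeros(n), np.eye(n)                      # p^(0) = 0, C^(0) = I_n
m_old, m_new = np.array([2.0, 0.0, 1.0]), np.array([1.8, -0.1, 0.9])
p, C = evolution_path_update(p, C, m_new, m_old, sigma=0.5)
print(np.round(C, 3))
```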

CMA-ES

State of the art: CMA-EA (currently many variants), with successful applications in practice.

Available on the WWW: (EAlib, C++), … and further CMA-ES implementations in C, C++, Java, Fortran, Python, Matlab, R, Scilab.