Topologically Adaptive Stochastic Search I.E. Lagaris & C. Voglis Department of Computer Science University of Ioannina - GREECE IOANNINA ATHENS THESSALONIKI

Global Optimization
- The goal is to find the global minimum (or minima) of an objective f inside a bounded domain S ⊂ R^n: x* = arg min_{x ∈ S} f(x).
- One way to do this is to find all the local minima and choose the global one (or ones) among them.
- A popular method of this kind is the so-called “Multistart”.

Local Optimization
- Let a point x ∈ S.
- Starting from x, a local search procedure L reaches a minimum y.
- This may be denoted as y = L(x).
- Multistart repeatedly applies a local optimization procedure to sampled points.

Regions of Attraction
- For a local search procedure L, the region of attraction of the minimum y_i is defined by A_i = { x ∈ S : L(x) = y_i }.
- Observe the dependence on L.
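A minimal sketch of these two notions in Python, assuming a SciPy Nelder-Mead search plays the role of L; the names local_search, region_label and tol are illustrative, not from the slides:

```python
import numpy as np
from scipy.optimize import minimize

def local_search(f, x0):
    """The local procedure L: returns the minimizer reached from x0."""
    res = minimize(f, x0, method="Nelder-Mead")
    return res.x

def region_label(f, x, minima, tol=1e-3):
    """Index i such that x falls in the region of attraction A_i, else None."""
    y = local_search(f, x)
    for i, yi in enumerate(minima):
        if np.linalg.norm(y - yi) < tol:
            return i
    return None  # L(x) reached a minimum not yet recorded
```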

“IDEAL” MultiStart (IMS)
- This is a version in which every local minimum is found only once.
- It assumes that, from the position of a minimum, its region of attraction may be determined directly.
- Since this assumption is false, IMS is of no practical value.
- However, it offers a framework and a target.

Ideal MultiStart (IMS)
1. Initialize: Set k = 1, sample x ∈ S, and set y_k = L(x).
2. Terminate if a stopping rule applies.
3. Sample x ∈ S.
4. Main Step: If x ∉ ∪_{i=1..k} A_i, then set k ← k+1 and y_k = L(x).
5. Iterate: Go back to step 2.
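A compact sketch of the IMS loop under the (unrealistic) assumption that a predicate in_known_region can test membership in the union of the known regions of attraction directly; all names are illustrative:

```python
def ideal_multistart(f, sample, in_known_region, local_search, n_iter=1000):
    x = sample()                                  # step 1: initialize
    minima = [local_search(f, x)]
    for _ in range(n_iter):                       # step 2: stopping rule (here a fixed budget)
        x = sample()                              # step 3: sample a point in S
        if not in_known_region(x, minima):        # step 4: main step
            minima.append(local_search(f, x))     # each minimum is found exactly once
    return minima
```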

Making IMS practical
- Since the regions of attraction of the minima discovered so far are not known, it is not possible to determine whether a point belongs to their union.
- However, a probability may be estimated, based on several assumptions.
- Hence, a stochastic modification may render IMS useful.

Stochastic modification of the main step
4. Main Step: Estimate the probability p that x ∉ ∪_{i=1..k} A_i, and apply a local search with probability p.
   If the local search is applied and reaches a new minimum y ∉ {y_1, …, y_k}, then set k ← k+1 and y_k = y.
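A sketch of the stochastic main step; estimate_probability, local_search, rng and the tolerance are hypothetical helpers and parameters, not part of the original slides:

```python
import numpy as np

def stochastic_main_step(f, x, minima, estimate_probability, local_search,
                         rng, tol=1e-3):
    p = estimate_probability(x, minima)       # probability that x is outside all known A_i
    if rng.random() < p:                      # start a local search with probability p
        y = local_search(f, x)
        if all(np.linalg.norm(y - yi) >= tol for yi in minima):
            minima.append(y)                  # a genuinely new minimum: k <- k+1
            return True
    return False
```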

The probability estimation
- An overestimated probability (p → 1) increases the computational cost and drives the algorithm towards standard Multistart.
- An underestimated probability will cause an iteration delay without significant computational cost (only sampling, no local search).

Probability model
- If a sample point is close to an already known minimizer, the probability that it does not belong to its region of attraction is small, and zero in the limit of complete coincidence.
- From the above it follows that p_i(x) → 0 as x → y_i.

Probability model
- Let z_i = ‖x − y_i‖.
- If z_i > R_i, R_i being a radius such that A_i is contained in the sphere (y_i, R_i), then certainly x ∉ A_i.
- Hence p_i(x) = 1 for z_i > R_i.

Probability model
- p_i(z_i) = a_i (z_i / r_i)^2 for z_i ≤ r_i,
- p_i(z_i) = P_3(z_i) for r_i < z_i < R_i,
- p_i(z_i) = 1 for z_i ≥ R_i,
where z_i = ‖x − y_i‖ and P_3(z) is a cubic polynomial chosen so that both the probability and its derivative are continuous.
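A sketch of this piecewise model; the cubic bridge P_3 is built here as a Hermite interpolant, which is one possible choice satisfying the stated continuity requirements, not necessarily the one used by the authors:

```python
import numpy as np

def prob_model(z, a, r, R):
    """Probability that a point at distance z from minimum y_i is NOT in A_i."""
    if z <= r:
        return a * (z / r) ** 2                  # zero at z = 0, value a at z = r
    if z >= R:
        return 1.0                               # certainly outside A_i
    # Cubic Hermite bridge: value a and slope 2a/r at z = r, value 1 and slope 0 at z = R.
    t = (z - r) / (R - r)
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    return h00 * a + h10 * (R - r) * (2 * a / r) + h01 * 1.0
```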

Defining the model parameters
- There are three parameters to specify for each z_i, namely a, r and R.
- All of them depend on the associated minimum y_i and the iteration count k, i.e. a = a_i(k), r = r_i(k) and R = R_i(k).

Interpreting the model parameters
- r_i is the distance below which the probability descends quadratically; it depends on the size of the “valley”.
- As the algorithm proceeds, y_i may be discovered repeatedly. Every time it is rediscovered, r_i is increased in order to adapt to the local geometry.

Interpreting the model parameters
- a_i is the probability at z_i = r_i.
- As y_i keeps being rediscovered, a_i should be decreased to render a future rediscovery less probable.
- Accordingly, if l_i is the number of times y_i has been discovered so far, a_i is set to a decreasing function of l_i.

Choosing the model parameters
- r_i is increased at every rediscovery and is safeguarded by a bound involving η, the machine precision.
- R_i is updated every time a local search rediscovers y_i.
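The transcript does not show the actual update formulas, so the bookkeeping below uses placeholder rules that merely reproduce the stated behaviour (r_i grows on rediscovery with a machine-precision safeguard, a_i decreases with the rediscovery count l_i, R_i tracks the sphere containing A_i); all constants are assumptions:

```python
import numpy as np

class MinimumRecord:
    """Bookkeeping for one discovered minimizer y_i (illustrative parameter rules)."""
    def __init__(self, y, x_start, eta=np.finfo(float).eps):
        self.y = np.asarray(y, dtype=float)
        d = np.linalg.norm(np.asarray(x_start, dtype=float) - self.y)
        self.l = 1                             # times y_i has been discovered so far
        self.a = 0.5                           # probability at z = r (assumed starting value)
        self.r = max(0.5 * d, np.sqrt(eta))    # "valley" radius (assumed initialization)
        self.R = max(d, np.sqrt(eta))          # radius of a sphere assumed to contain A_i
        self.eta = eta

    def rediscovered(self, x_start):
        """Adapt a_i, r_i and R_i when a local search from x_start reaches y_i again."""
        self.l += 1
        d = np.linalg.norm(np.asarray(x_start, dtype=float) - self.y)
        self.R = max(self.R, d)                      # keep A_i inside the sphere (y_i, R_i)
        self.r = max(min(1.2 * self.r, 0.9 * self.R),
                     np.sqrt(self.eta))              # grow r_i, safeguarded by machine precision
        self.a = 0.5 * self.a                        # render a future rediscovery less probable
```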

Gradient Information
- In the case where d = y_i − x is a descent direction at x, the probability is reduced by a factor p_g ∈ [0, 1].
- p_g is zero when d is parallel to the negative gradient −∇f(x), and one when it is perpendicular to it.
- This factor is a function of the angle between d and −∇f(x), and it is used only when z_i ∈ [0.7 r_i, 0.9 r_i].

Ascending Gradient Rule
- If the direction d = y_i − x is not a descent direction at x, i.e. if ∇f(x)ᵀ d ≥ 0, this signals that x is not “attracted” towards y_i, i.e. it does not fall inside its region of attraction.
- In this case the probability p_i(x) is set to one.
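A sketch combining the two gradient rules above; since the exact formula for p_g is not shown, the squared cosine of the angle between d and the gradient is used as an illustrative choice (p_g = 0 when d is parallel to −∇f(x), 1 when perpendicular):

```python
import numpy as np

def gradient_adjusted_prob(p, x, y_i, grad, z, r):
    """Adjust the probability p using gradient information at x."""
    d = y_i - x
    g = grad(x)
    slope = float(np.dot(g, d))
    if slope >= 0.0:                       # ascending: x is not attracted towards y_i
        return 1.0
    if 0.7 * r <= z <= 0.9 * r:            # the factor is used only in this distance band
        cos2 = slope**2 / (np.dot(d, d) * np.dot(g, g) + 1e-300)
        p_g = 1.0 - cos2                   # 0 if d is parallel to -grad f, 1 if perpendicular
        return p * p_g
    return p
```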

Asymptotic guarantee
- The previous gradient rule, together with the probability model, guarantees that asymptotically all minima will be found with probability one.
- Hence the global minimum will surely be recovered asymptotically.

Probability
- Having estimated the probabilities p_i(x), we can ideally estimate p(x) as the product p(x) = ∏_{i=1..k} p_i(x).
- However, the product creates a problem, illustrated next.

The probability at x is reduced since it falls inside two spheres, centered at y_i and y_j. Note that x would lead to a new minimum, not discovered yet, and ideally its probability should have been high. This effect may be amplified in many dimensions.

Estimating the probability
- To circumvent this problem we consider the estimate p(x) = p_cn(x), where the index “cn” stands for Closest Neighbor.
- Namely, we take into account only the closest minimizer.
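A sketch of the closest-neighbour estimate, reusing the hypothetical prob_model and MinimumRecord helpers from the earlier sketches:

```python
import numpy as np

def estimate_probability(x, records):
    """p(x) = p_cn(x): only the nearest known minimizer contributes."""
    if not records:
        return 1.0                                     # nothing known yet: always search
    dists = [np.linalg.norm(x - rec.y) for rec in records]
    cn = int(np.argmin(dists))                         # index of the closest minimizer
    rec = records[cn]
    return prob_model(dists[cn], rec.a, rec.r, rec.R)
```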

Local nature of the probability
- The probability model is based on distances from the discovered minima.
- It is implicitly assumed that the closer a point is to a minimum, the greater the probability that it falls inside its region of attraction (RA).
- This is not true for all local search procedures L.

Local search properties
The local search dictates the shape of the regions of attraction, so it should be carefully chosen. Regions of attraction should contain the minimum and be contiguous. Ideally, they should resemble the ones produced by a descent method with an infinitesimal step.

Desired local search: Simplex, with a small initial opening

Undesired local search: BFGS with a strong Wolfe line search

Test functions: Ackley, Rastrigin, Griewangk, Shubert
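For concreteness, one of the test functions named above in its standard textbook form (not taken from the slides), the Rastrigin function:

```python
import numpy as np

def rastrigin(x):
    """Standard Rastrigin function: highly multimodal, global minimum 0 at the origin."""
    x = np.asarray(x, dtype=float)
    return 10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))

# e.g. rastrigin(np.zeros(2)) == 0.0
```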

Rotated Quadratics
- This test function is constructed so that its contours form non-convex domains.
- (C. Voglis, private communication)

Preliminary results

Parallel processing
- The process described uses a single sample point and performs a local search with some probability.
- If many points are sampled, multiple local searches may be performed in parallel, yielding a significant gain in performance.

Parallel processing gain
- Note, however, that the probability estimation will be based on data that are updated in batches.
- This update delay is significant only in the first few rounds.
- A further gain may be possible using a clustering technique before the local search is applied.

Clustering filter
- Sample M points.
- Estimate, for each point, the probability to start a local search (LS).
- Decide from which points an LS will start.
- Apply a clustering technique to these points and start an LS from only one point of each cluster.
- Send the selected points to the available processors, which will perform the LS.
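A sketch of this batch scheme; k-means is used here purely as an illustrative clustering technique, and estimate_probability and local_search are the hypothetical helpers sketched earlier:

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from concurrent.futures import ProcessPoolExecutor

def clustered_batch(f, sample, records, M=64, n_clusters=8, rng=None):
    """One round of the parallel variant with a clustering filter."""
    rng = rng or np.random.default_rng()
    points = np.array([sample() for _ in range(M)])              # sample M points
    probs = np.array([estimate_probability(x, records) for x in points])
    chosen = points[rng.random(M) < probs]                       # points that win the coin flip
    if len(chosen) == 0:
        return []
    k = min(n_clusters, len(chosen))
    _, labels = kmeans2(chosen, k, minit="points")               # cluster the chosen points
    starters = [chosen[labels == c][0] for c in set(labels)]     # one representative per cluster
    with ProcessPoolExecutor() as pool:                          # local searches run in parallel
        return list(pool.map(local_search, [f] * len(starters), starters))
```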