Optimization Indiana University July 12 2013 Geoffrey Fox

Similar presentations
Pattern Recognition and Machine Learning

Prediction Modeling for Personalization & Recommender Systems Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
Aggregating local image descriptors into compact codes
Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
Computer vision: models, learning and inference Chapter 13 Image preprocessing and feature extraction.
Dimensionality Reduction PCA -- SVD
Panel: New Opportunities in High Performance Data Analytics (HPDA) and High Performance Computing (HPC) The 2014 International Conference.
Collective Collaborative Tagging System Jong Y. Choi, Joshua Rosen, Siddharth Maini, Marlon E. Pierce, and Geoffrey C. Fox Community Grids Laboratory Indiana.
1 Machine Learning with Apache Hama Tommaso Teofili tommaso [at] apache [dot] org.
Linear Discriminant Functions
Visual Recognition Tutorial
Collaborative Filtering in iCAMP Max Welling Professor of Computer Science & Statistics.
Gradient Methods April Preview Background Steepest Descent Conjugate Gradient.
Principal Component Analysis
Motion Analysis (contd.) Slides are from RPI Registration Class.
CSci 6971: Image Registration Lecture 4: First Examples January 23, 2004 Prof. Chuck Stewart, RPI Dr. Luis Ibanez, Kitware Prof. Chuck Stewart, RPI Dr.
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Recommendations via Collaborative Filtering. Recommendations Relevant for movies, restaurants, hotels…. Recommendation Systems is a very hot topic in.
Face Recognition Based on 3D Shape Estimation
Clustering Color/Intensity
TFIDF-space  An obvious way to combine TF-IDF: the coordinate of document in axis is given by  General form of consists of three parts: Local weight.
Lecture 4 Unsupervised Learning Clustering & Dimensionality Reduction
The Terms that You Have to Know! Basis, Linear independent, Orthogonal Column space, Row space, Rank Linear combination Linear transformation Inner product.
Pattern Recognition. Introduction. Definitions.. Recognition process. Recognition process relates input signal to the stored concepts about the object.

Collaborative Filtering Matrix Factorization Approach
Neural Networks Lecture 8: Two simple learning algorithms
Chapter 12 (Section 12.4) : Recommender Systems Second edition of the book, coming soon.
Image Segmentation Image segmentation is the operation of partitioning an image into a collection of connected sets of pixels. 1. into regions, which usually.
The content of these slides by John Galeotti, © Carnegie Mellon University (CMU), was made possible in part by NIH NLM contract# HHSN P,
X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox Associate Dean for.
Remarks on Big Data Clustering (and its visualization) Big Data and Extreme-scale Computing (BDEC) Charleston SC May Geoffrey Fox
COMMON EVALUATION FINAL PROJECT Vira Oleksyuk ECE 8110: Introduction to machine Learning and Pattern Recognition.
CSC321: Neural Networks Lecture 12: Clustering Geoffrey Hinton.
EMIS 8381 – Spring Netflix and Your Next Movie Night Nonlinear Programming Ron Andrews EMIS 8381.
80 million tiny images: a large dataset for non-parametric object and scene recognition CS 4763 Multimedia Systems Spring 2008.
ECE 8443 – Pattern Recognition LECTURE 10: HETEROSCEDASTIC LINEAR DISCRIMINANT ANALYSIS AND INDEPENDENT COMPONENT ANALYSIS Objectives: Generalization of.
Computer Animation Rick Parent Computer Animation Algorithms and Techniques Optimization & Constraints Add mention of global techiques Add mention of calculus.
Multidimensional Scaling by Deterministic Annealing with Iterative Majorization Algorithm Seung-Hee Bae, Judy Qiu, and Geoffrey Fox SALSA group in Pervasive.
MACHINE LEARNING 8. Clustering. Motivation Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2  Classification problem:
CHAPTER 4, Part II Oliver Schulte Summer 2011 Local Search.
Data Modeling Patrice Koehl Department of Biological Sciences National University of Singapore
Chapter 13 (Prototype Methods and Nearest-Neighbors )
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 12: Advanced Discriminant Analysis Objectives:
Chapter 2-OPTIMIZATION G.Anuradha. Contents Derivative-based Optimization –Descent Methods –The Method of Steepest Descent –Classical Newton’s Method.
Collaborative Filtering via Euclidean Embedding M. Khoshneshin and W. Street Proc. of ACM RecSys, pp , 2010.
INTRO TO OPTIMIZATION MATH-415 Numerical Analysis 1.
Matrix Factorization & Singular Value Decomposition Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
Recommendation Systems By: Bryan Powell, Neil Kumar, Manjap Singh.
May 2003 SUT Color image segmentation – an innovative approach Amin Fazel May 2003 Sharif University of Technology Course Presentation base on a paper.
Machine Learning Supervised Learning Classification and Regression
Clustering (1) Clustering Similarity measure Hierarchical clustering
Optimization: Algorithms and Applications
LECTURE 11: Advanced Discriminant Analysis
CH 5: Multivariate Methods
Haim Kaplan and Uri Zwick
Clustering (3) Center-based algorithms Fuzzy k-means
Adopted from Bin UIC Recommender Systems Adopted from Bin UIC.
Latent Variables, Mixture Models and EM
Roberto Battiti, Mauro Brunato
Advanced Artificial Intelligence
Collaborative Filtering Matrix Factorization Approach
SMEM Algorithm for Mixture Models
Clustering 77B Recommender Systems
Basic Kmeans Clustering and Optimization
Indiana University July Geoffrey Fox
MCMC Inference over Latent Diffeomorphisms
Biointelligence Laboratory, Seoul National University
Research Institute for Future Media Computing
Recommendation Systems
Presentation transcript:

Optimization Indiana University July 12 2013 Geoffrey Fox

General Optimization Problem
- Have a function E that depends on up to billions of parameters
- Can always cast optimization as minimization
- Often E is guaranteed to be positive, as a sum of squares
- “Continuous parameters” – e.g. cluster centers – Expectation Maximization
- “Discrete parameters” – e.g. assignment problems – Genetic Algorithms
- Some methods just use function evaluations
- Safest methods – calculate first but not second derivatives (see the sketch after this list)
  - Expectation Maximization
  - Steepest Descent always gets stuck but always decreases E
  - One dimension – line searches – very easy
- Fastest methods – Newton’s method with second derivatives
  - Always diverges in the naïve version
  - For least squares, the second derivative of E only needs first derivatives of the components
- Constraints: use penalty functions
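To make the steepest descent and line search bullets concrete, here is a minimal sketch, assuming a differentiable E supplied as a Python function; the backtracking strategy, the names, and the toy quadratic are illustrative inventions, not from the slides:

```python
import numpy as np

def steepest_descent(E, grad_E, x0, max_steps=200, tol=1e-8):
    """Minimize E by stepping along -grad E, with a backtracking line search
    so that every accepted step decreases E (as the slide promises)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_steps):
        g = grad_E(x)
        if np.linalg.norm(g) < tol:      # gradient ~ 0: (local) minimum reached
            break
        step = 1.0
        while E(x - step * g) >= E(x) and step > 1e-12:
            step *= 0.5                  # shrink until E actually decreases
        x = x - step * g
    return x

# Toy sum-of-squares E with minimum at (1, 2)
E = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] - 2.0) ** 2
grad_E = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] - 2.0)])
print(steepest_descent(E, grad_E, [0.0, 0.0]))   # approximately [1, 2]
```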

Some Examples
- Clustering: optimize the positions of cluster centers to minimize the least squares sum of (particle position – center position)² (see the sketch after this list)
- Analysis of CReSIS snow layer data: optimize continuous layers to reproduce the observed radar data
- Dimension reduction (MDS): assign a 3D position to each point to minimize the sum of (original distance – Euclidean distance)² over each point pair
- Web search: find web pages that are “nearest your query”
- Recommender systems: find movies (Netflix), books (Amazon) etc. that you want, and that will lead you to spend as much money as possible on the site that originated the recommendation
- Deep learning: optimize the classification (learning of features) of faces etc.
- Distances in “funny spaces” are often important
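The clustering and MDS examples share the same shape: a positive sum-of-squares E. A sketch of both objectives in NumPy, under the slide's assumptions (a 3D embedding for MDS; the function names and data layout are mine):

```python
import numpy as np

def clustering_cost(points, centers, assignment):
    """Least squares sum over points of (particle position - center position)^2,
    where assignment[k] is the index of the center owning point k."""
    return sum(float(np.sum((p - centers[c]) ** 2))
               for p, c in zip(points, assignment))

def mds_stress(D, X):
    """MDS objective: sum over point pairs of
    (original distance D[i, j] - Euclidean distance between X[i] and X[j])^2,
    with X the candidate 3D positions."""
    n = len(X)
    return sum((D[i, j] - np.linalg.norm(X[i] - X[j])) ** 2
               for i in range(n) for j in range(i + 1, n))
```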

Features of Functions
- There are always local minima
  - Remove by annealing
  - Remove by trying lots of starting positions
- There are always parameters, or combinations of parameters, that E is essentially independent of
  - These correspond to small eigenvalues of the second derivative matrix
  - Note all eigenvalues are positive for a “decent” E
  - The steepest descent shift is proportional to the eigenvalue; the Newton shift is proportional to 1/eigenvalue
  - Remove by Levenberg-Marquardt or equivalent (see the sketch below)
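The Levenberg-Marquardt remark can be made concrete: for a least squares E = Σᵢ rᵢ², the Gauss-Newton matrix JᵀJ (built from first derivatives only, as the previous slide noted) inherits the small eigenvalues described above, and the damping term λI lifts them. The helper below, including its name and interface, is my own illustration:

```python
import numpy as np

def levenberg_marquardt_step(J, r, lam):
    """One Levenberg-Marquardt step for least squares E = sum_i r_i^2.
    J is the Jacobian of the residual vector r at the current parameters.
    lam = 0 gives the (Gauss-)Newton step, which can diverge when J.T @ J
    has small eigenvalues; large lam gives a short steepest-descent step."""
    JtJ = J.T @ J
    # Adding lam * I lifts the small eigenvalues so the solve is well behaved
    damped = JtJ + lam * np.eye(JtJ.shape[0])
    return np.linalg.solve(damped, -J.T @ r)
```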

AI Knowledge and all that

X-Informatics and Physical Computation Spring 2002

Mother Nature’s Approach to Optimizing Complex Systems

X-Informatics and Physical Computation Spring 2002

TEMPERATURE AVOIDS LOCAL MINIMA
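The slide gives only the idea; a minimal simulated annealing sketch (Metropolis acceptance, with parameter choices of my own) shows how temperature lets the search escape local minima:

```python
import math
import random

def anneal(E, x0, neighbor, T0=1.0, cooling=0.995, steps=10000):
    """Metropolis-style simulated annealing. Uphill moves (dE > 0) are
    accepted with probability exp(-dE / T), so at high temperature the
    search can climb out of local minima; cooling gradually freezes it."""
    x, T = x0, T0
    for _ in range(steps):
        y = neighbor(x)                  # propose a random nearby configuration
        dE = E(y) - E(x)
        if dE < 0 or random.random() < math.exp(-dE / T):
            x = y                        # accept: downhill always, uphill sometimes
        T *= cooling
    return x
```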

Distances in Funny Spaces I
- In user-based collaborative filtering, we can think of users in a space of dimension N, where there are N items and M users
  - Let i run over items and u over users
- Each user is then represented as a vector U_i(u) in “item-space” whose components are the ratings
- We are looking for users u, u’ that are near each other in this space, as measured by some distance between U_i(u) and U_i(u’)
- If u and u’ rate all items then these are “real” vectors, but almost always each rates only a small fraction of the items, and the number rated in common is even smaller
- The “Pearson coefficient” is just one distance measure that can be used (see the sketch after this list)
  - Only sum over items i rated by both u and u’
- Last.fm uses this for songs, as do Amazon and Netflix
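A sketch of the Pearson measure as the slide describes it, summing only over items rated by both users; representing each user as a dict of ratings, and converting correlation to a distance as 1 − correlation, are my own choices:

```python
import math

def pearson_distance(ratings_u, ratings_v):
    """Distance between two users, each given as a dict {item: rating}.
    As on the slide, the sums run only over items rated by both users."""
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 1.0                       # too little overlap to compare
    mu = sum(ratings_u[i] for i in common) / len(common)
    mv = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mu) * (ratings_v[i] - mv) for i in common)
    den = math.sqrt(sum((ratings_u[i] - mu) ** 2 for i in common)
                    * sum((ratings_v[i] - mv) ** 2 for i in common))
    if den == 0.0:
        return 1.0                       # constant ratings: correlation undefined
    return 1.0 - num / den               # correlation 1 -> distance 0

u = {"movie_a": 5, "movie_b": 3, "movie_c": 4}
v = {"movie_b": 2, "movie_c": 5, "movie_d": 1}
print(pearson_distance(u, v))            # sums only over movie_b and movie_c
```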

Distances in Funny Spaces II
- In item-based collaborative filtering, we can think of items in a space of dimension M, where there are N items and M users
  - Let i run over items and u over users
- Each item is then represented as a vector R_u(i) in “user-space” whose components are the ratings
- We are looking for items i, i’ that are near each other in this space, as measured by some distance between R_u(i) and R_u(i’)
- If i and i’ were rated by all users these would be “real” vectors, but almost always each is rated by only a small fraction of the users, and the number rating both is even smaller
- The “cosine measure” is just one distance measure that can be used (see the sketch after this list)
  - Only sum over users u rating both i and i’
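The item-based analogue, using the cosine measure summed only over users who rated both items (again, the dict representation is my choice):

```python
import math

def cosine_distance(ratings_i, ratings_j):
    """Distance between two items, each given as a dict {user: rating}.
    The sums run only over users who rated both items."""
    common = set(ratings_i) & set(ratings_j)
    if not common:
        return 1.0                       # no shared raters: maximally far
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    ni = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    nj = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    if ni == 0.0 or nj == 0.0:
        return 1.0
    return 1.0 - dot / (ni * nj)         # cosine 1 -> distance 0
```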

Distances in Funny Spaces III
- In content based recommender systems, we can think of items in a space of dimension M, where there are N items and M properties
  - Let i run over items and p over properties
- Each item is then represented as a vector P_p(i) in “property-space” whose components are the values of the properties
- We are looking for items i, i’ that are near each other in this space, as measured by some distance between P_p(i) and P_p(i’)
- Properties could be “reading age” or “character highlighted” or “author” for books
- Properties can be genre or artist for songs and video
- Properties can characterize pixel structure for images used in face recognition, driving etc.
- Pandora uses this for songs (the Music Genome), as do Amazon and Netflix
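A small illustration of a distance in property-space; the books, properties, values, and normalizing scales below are entirely hypothetical:

```python
import math

# Hypothetical property vectors for two books. Normalizing each property keeps
# large-valued ones (like page count) from dominating the distance.
book_a = {"reading_age": 12, "humor": 0.8, "length_pages": 320}
book_b = {"reading_age": 14, "humor": 0.6, "length_pages": 350}
scales = {"reading_age": 18.0, "humor": 1.0, "length_pages": 1000.0}

def property_distance(item_a, item_b):
    """Euclidean distance in normalized property-space over shared properties."""
    shared = set(item_a) & set(item_b)
    return math.sqrt(sum(((item_a[p] - item_b[p]) / scales[p]) ** 2
                         for p in shared))

print(property_distance(book_a, book_b))  # small value = similar books
```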

Do we need “real” spaces?
- Much of X-Informatics involves “points”
  - Events in LHC analysis
  - Users (people) or items (books, jobs, music, other people)
- These points can be thought of as being in a “space” or “bag”
  - Set of all books
  - Set of all physics reactions
  - Set of all Internet users
- However, as in recommender systems where a given user rates only some items, we don’t know the “full position”
- However, we can nearly always define a useful distance d(a,b) between points
  - Always: d(a,b) >= 0
  - Usually: d(a,b) = d(b,a)
  - Rarely: d(a,b) + d(b,c) >= d(a,c) (the triangle inequality)
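These three properties can be checked empirically for any candidate distance; a sketch (the helper name, tolerance, and sampling scheme are mine):

```python
import itertools
import random

def check_distance_properties(points, d, trials=1000):
    """Empirically test the three properties listed above for a distance d
    over a list of sample points (needs at least 3 points)."""
    pairs = list(itertools.combinations(points, 2))
    nonnegative = all(d(a, b) >= 0 for a, b in pairs)
    symmetric = all(abs(d(a, b) - d(b, a)) < 1e-9 for a, b in pairs)
    triangle = all(d(a, b) + d(b, c) >= d(a, c) - 1e-9
                   for a, b, c in (random.sample(points, 3)
                                   for _ in range(trials)))
    return nonnegative, symmetric, triangle
```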

Using Distances
- The simplest way to use distances is “nearest neighbor algorithms” – given one point, find a set of points near it (see the sketch after this list)
  - Cut off by the number of identified nearby points and/or the distance to the initial point
  - Here a point is either a user or an item
- Another approach is to divide the space into regions (topics, latent factors) consisting of nearby points
  - This is clustering
  - There are also other algorithms, like Gaussian mixture models, Latent Semantic Analysis, or Latent Dirichlet Allocation, which use a more sophisticated model
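A sketch of nearest neighbor selection with both cutoffs mentioned above (the function and its defaults are illustrative):

```python
def nearest_neighbors(target, candidates, d, k=10, max_dist=None):
    """Return up to k candidates nearest to target under distance d,
    optionally also cut off by a maximum distance to the initial point."""
    ranked = sorted(candidates, key=lambda c: d(target, c))
    if max_dist is not None:
        ranked = [c for c in ranked if d(target, c) <= max_dist]
    return ranked[:k]
```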

Tool Set
- There are not many optimization methods, and we can produce a tool-set that supports them
- They are all iterative
  - Expectation Maximization
  - Deterministic Annealing
  - Conjugate Gradient – (always?) the fastest equation solver? (see the sketch below)
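As one concrete entry in such a tool-set, here is a minimal conjugate gradient solver for A x = b with symmetric positive definite A; solving this system is the same as minimizing E(x) = ½ xᵀA x − bᵀx, which is why CG belongs in an optimization tool-set. This particular implementation and its test values are mine:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve A x = b for symmetric positive definite A; equivalently,
    iteratively minimize E(x) = 0.5 * x.T A x - b.T x. Each iteration
    needs only one matrix-vector product A @ p."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                        # residual = -grad E
    p = r.copy()
    rs = r @ r
    for _ in range(len(b)):              # exact in at most n steps (in theory)
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p        # new direction, conjugate to the old
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))          # approximately [0.0909, 0.6364]
```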