1
Machine Learning: A Brief Introduction Fu Chang Institute of Information Science Academia Sinica 2788-3799 ext. 1819 fchang@iis.sinica.edu.tw
2
Machine Learning as a Tool for Classifying Patterns What is the difference between you and me? Tentative answer 1: You are pretty, and I am ugly A vague answer, not very useful Tentative answer 2: You have a tiny mouth, and I have a big one A lot more useful, but what if we are viewed from the side? In general, can we use a single feature difference to distinguish one pattern from another?
3
Old Philosophical Debates What makes a cup a cup? Philosophical views Plato: the ideal type Aristotle: the collection of all cups Wittgenstein: family resemblance
4
Machine Learning Viewpoint Represent each object with a set of features: Mouth, nose, eyes, etc., viewed from the front, the right side, the left side, etc. Each pattern is taken as a conglomeration of sample points or feature vectors
5
Patterns as Conglomerations of Sample Points [Illustration: two types of sample points, labeled A and B]
6
ML Viewpoint (Cont'd) Training phase: Want to learn pattern differences among conglomerations of labeled samples Have to describe the differences by means of a model: probability distribution, prototype, neural network, etc. Have to estimate the parameters involved in the model Testing phase: Have to classify test samples at acceptable accuracy rates
7
Models Neural networks Support vector machines Classification and regression tree AdaBoost Statistical models Prototype classifiers
8
Neural Networks
9
Back-Propagation Neural Networks Layers: Input: number of nodes = dimension of the feature vector Output: number of nodes = number of class types Hidden: number of nodes > dimension of the feature vector Direction of data flow: Training: backward propagation Testing: forward propagation Training problems: Overfitting Convergence
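A minimal sketch of the layer sizing and training/testing split described above, using scikit-learn's MLPClassifier; the dataset, hidden-layer width, and stopping settings are illustrative assumptions, not part of the original slides.

# Hedged sketch: input width = feature dimension, one output node per class,
# hidden layer wider than the input, as the slide prescribes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)                      # 4 features, 3 class types
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(8,),           # 8 hidden nodes > 4 input nodes
                    max_iter=2000,                     # convergence can be slow
                    early_stopping=True,               # a common guard against overfitting
                    random_state=0)
net.fit(X_train, y_train)                              # training: errors propagate backward
print(net.score(X_test, y_test))                       # testing: a single forward pass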
10
Illustration
11
Support Vector Machines (SVM)
12
SVM Gives rise to an optimal solution to the binary classification problem: it finds a separating boundary (hyperplane) that maintains the largest margin between samples of the two class types Things to tune: Kernel function: defines the similarity measure between two sample vectors Tolerance for misclassification Parameters associated with the kernel function
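A minimal sketch of these three tuning knobs with scikit-learn's SVC; the synthetic dataset and parameter values are illustrative assumptions, not part of the original slides.

# Hedged sketch: the slide's tuning knobs map onto SVC's kernel,
# C (tolerance for misclassification), and gamma (a kernel parameter).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(kernel='rbf',    # kernel function: similarity measure between two vectors
          C=1.0,           # soft-margin penalty: tolerance for misclassification
          gamma='scale')   # parameter associated with the RBF kernel
clf.fit(X, y)
print(clf.score(X, y))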
13
Illustration
14
Classification and Regression Tree (CART)
15
Illustration
16
AdaBoost
17
Can be thought of as a linear combination of the same classifier c(·, ·) applied with varying weights The idea: Iteratively apply the same classifier c to a set of samples At iteration m, the samples erroneously classified at the (m−1)-st iteration are duplicated at a rate γ_m The weight β_m is related to γ_m in a certain way
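A minimal sketch of this reweighting loop, assuming scikit-learn decision stumps as the repeated classifier c and the usual exponential-loss update in place of the unspecified β_m/γ_m relation; it is not the original authors' formulation.

# Hedged sketch of discrete AdaBoost: labels y must be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with uniform sample weights
    stumps, betas = [], []
    for m in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # the same classifier c, refit each round
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)   # weighted error at round m
        if err == 0 or err >= 0.5:
            break
        beta = 0.5 * np.log((1 - err) / err)     # classifier weight beta_m
        w *= np.exp(-beta * y * pred)            # misclassified samples get boosted weight
        w /= w.sum()
        stumps.append(stump)
        betas.append(beta)
    return stumps, betas

def adaboost_predict(X, stumps, betas):
    # a linear combination of the same classifier with varying weights
    scores = sum(b * s.predict(X) for s, b in zip(stumps, betas))
    return np.sign(scores)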
18
Statistical Models
19
Bayesian Approach Given: Training samples X = {x_1, x_2, …, x_n} Probability density p(t | Θ), where t is an arbitrary vector (a test sample) and Θ is the set of parameters Θ is taken as a set of random variables
20
Bayesian Approach (Cont'd) Posterior density: Different class types give rise to different posteriors Use the posteriors to evaluate the class type of a given test sample t
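The posterior formula itself did not survive extraction; the standard Bayesian form it presumably showed, together with the resulting predictive density for a test sample t, is reconstructed below.

p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)},
\qquad
p(t \mid X) = \int p(t \mid \Theta)\, p(\Theta \mid X)\, d\Theta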
21
A Bayesian Model with Hidden Variables In addition to the observed data X, there exist some hidden data H H is taken as a set of random variables We want to optimize with both Θ and H unknown An iterative procedure (the EM algorithm) is required to do this
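In the standard formulation of the EM algorithm the slide alludes to (a reconstruction from the usual definition, not from the slide itself), iteration k alternates two steps:

\text{E-step:}\quad Q\!\left(\Theta \mid \Theta^{(k)}\right) = \mathbb{E}_{H \mid X,\, \Theta^{(k)}}\!\left[\log p(X, H \mid \Theta)\right]
\qquad
\text{M-step:}\quad \Theta^{(k+1)} = \arg\max_{\Theta}\, Q\!\left(\Theta \mid \Theta^{(k)}\right)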
22
Hidden Markov Model (HMM) HMM is a Bayesian model with hidden variables The observed data consist of sequences of samples The hidden variables are sequences of consecutive states
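For one observed sequence x_1, …, x_T with hidden state sequence s_1, …, s_T, the standard HMM factorization (taken from the usual definition, not from the slides) is:

p(x_1, \ldots, x_T,\, s_1, \ldots, s_T) = p(s_1)\, p(x_1 \mid s_1) \prod_{t=2}^{T} p(s_t \mid s_{t-1})\, p(x_t \mid s_t)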
23
Boltzmann-Gibbs Distribution Given: States s_1, s_2, …, s_n Density p(s) = p_s Maximum entropy principle: Without any information, one chooses the density p_s to maximize the entropy subject to the constraints
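The entropy and constraint formulas on this slide were lost in extraction; in the standard maximum-entropy setup with features f_i (the f_i that appear on a later slide), the problem is presumably:

\max_{p}\; H(p) = -\sum_{s} p_s \log p_s
\quad \text{subject to} \quad
\sum_{s} p_s = 1,
\qquad
\sum_{s} p_s\, f_i(s) = \tilde{E}[f_i] \ \text{for each feature } f_i,

where \tilde{E}[f_i] denotes the empirical expectation of f_i.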
24
Boltzmann-Gibbs (Cont'd) Consider the Lagrangian of this constrained problem Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density function, where Z is the normalizing factor
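The Lagrangian and the resulting density were also lost from the slide; the usual derivation under the setup above runs as follows.

L = -\sum_{s} p_s \log p_s + \lambda_0 \Big( \sum_{s} p_s - 1 \Big) + \sum_{i} \lambda_i \Big( \sum_{s} p_s\, f_i(s) - \tilde{E}[f_i] \Big)

\frac{\partial L}{\partial p_s} = -\log p_s - 1 + \lambda_0 + \sum_{i} \lambda_i f_i(s) = 0
\;\Longrightarrow\;
p_s = \frac{1}{Z} \exp\!\Big( \sum_{i} \lambda_i f_i(s) \Big),
\qquad
Z = \sum_{s} \exp\!\Big( \sum_{i} \lambda_i f_i(s) \Big)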
25
Boltzmann-Gibbs (Cont'd) Maximum entropy (ME) Use of Boltzmann-Gibbs as the prior distribution Compute the posterior for given observed data and features f_i Use the optimal posterior to classify
26
Boltzmann-Gibbs (Cont'd) Maximum entropy Markov model (MEMM) The posterior consists of transition probability densities p(s | s′, X) Conditional random field (CRF) The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)
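For a state sequence s_1, …, s_T, the standard forms of these two posteriors (reconstructed from the usual definitions of MEMM and linear-chain CRF, not from the slides) are:

\text{MEMM:}\quad p(s_1, \ldots, s_T \mid X) = \prod_{t} p(s_t \mid s_{t-1}, X)
\qquad
\text{CRF:}\quad p(s_1, \ldots, s_T \mid X) = \frac{1}{Z(X)} \exp\!\Big( \sum_{t} \sum_{i} \lambda_i f_i(s_{t-1}, s_t, X, t) \Big)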
27
References R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2001. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001. P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.