1
Machine Learning: A Brief Introduction Fu Chang Institute of Information Science Academia Sinica 2788-3799 ext. 1819 fchang@iis.sinica.edu.tw
2
Machine Learning as a Tool for Classifying Patterns What is the difference between you and me? Tentative answer 1: You are pretty, and I am ugly A vague answer, not very useful Tentative answer 2: You have a tiny mouth, and I have a big one A lot more useful, but what if we are viewed from the side? In general, can we use a single feature difference to distinguish one pattern from another?
3
Old Philosophical Debates What makes a cup a cup? Philosophical views Plato: the ideal type Aristotle: the collection of all cups Wittgenstein: family resemblance
4
Machine Learning Viewpoint Represent each object with a set of features: Mouth, nose, eyes, etc., viewed from the front, the right side, the left side, etc. Each pattern is taken as a conglomeration of sample points or feature vectors
5
Patterns as Conglomerations of Sample Points [Illustration: two types of sample points, labeled A and B]
6
ML Viewpoint (Cont'd) Training phase: Want to learn pattern differences among conglomerations of labeled samples Have to describe the differences by means of a model: probability distribution, prototype, neural network, etc. Have to estimate the parameters involved in the model Testing phase: Have to classify test samples at acceptable accuracy rates
7
Models Neural networks Support vector machines Classification and regression tree AdaBoost Statistical models Prototype classifiers
8
Neural Networks
9
Back-Propagation Neural Networks Layers: Input: number of nodes = dimension of the feature vector Output: number of nodes = number of class types Hidden: number of nodes > dimension of the feature vector Direction of data flow: Training: backward propagation Testing: forward propagation Training problems: Overfitting Convergence
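A minimal sketch of the layer sizing and training/testing split described above, using scikit-learn's MLPClassifier; the dataset, hidden-layer width, and stopping settings are illustrative assumptions, not part of the original slides.

# Hedged sketch: input width = feature dimension, one output node per class,
# hidden layer wider than the input, as the slide prescribes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)                      # 4 features, 3 class types
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(8,),           # 8 hidden nodes > 4 input nodes
                    max_iter=2000,                     # convergence can be slow
                    early_stopping=True,               # a common guard against overfitting
                    random_state=0)
net.fit(X_train, y_train)                              # training: errors propagate backward
print(net.score(X_test, y_test))                       # testing: a single forward pass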
10
Illustration
11
Support Vector Machines (SVM)
12
SVM Gives rise to an optimal solution to the binary classification problem: it finds a separating boundary (hyperplane) that maintains the largest margin between samples of the two class types Things to tune: Kernel function: defines the similarity measure between two sample vectors Tolerance for misclassification Parameters associated with the kernel function
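A minimal sketch of these three tuning knobs with scikit-learn's SVC; the synthetic dataset and parameter values are illustrative assumptions, not part of the original slides.

# Hedged sketch: the slide's tuning knobs map onto SVC's kernel,
# C (tolerance for misclassification), and gamma (a kernel parameter).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(kernel='rbf',    # kernel function: similarity measure between two vectors
          C=1.0,           # soft-margin penalty: tolerance for misclassification
          gamma='scale')   # parameter associated with the RBF kernel
clf.fit(X, y)
print(clf.score(X, y))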
13
Illustration
14
Classification and Regression Tree (CART)
15
Illustration
16
AdaBoost
17
Can be thought of as a linear combination of the same classifier c(·, ·) applied with varying weights The idea: Iteratively apply the same classifier c to a set of samples At iteration m, the samples erroneously classified at the (m−1)-st iteration are duplicated at a rate γ_m The weight β_m is related to γ_m in a certain way
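A minimal sketch of this reweighting loop, assuming scikit-learn decision stumps as the repeated classifier c and the usual exponential-loss update in place of the unspecified β_m/γ_m relation; it is not the original authors' formulation.

# Hedged sketch of discrete AdaBoost: labels y must be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with uniform sample weights
    stumps, betas = [], []
    for m in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # the same classifier c, refit each round
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)   # weighted error at round m
        if err == 0 or err >= 0.5:
            break
        beta = 0.5 * np.log((1 - err) / err)     # classifier weight beta_m
        w *= np.exp(-beta * y * pred)            # misclassified samples get boosted weight
        w /= w.sum()
        stumps.append(stump)
        betas.append(beta)
    return stumps, betas

def adaboost_predict(X, stumps, betas):
    # a linear combination of the same classifier with varying weights
    scores = sum(b * s.predict(X) for s, b in zip(stumps, betas))
    return np.sign(scores)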
18
Statistical Models
19
Bayesian Approach Given: Training samples X = {x_1, x_2, …, x_n} Probability density p(t | Θ), where t is an arbitrary vector (a test sample) and Θ is the set of parameters Θ is taken as a set of random variables
20
Bayesian Approach (Cont'd) Posterior density: Different class types give rise to different posteriors Use the posteriors to evaluate the class type of a given test sample t
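The posterior formula itself did not survive extraction; the standard Bayesian form it presumably showed, together with the resulting predictive density for a test sample t, is reconstructed below.

p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)},
\qquad
p(t \mid X) = \int p(t \mid \Theta)\, p(\Theta \mid X)\, d\Theta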
21
A Bayesian Model with Hidden Variables In addition to the observed data X, there exist some hidden data H H is taken as a set of random variables We want to optimize with both Θ and H unknown An iterative procedure (the EM algorithm) is required to do this
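In the standard formulation of the EM algorithm the slide alludes to (a reconstruction from the usual definition, not from the slide itself), iteration k alternates two steps:

\text{E-step:}\quad Q\!\left(\Theta \mid \Theta^{(k)}\right) = \mathbb{E}_{H \mid X,\, \Theta^{(k)}}\!\left[\log p(X, H \mid \Theta)\right]
\qquad
\text{M-step:}\quad \Theta^{(k+1)} = \arg\max_{\Theta}\, Q\!\left(\Theta \mid \Theta^{(k)}\right)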
22
Hidden Markov Model (HMM) HMM is a Bayesian model with hidden variables The observed data consist of sequences of samples The hidden variables are sequences of consecutive states
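For one observed sequence x_1, …, x_T with hidden state sequence s_1, …, s_T, the standard HMM factorization (taken from the usual definition, not from the slides) is:

p(x_1, \ldots, x_T,\, s_1, \ldots, s_T) = p(s_1)\, p(x_1 \mid s_1) \prod_{t=2}^{T} p(s_t \mid s_{t-1})\, p(x_t \mid s_t)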
23
Boltzmann-Gibbs Distribution Given: States s_1, s_2, …, s_n Density p(s) = p_s Maximum entropy principle: Without any information, one chooses the density p_s to maximize the entropy subject to the constraints
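The entropy and constraint formulas on this slide were lost in extraction; in the standard maximum-entropy setup with features f_i (the f_i that appear on a later slide), the problem is presumably:

\max_{p}\; H(p) = -\sum_{s} p_s \log p_s
\quad \text{subject to} \quad
\sum_{s} p_s = 1,
\qquad
\sum_{s} p_s\, f_i(s) = \tilde{E}[f_i] \ \text{for each feature } f_i,

where \tilde{E}[f_i] denotes the empirical expectation of f_i.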
24
Boltzmann-Gibbs (Cont'd) Consider the Lagrangian of this constrained problem Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density function, where Z is the normalizing factor
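The Lagrangian and the resulting density were also lost from the slide; the usual derivation under the setup above runs as follows.

L = -\sum_{s} p_s \log p_s + \lambda_0 \Big( \sum_{s} p_s - 1 \Big) + \sum_{i} \lambda_i \Big( \sum_{s} p_s\, f_i(s) - \tilde{E}[f_i] \Big)

\frac{\partial L}{\partial p_s} = -\log p_s - 1 + \lambda_0 + \sum_{i} \lambda_i f_i(s) = 0
\;\Longrightarrow\;
p_s = \frac{1}{Z} \exp\!\Big( \sum_{i} \lambda_i f_i(s) \Big),
\qquad
Z = \sum_{s} \exp\!\Big( \sum_{i} \lambda_i f_i(s) \Big)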
25
Boltzmann-Gibbs (Cont'd) Maximum entropy (ME) Use of Boltzmann-Gibbs as the prior distribution Compute the posterior for given observed data and features f_i Use the optimal posterior to classify
26
Boltzmann-Gibbs (Cont'd) Maximum entropy Markov model (MEMM) The posterior consists of transition probability densities p(s | s′, X) Conditional random field (CRF) The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)
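For a state sequence s_1, …, s_T, the standard forms of these two posteriors (reconstructed from the usual definitions of MEMM and linear-chain CRF, not from the slides) are:

\text{MEMM:}\quad p(s_1, \ldots, s_T \mid X) = \prod_{t} p(s_t \mid s_{t-1}, X)
\qquad
\text{CRF:}\quad p(s_1, \ldots, s_T \mid X) = \frac{1}{Z(X)} \exp\!\Big( \sum_{t} \sum_{i} \lambda_i f_i(s_{t-1}, s_t, X, t) \Big)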
27
References R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2001. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001. P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.