Methods in Medical Image Analysis. Statistics of Pattern Recognition: Classification and Clustering. Some content provided by Milos Hauskrecht, University of Pittsburgh Computer Science.

ITK Questions?

Classification. What is classification?

Classification. Classification is simply the problem of separating different classes of data in some feature space. The figure shows a linear decision boundary; decision boundaries can be (and often are) of other types.

Classification. A quadratic decision boundary. These figures depict decision boundaries in two dimensions; in general, feature space is n-dimensional.

Features. Loosely stated, a feature is a value describing something about your data points (e.g., for pixels: intensity, local gradient, distance from a landmark, etc.). Multiple (n) features are put together to form a feature vector, which defines a data point's location in n-dimensional feature space.

Feature Space. The theoretical n-dimensional space occupied by n input raster objects (features). Each feature represents one dimension, and its values represent positions along one of the orthogonal coordinate axes in feature space. The set of feature values belonging to a data point defines a vector in feature space. - Explain feature space (features) as it pertains to image analysis.
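
As a small, hypothetical illustration of the idea above, the sketch below builds a per-pixel feature vector from intensity and local gradients; the specific feature choices, function name, and use of NumPy are my own assumptions, not part of the slides.

```python
import numpy as np

def pixel_features(image):
    """Build a (num_pixels, 3) feature matrix: intensity, gradient-x, gradient-y.

    A minimal sketch: each row is one pixel's feature vector, i.e. its
    location in a 3-dimensional feature space.
    """
    gy, gx = np.gradient(image.astype(float))          # local gradients along rows/cols
    feats = np.stack([image.ravel(), gx.ravel(), gy.ravel()], axis=1)
    return feats

# Example usage on a random "image"
img = np.random.rand(64, 64)
F = pixel_features(img)   # F.shape == (4096, 3): 4096 points in 3-D feature space
```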

Statistical Notation. Class probability distribution: p(x,y) = p(x | y) p(y). x: feature vector {x1, x2, x3, …, xn}. y: class. p(x | y): probability of x given y. p(x,y): joint probability of x and y.

Example: Binary Classification

Example: Binary Classification. Two class-conditional distributions: p(x | y = 0) and p(x | y = 1). Priors: p(y = 0) + p(y = 1) = 1.
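
To make the generative view concrete, here is a minimal sketch that samples a labeled two-class data set from p(x,y) = p(x | y) p(y). The particular means, covariances, and priors are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for the two class-conditional Gaussians and the priors.
mu = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 2.0])}
cov = {0: np.eye(2), 1: np.eye(2)}
prior = {0: 0.6, 1: 0.4}          # p(y=0) + p(y=1) = 1

def sample(n):
    """Draw n labeled points from p(x, y) = p(x | y) p(y)."""
    y = rng.choice([0, 1], size=n, p=[prior[0], prior[1]])   # sample the class first
    x = np.array([rng.multivariate_normal(mu[k], cov[k]) for k in y])
    return x, y

X, y = sample(500)
```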

Modeling Class Densities. The text concentrates on methods that use Gaussians to model the class densities.

Modeling Class Densities. - Note that these are identical Gaussians (i.e., equal covariance).

Generative Approach to Classification. Represent and learn the distribution p(x,y). Use it to define probabilistic discriminant functions, e.g. g0(x) = p(y = 0 | x) and g1(x) = p(y = 1 | x). - A discriminant function determines the class of a given data point.

Generative Approach to Classification. Typical model: p(x,y) = p(x | y) p(y), where p(x | y) are the class-conditional distributions (densities) and p(y) are the priors of the classes (the probability of class y). We want p(y | x), the posteriors of the classes. - You learn p(x,y), the joint probability of the data and the class; what you want is p(y | x), the posterior.

Class Modeling. We model the class distributions as multivariate Gaussians: x ~ N(μ0, Σ0) for y = 0 and x ~ N(μ1, Σ1) for y = 1. Priors are based on training data, or a distribution can be chosen that is expected to fit the data well (e.g., a Bernoulli distribution for a coin flip). - N(μ, Σ) denotes a normal (Gaussian) distribution.
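
A minimal sketch of learning these class models from labeled data, assuming the usual maximum-likelihood estimates (sample mean, sample covariance, empirical class frequency as the prior); the function name is my own.

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Estimate per-class Gaussian parameters and priors from labeled data."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = {
            "mu": Xk.mean(axis=0),                 # class mean
            "Sigma": np.cov(Xk, rowvar=False),     # class covariance
            "prior": len(Xk) / len(X),             # empirical prior p(y = k)
        }
    return params

params = fit_gaussian_classes(X, y)   # X, y as in the earlier sampling sketch
```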

Making a Class Decision. We need to define the discriminant functions gn(x). We have two basic choices: likelihood of the data (choose the class whose Gaussian best explains the input data x), or posterior of the class (choose the class with the higher posterior probability).
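
Written out, the two decision rules take the following standard forms (a reconstruction, since the slide's equations were rendered as images):

```latex
\hat{y}_{\text{ML}}  = \arg\max_{y} \; p(x \mid y)
\qquad
\hat{y}_{\text{MAP}} = \arg\max_{y} \; p(y \mid x) = \arg\max_{y} \; p(x \mid y)\, p(y)
```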

Calculating Posteriors. Use Bayes' Rule to obtain the posterior from the class-conditional density and the prior; in this case, the class-conditional densities are the Gaussian class models.
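
The equation shown on the slide is the standard form of Bayes' Rule for the two-class case:

```latex
p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}
                       {p(x \mid y = 0)\, p(y = 0) + p(x \mid y = 1)\, p(y = 1)}
```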

Linear Decision Boundary. Obtained when the class covariances are the same.
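
For reference, with equal covariances Σ0 = Σ1 = Σ the log posterior ratio reduces to a function that is linear in x, which is why the boundary is a line. This is the standard derivation, reconstructed here because the slide's equations were images:

```latex
\log\frac{p(y=1 \mid x)}{p(y=0 \mid x)}
  = (\mu_1 - \mu_0)^{\top}\Sigma^{-1}x
    - \tfrac{1}{2}\left(\mu_1^{\top}\Sigma^{-1}\mu_1 - \mu_0^{\top}\Sigma^{-1}\mu_0\right)
    + \log\frac{p(y=1)}{p(y=0)}
```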

Linear Decision Boundary

Linear Decision Boundary

Quadratic Decision Boundary. Obtained when the class covariances are different.
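
With unequal covariances the quadratic terms no longer cancel, so the discriminant keeps a term quadratic in x (again a standard reconstruction of the slide's missing equation):

```latex
\log\frac{p(y=1 \mid x)}{p(y=0 \mid x)}
  = -\tfrac{1}{2}(x-\mu_1)^{\top}\Sigma_1^{-1}(x-\mu_1)
    + \tfrac{1}{2}(x-\mu_0)^{\top}\Sigma_0^{-1}(x-\mu_0)
    + \tfrac{1}{2}\log\frac{|\Sigma_0|}{|\Sigma_1|}
    + \log\frac{p(y=1)}{p(y=0)}
```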

Quadratic Decision Boundary

Quadratic Decision Boundary. - OK, that's it for linear classifiers for now; on to more interesting stuff: clustering.

Clustering. Basic clustering problem: distribute data into k different groups such that data points similar to each other are in the same group. Similarity between points is defined in terms of some distance metric. Clustering is useful for: similarity/dissimilarity analysis (analyze which data points in the sample are close to each other) and dimensionality reduction (high-dimensional data is replaced with a group (cluster) label). - In many respects clustering is a similar problem to classification.

Clustering

Clustering

Distance Metrics. Euclidean distance, in some space (for our purposes, probably a feature space). A distance metric must fulfill three properties: identity (d(x,y) = 0 if and only if x = y, with d ≥ 0), symmetry (d(x,y) = d(y,x)), and the triangle inequality (d(x,z) ≤ d(x,y) + d(y,z)).

Distance Metrics. Common simple metrics: Euclidean and Manhattan. Both work for an arbitrary k-dimensional space.
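
The two metrics named above have the standard forms (reconstructed here, since the slide showed them as images):

```latex
d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}
\qquad
d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{k} \lvert x_i - y_i \rvert
```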

Clustering Algorithms. k-Nearest Neighbor, k-Means, Parzen Windows.

k-Nearest Neighbor. In essence, a classifier. Requires an input parameter k; in this algorithm, k indicates the number of neighboring points to take into account when classifying a data point. Requires training data.

k-Nearest Neighbor Algorithm. For each data point xn, choose its class by finding the most prominent class among the k nearest data points in the training set. Use any distance measure (usually a Euclidean distance measure).

k-Nearest Neighbor Algorithm. Figure: with 1 nearest neighbor, q1 falls in the concept represented by e1; with 5 nearest neighbors, q1 is classified as negative. Note: with k = 1 the decision regions form a Voronoi diagram.
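
A minimal NumPy sketch of the algorithm described above, assuming Euclidean distance and majority voting; the function and variable names are my own.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=5):
    """Classify a single point x by majority vote among its k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)     # distance to every training point
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Example usage with the labeled data X, y sampled earlier
label = knn_classify(X, y, np.array([1.0, 1.0]), k=5)
```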

k-Nearest Neighbor. Advantages: simple; general (can work with any distance measure you want). Disadvantages: requires well-classified training data; can be sensitive to the value of k chosen; all attributes are used in classification, even ones that may be irrelevant. Inductive bias: we assume that a data point should be classified the same as points near it.

k-Means. Suitable only when data points have continuous values. Groups are defined in terms of cluster centers (means). Requires an input parameter k; in this algorithm, k indicates the number of clusters to be created. Guaranteed to converge to at least a local optimum.

k-Means Algorithm. Randomly initialize k mean values. Repeat the next two steps until the means no longer change: (1) partition the data according to the current means, using a similarity measure; (2) move each mean to the center of the data in its current partition. - In other words, each data point is assigned to whichever mean it is closest to, and then each mean is moved to the center of its current set of data points. A sketch of this loop is given below.
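
A compact NumPy sketch of the loop just described, assuming Euclidean distance; the initialization (choosing k random data points as the starting means) is one common choice, not necessarily the one used in the lecture.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Alternate assignment and mean-update steps until the means stop moving.

    (This sketch does not handle the corner case of an empty cluster.)
    """
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]       # initial means
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each mean to the center of its partition.
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):                      # stop when means no longer change
            break
        means = new_means
    return means, labels

means, labels = k_means(X, k=2)
```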

k-Means

k-Means. Advantages: simple; general (can work with any distance measure you want); requires no training phase. Disadvantages: the result is very sensitive to the initial mean placement; can perform poorly on overlapping regions; doesn't work on features with non-continuous values (can't compute cluster means). Inductive bias: we assume that a data point should be classified the same as points near it.

Parzen Windows. Similar to k-nearest neighbor, but instead of using the k closest training data points, it uses all points within a kernel (window), weighting their contribution to the classification based on the kernel. As with our classification algorithms, we will consider a Gaussian kernel as the window.

Parzen Windows. Assume a region defined by a d-dimensional Gaussian of scale σ. We can define a window density function. Note that we consider all points in the training set, but if a point is outside the kernel, its weight will be essentially 0, negating its influence.
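
The window density function on the slide was shown as an image; the standard Gaussian-kernel form it refers to is:

```latex
\hat{p}(x) = \frac{1}{N} \sum_{n=1}^{N}
             \frac{1}{(2\pi\sigma^{2})^{d/2}}
             \exp\!\left(-\frac{\lVert x - x_n \rVert^{2}}{2\sigma^{2}}\right)
```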

Parzen Windows. Left: small window; high accuracy for the given training sample, but very specific, probably no good for new data. Right: much more general; for a large data set, probably more accurate.

Parzen Windows. Advantages: more robust than k-nearest neighbor; excellent accuracy and consistency. Disadvantages: how do you choose the size of the window? Alone, kernel density estimation techniques provide little insight into the data or the problem.
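
To tie the pieces together, here is a minimal sketch of Gaussian-window classification: estimate a Parzen density per class (as in the formula above) and pick the class with the larger density times prior. The function names and the choice of σ are illustrative assumptions.

```python
import numpy as np

def parzen_density(X_train, x, sigma=0.5):
    """Gaussian-kernel Parzen estimate of p(x) from the points in X_train."""
    d = X_train.shape[1]
    sq_dists = np.sum((X_train - x) ** 2, axis=1)
    kernel = np.exp(-sq_dists / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2) ** (d / 2)
    return kernel.mean()

def parzen_classify(X_train, y_train, x, sigma=0.5):
    """Choose the class with the largest estimated p(x | y) * p(y)."""
    scores = {}
    for k in np.unique(y_train):
        Xk = X_train[y_train == k]
        scores[k] = parzen_density(Xk, x, sigma) * (len(Xk) / len(X_train))
    return max(scores, key=scores.get)

# Example usage with the labeled data X, y sampled earlier
label = parzen_classify(X, y, np.array([1.0, 1.0]))
```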