Bios 760R, Lecture 1: Overview
- Overview of the course
- Classification and Clustering
- The “curse of dimensionality”
- Reminder of some background knowledge
Tianwei Yu, RSPH Room 334

Instructor: Tianwei Yu
Office: GCR Room
Office Hours: by appointment
Teaching Assistant: Mr. Qingpo Cai
Office Hours: TBA
Course Website:
Course Outline

Overview
Focus of the course:
- Classification
- Clustering
- Dimension reduction
The Clustering lectures will be given by Dr. Elizabeth Chong.
Lecture 1: Overview
Lecture 2: Bayesian decision theory
Lecture 3: Similarity measures, Clustering (1)
Lecture 4: Clustering (2)
Lecture 5: Density estimation and classification
Lecture 6: Linear machine
Lecture 7: Support vector machine
Lecture 8: Generalized additive model
Lecture 9: Boosting
Lecture 10: Tree and forest
Lecture 11: Bump hunting; Neural network (1)
Lecture 12: Neural network (2)
Lecture 13: Model generalization
Lecture 14: Dimension reduction (1)
Lecture 15: Dimension reduction (2)
Lecture 16: Some applications

Overview
References:
Textbook: The Elements of Statistical Learning. Hastie, Tibshirani & Friedman. Free at:
Other references:
- Pattern Classification. Duda, Hart & Stork.
- Data Clustering: Theory, Algorithms, and Applications. Gan, Ma & Wu.
- Applied Multivariate Statistical Analysis. Johnson & Wichern.
Evaluation:
- Three projects (30% each)
- Class participation (10%)

Overview
Machine Learning / Data Mining
Supervised learning (“direct data mining”):
- Classification
- Estimation
- Prediction
Unsupervised learning (“indirect data mining”):
- Clustering
- Association rules
- Description, dimension reduction and visualization
Semi-supervised learning
Modified from Figure 1.1 of the book by Gan, Ma and Wu.

Overview
In supervised learning, the problem is well defined. Given a set of observations {x_i, y_i}, estimate the density Pr(Y, X). Usually the goal is to find the model/parameters that minimize a loss. A common loss is the Expected Prediction Error (EPE), which under squared-error loss is minimized at the conditional mean f(x) = E(Y | X = x). Objective criteria exist to measure the success of a supervised learning mechanism.
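Written out under squared-error loss (the standard form in Hastie, Tibshirani & Friedman; reconstructed here, since the slide's formulas are not in the transcript):

\[
\mathrm{EPE}(f) = E\left[Y - f(X)\right]^2 = E_X\, E_{Y|X}\!\left(\left[Y - f(X)\right]^2 \mid X\right),
\]
\[
f(x) = \operatorname*{argmin}_{c}\; E_{Y|X}\!\left(\left[Y - c\right]^2 \mid X = x\right) = E\left(Y \mid X = x\right),
\]

i.e. minimizing pointwise at each x gives the conditional expectation, also known as the regression function.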

Overview
In unsupervised learning, there is no output variable; all we observe is a set {x_i}. The goal is to infer Pr(X) and/or some of its properties.
When the dimension is low, nonparametric density estimation is possible.
When the dimension is high, we may need to find simple properties without density estimation, or apply strong assumptions to estimate the density.
There are no objective criteria from the data itself. To justify a result:
- Heuristic arguments
- External information
- Evaluation based on properties of the data

Classification
The general scheme. An example.

Classification
In most cases, a single feature is not enough to generate a good classifier.

Classification
Two extremes: overly rigid and overly flexible classifiers.

Classification
Goal: an optimal trade-off between model simplicity and training-set performance. This is similar to AIC / BIC / ... model selection in regression.

Classification
An example of the overall scheme involving classification:

Classification
A classification project: a systematic view.

Clustering
Assign observations into clusters, such that those within each cluster are more closely related to one another than objects assigned to different clusters.
- Detect data relations
- Find natural hierarchy
- Ascertain whether the data consist of distinct subgroups
- ...

Clustering
Mathematically, we hope to estimate the number of clusters k and the membership matrix U. In fuzzy clustering, the memberships are allowed to be fractional.
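In standard notation (reconstructed, not transcribed from the slide): for n observations and k clusters, U = [u_{ic}] with

\[
u_{ic} \in \{0, 1\}, \qquad \sum_{c=1}^{k} u_{ic} = 1 \ \text{ for each } i \qquad \text{(crisp clustering)},
\]
\[
u_{ic} \in [0, 1], \qquad \sum_{c=1}^{k} u_{ic} = 1 \ \text{ for each } i \qquad \text{(fuzzy clustering)},
\]

so a fuzzy membership u_{ic} is the degree to which observation i belongs to cluster c.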

Clustering
Some clusters are well represented by a center + spread model; some are not.

Curse of Dimensionality
Bellman, R.E.: In p dimensions, to get a hypercube with volume r, the edge length needed is r^(1/p). In 10 dimensions, to capture 1% of the data to get a local average, we need 63% of the range of each input variable.
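A quick worked check of that 63% figure, using the edge-length formula above:

\[
e_p(r) = r^{1/p}, \qquad e_{10}(0.01) = 0.01^{1/10} \approx 0.63, \qquad e_{10}(0.10) = 0.10^{1/10} \approx 0.80,
\]

so even a neighborhood meant to capture only 1% of the data in 10 dimensions must span about 63% of each coordinate's range; it is no longer "local" in any meaningful sense.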

Curse of Dimensionality
In other words, to get a “dense” sample, if we need N = 100 samples in 1 dimension, then we need N = 100^10 samples in 10 dimensions.
In high dimensions the data are always sparse and do not support density estimation.
More data points are close to the boundary than to any other data point, so prediction is much harder near the edge of the training sample.

Curse of Dimensionality
Estimating a 1D density with 40 data points (standard normal distribution).

Curse of Dimensionality
Estimating a 2D density with 40 data points (2D normal distribution, zero mean, identity covariance matrix).
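The slides show these two fits as figures. A rough numerical sketch of the same comparison (hypothetical code; the Gaussian kernel density estimator from SciPy, the random seed, and the evaluation points are illustrative choices, not from the slides):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)              # arbitrary seed

# 1D: 40 points from N(0, 1); compare the KDE at 0 with the true density 1/sqrt(2*pi) ~ 0.40
x1 = rng.standard_normal(40)
print(gaussian_kde(x1)(0.0)[0], 1 / np.sqrt(2 * np.pi))

# 2D: 40 points from N(0, I_2); compare the KDE at the origin with the true density 1/(2*pi) ~ 0.16
x2 = rng.standard_normal((2, 40))           # gaussian_kde expects shape (dimension, sample size)
print(gaussian_kde(x2)(np.zeros((2, 1)))[0], 1 / (2 * np.pi))

With the same 40 points, the 2D estimate is typically much farther from the true density than the 1D estimate, which is the point of these two slides.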

Curse of Dimensionality
Another example: the EPE of the nearest-neighbor predictor. To estimate E(Y|X=x), take the average of the data points close to a given x, i.e. its k nearest neighbors. This assumes f(x) is well approximated by a locally constant function. When N is large, the neighborhood is small and the prediction is accurate.
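A minimal sketch of that local-averaging idea (hypothetical code; the function name, the toy target sin(x), and k = 15 are illustrative choices, not from the slides):

import numpy as np

def knn_average(x0, X, y, k):
    """Estimate E(Y | X = x0) by averaging y over the k training points nearest to x0."""
    dist = np.linalg.norm(X - x0, axis=1)    # Euclidean distances from x0 to every training point
    nearest = np.argsort(dist)[:k]           # indices of the k nearest neighbors
    return y[nearest].mean()                 # locally constant approximation of f near x0

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))        # 200 training points in 1 dimension
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(200)
print(knn_average(np.array([1.0]), X, y, k=15), np.sin(1.0))   # estimate vs. the true f(1)

In 1 dimension the 15 nearest neighbors of x = 1 sit in a small interval, so the locally constant assumption is reasonable; in 10 dimensions the same 15 neighbors would be spread over a large region, which is exactly where the curse of dimensionality bites.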

Curse of Dimensionality
Just a reminder: the expected prediction error contains variance and bias components, under the model Y = f(X) + ε.
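The decomposition referred to here, written out in its standard form (reconstructed; assuming ε has mean zero and variance σ_ε², independent of X, and that the fitted f̂ comes from an independent training sample):

\[
\mathrm{EPE}(x_0) = E\!\left[\left(Y - \hat f(x_0)\right)^2 \mid X = x_0\right]
= \sigma_\varepsilon^2 + \left[E \hat f(x_0) - f(x_0)\right]^2 + \mathrm{Var}\!\left(\hat f(x_0)\right),
\]

i.e. irreducible error + squared bias + variance.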

Curse of Dimensionality
Data: uniform in [−1, 1]^p.

Curse of Dimensionality
We have talked about the curse of dimensionality in the sense of density estimation. In a classification problem, we do not necessarily need density estimation.
- Generative model: cares about the mechanism, i.e. the class density functions. It learns p(X, y) and predicts using p(y|X). In high dimensions this is difficult.
- Discriminative model: cares about the boundary. It learns p(y|X) directly, potentially using only a subset of X.
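In symbols (a standard restatement of the two bullets above, not transcribed from the slide):

\[
\text{Generative: } p(X, y) = p(X \mid y)\, p(y), \qquad
p(y \mid X) = \frac{p(X \mid y)\, p(y)}{\sum_{y'} p(X \mid y')\, p(y')},
\]

which requires modeling the class-conditional densities p(X | y), whereas a discriminative model specifies p(y | X) directly and never estimates a density over X.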

Curse of Dimensionality
Example: classifying belt fish and carp. Looking at the length/width ratio is enough. Why should we care how many teeth each kind of fish has, or what shape their fins have?
(Diagram: a generative model and a discriminative model, each relating the label y to features X1, X2, X3, ...)

Reminder of some results for random vectors
The multivariate Gaussian distribution. The covariance matrix (not limited to the Gaussian).
* The Gaussian is fully defined by its mean vector and covariance matrix (first and second moments).
Wiki
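The two objects named on this slide, in standard notation (reconstructed; X is a p-dimensional random vector with mean vector μ and covariance matrix Σ):

\[
f(x) = (2\pi)^{-p/2}\, |\Sigma|^{-1/2} \exp\!\left\{-\tfrac{1}{2}(x - \mu)^{\mathsf T} \Sigma^{-1} (x - \mu)\right\},
\]
\[
\Sigma = \mathrm{Cov}(X) = E\!\left[(X - \mu)(X - \mu)^{\mathsf T}\right], \qquad \sigma_{ij} = \mathrm{Cov}(X_i, X_j),
\]

where the covariance matrix is defined for any random vector with finite second moments, not only the Gaussian.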

Reminder of some results for random vectors
The correlation matrix, and its relationship with the covariance matrix.
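In standard notation (reconstructed): with V = diag(σ_11, ..., σ_pp) holding the variances,

\[
\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}, \qquad
\boldsymbol{\rho} = V^{-1/2}\, \Sigma\, V^{-1/2}, \qquad
\Sigma = V^{1/2}\, \boldsymbol{\rho}\, V^{1/2},
\]

where V^{1/2} = diag(√σ_11, ..., √σ_pp).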

Reminder of some results for random vectors
Quadratic form. A linear combination of the elements of a random vector with mean μ and variance-covariance matrix Σ. A 2-D example.
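The standard results behind this slide (reconstructed): for a fixed vector of coefficients c,

\[
E(c^{\mathsf T} X) = c^{\mathsf T} \mu, \qquad
\mathrm{Var}(c^{\mathsf T} X) = c^{\mathsf T} \Sigma\, c \quad \text{(a quadratic form in } c\text{)},
\]

and in the 2-D case, with c = (c_1, c_2)^T,

\[
\mathrm{Var}(c_1 X_1 + c_2 X_2) = c_1^2 \sigma_{11} + 2 c_1 c_2 \sigma_{12} + c_2^2 \sigma_{22}.
\]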

Reminder of some results for random vectors
A “new” random vector generated from linear combinations of a random vector:
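In matrix form (a standard result, reconstructed): if Z = AX + b for a fixed matrix A and fixed vector b, then

\[
E(Z) = A\mu + b, \qquad \mathrm{Cov}(Z) = A\, \Sigma\, A^{\mathsf T},
\]

so every linear combination of X inherits its mean and covariance from μ and Σ.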

Reminder of some results for random vectors
Let A be a k × k symmetric matrix; then it has k pairs of eigenvalues and eigenvectors, and A can be decomposed in terms of them (spectral decomposition). Positive-definite matrix:
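The decomposition and the definiteness condition, in standard notation (reconstructed): writing the eigenpairs as (λ_i, e_i), P = [e_1, ..., e_k], and Λ = diag(λ_1, ..., λ_k),

\[
A = \sum_{i=1}^{k} \lambda_i\, e_i e_i^{\mathsf T} = P \Lambda P^{\mathsf T}, \qquad P^{\mathsf T} P = I,
\]

and A is positive definite when x^T A x > 0 for every x ≠ 0, which is equivalent to all eigenvalues λ_i being strictly positive.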

Reminder of some results for random vectors
Square root matrix of a positive-definite matrix. Inverse of a positive-definite matrix.
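Both follow from the spectral decomposition above (standard results, reconstructed):

\[
A^{1/2} = \sum_{i=1}^{k} \sqrt{\lambda_i}\; e_i e_i^{\mathsf T} = P \Lambda^{1/2} P^{\mathsf T}, \qquad
A^{-1} = \sum_{i=1}^{k} \frac{1}{\lambda_i}\; e_i e_i^{\mathsf T} = P \Lambda^{-1} P^{\mathsf T},
\]

so that A^{1/2} A^{1/2} = A and A A^{-1} = I.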

Reminder of some results for random vectors

Reminder of some results for random vectors
Proof of the first (and second) point of the previous slide.

Reminder of some results for random vectors
With a sample of the random vector:

Reminder of some results for random vectors
To estimate the mean vector and covariance matrix:
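The usual estimators referred to here (standard form, reconstructed): given a sample x_1, ..., x_n of the random vector,

\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\mathsf T},
\]

the sample mean vector and the (unbiased) sample covariance matrix.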

Reminder of the ROC curve
(J Clin Pathol 2009;62:1-5)
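A minimal sketch of how an ROC curve is built by sweeping a decision threshold over a classifier's scores (hypothetical code; the toy labels and scores are made up for illustration):

import numpy as np

def roc_points(scores, labels):
    # labels: 1 = positive (e.g. diseased), 0 = negative; higher score = more evidence for positive
    thresholds = np.sort(np.unique(scores))[::-1]        # sweep from high to low
    P, N = np.sum(labels == 1), np.sum(labels == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t                               # call "positive" at or above the threshold
        tpr.append(np.sum(pred & (labels == 1)) / P)     # sensitivity
        fpr.append(np.sum(pred & (labels == 0)) / N)     # 1 - specificity
    return np.array(fpr), np.array(tpr)

labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.5, 0.2, 0.1, 0.05])
fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)    # trapezoidal area under the curve
print(auc)

Plotting tpr against fpr traces the ROC curve; a perfect classifier hugs the top-left corner, and a random one follows the diagonal.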