Download presentation
Presentation is loading. Please wait.
Published byMark Mills Modified over 8 years ago
1
1 Bios 760R, Lecture 1 Overview Overview of the course Classification and Clustering The “curse of dimensionality” Reminder of some background knowledge Tianwei Yu RSPH Room 334 Tianwei.yu@emory.edu
2
Instructor: Tianwei Yu Office: GCR Room 334 Email: tianwei.yu@emory.edutianwei.yu@emory.edu Office Hours: by appointment. Teaching Assistant: Mr. Qingpo Cai Office Hours: TBA Course Website: http://web1.sph.emory.edu/users/tyu8/740/index.htm Course Outline
3
3 Overview Focus of the course: Classification Clustering Dimension reduction The Clustering lectures will be given by Dr. Elizabeth Chong. Lecture 1Overview Lecture 2Bayesian decision theory Lecture 3Similarity measures, Clustering (1) Lecture 4Clustering (2) Lecture 5Density estimation and classification Lecture 6Linear machine Lecture 7Support vector machine Lecture 8Generalized additive model Lecture 9Boosting Lecture 10Tree and forest Lecture 11Bump hunting; Neural network (1) Lecture 12Neural network (2) Lecture 13Model generalization Lecture 14Dimension reduction (1) Lecture 15Dimension reduction (2) Lecture 16Some applications
4
4 Overview References: Textbook: The elements of statistical learning. Hastie, Tibshirani & Friedman. Free at: http://statweb.stanford.edu/~tibs/ElemStatLearn/ Other references: Pattern classification. Duda, Hart & Stork. Data clustering: theory, algorithms and application. Gan, Ma & Wu. Applied multivariate statistical analysis. Johnson & Wichern. Evaluation: Three projects (30% each) Class participation (10%)
5
5 Machine Learning /Data mining Supervised learning ”direct data mining” Unsupervised learning ”indirect data mining” Classification Estimation Prediction Clustering Association rules Description, dimension reduction and visualization Modified from Figure 1.1 from by Gan, Ma and Wu Overview Semi-supervised learning
6
6 In supervised learning, the problem is well-defined: Given a set of observations {x i, y i }, estimate the density Pr(Y, X) Usually the goal is to find the model/parameters to minimize a loss, A common loss is Expected Prediction Error: It is minimized at Objective criteria exists to measure the success of a supervised learning mechanism. Overview
7
7 In unsupervised learning, there is no output variable, all we observe is a set {x i }. The goal is to infer Pr(X) and/or some of its properties. When the dimension is low, nonparametric density estimation is possible; When the dimension is high, may need to find simple properties without density estimation, or apply strong assumptions to estimate the density. There is no objective criteria from the data itself; to justify a result:> Heuristic arguments, > External information, > Evaluate based on properties of the data Overview
8
8 Classification The general scheme. An example.
9
9 Classification In most cases, a single feature is not enough to generate a good classifier.
10
10 Classification Two extremes: overly rigid and overly flexible classifiers.
11
11 Classification Goal: an optimal trade-off between model simplicity and training set performance. This is similar to the AIC / BIC / …… model selection in regression.
12
12 Classification An example of the overall scheme involving classification:
13
13 Classification A classification project: a systematic view.
14
14 Assign observations into clusters, such that those within each cluster are more closely related to one another than objects assigned to different clusters. Clustering Detect data relations Find natural hierarchy Ascertain the data consists of distinct subgroups …...
15
15 Clustering Mathematically, we hope to estimate the number of clusters k, and the membership matrix U In fuzzy clustering, we have
16
16 Clustering Some clusters are well-represented by center+spread model; Some are not.
17
17 Curse of Dimensionality Bellman R.E., 1961. In p-dimensions, to get a hypercube with volume r, the edge length needed is r 1/p. In 10 dimensions, to capture 1% of the data to get a local average, we need 63% of the range of each input variable.
18
18 In other words, To get a “dense” sample, if we need N=100 samples in 1 dimension, then we need N=100 10 samples in 10 dimensions. In high-dimension, the data is always sparse and do not support density estimation. More data points are closer to the boundary, rather than to any other data point prediction is much harder near the edge of the training sample. Curse of Dimensionality
19
19 Curse of Dimensionality Estimating a 1D density with 40 data points. Standard normal distribution.
20
20 Curse of Dimensionality Estimating a 2D density with 40 data points. 2D normal distribution; zero mean; variance matrix is identity matrix.
21
21 Curse of Dimensionality Another example – the EPE of the nearest neighbor predictor. To find E(Y|X=x), take the average of data points close to a given x, i.e. the top k nearest neighbors of x Assumes f(x) is well-approximated by a locally constant function When N is large, the neighborhood is small, the prediction is accurate.
22
22 Curse of Dimensionality Just a reminder, the expected prediction error contains variance and bias components. Under model: Y=f(X)+ε
23
23 Curse of Dimensionality Data: Uniform in [−1, 1] p
24
24 Curse of Dimensionality
25
25 Curse of Dimensionality We have talked about the curse of dimensionality in the sense of density estimation. In a classification problem, we do not necessarily need density estimation. Generative model --- care about the mechanism: class density function. Learns p(X, y), and predict using p(y|X). In high dimensions, this is difficult. Discriminative model --- care about boundary. Learns p(y|X) directly, potentially with a subset of X.
26
26 Curse of Dimensionality Example: Classifying belt fish and carp. Looking at the length/width ratio is enough. Why should we care how many teeth each kind of fish have, or what shape fins they have? X1X2X3…X1X2X3… y Generative model y Discriminative model
27
27 Reminder of some results for random vectors The multivariate Gaussian distribution: The covariance matrix, not limited to Gaussian. * Gaussian is fully defined by mean vector and covariance matrix (first and second moments). Wiki
28
28 The correlation matrix: Relationship with covariance matrix: Reminder of some results for random vectors
29
29 Reminder of some results for random vectors Quadratic form: A linear combination of the elements of a random vector with mean μ and variance-covariance matrix Σ: 2-D example:
30
30 Reminder of some results for random vectors A “new” random vector generated from linear combinations of a random vector:
31
Reminder of some results for random vectors Let A be a kxk square symmetrix matrix, then it has k pairs of eigenvalues and eigenvectors. A can be decomposed as: Positive-definite matrix:
32
Reminder of some results for random vectors Square root matrix of a positive-definite matrix: Inverse of a positive-definite matrix:
33
33 Reminder of some results for random vectors
34
34 Reminder of some results for random vectors Proof of the first (and second) point of the previous slide.
35
35 Reminder of some results for random vectors With a sample of the random vector:
36
36 Reminder of some results for random vectors To estimate mean vector and covariance matrix:
37
J Clin Pathol 2009;62:1-5 Reminder of the ROC curve
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.