
Machine Learning CS 165B Spring 2012

Course outline
– Introduction (Ch. 1)
– Concept learning (Ch. 2)
– Decision trees (Ch. 3)
– Ensemble learning
– Neural Networks (Ch. 4)
– Linear classifiers
– Support Vector Machines
– Bayesian Learning (Ch. 6)
– Bayesian Networks
– Clustering
– Computational learning theory
Midterm on Wednesday

Midterm: Wednesday, May 2. Topics: everything through today's lecture.
Content:
– (40%) Short questions
– (20%) Concept learning and hypothesis spaces
– (20%) Decision trees
– (20%) Artificial Neural Networks
A practice midterm will be posted today. You can bring one regular 2-sided sheet and a calculator.

Background on Probability & Statistics
– Random variable, sample space, event (union, intersection)
– Probability distribution: discrete (pmf), continuous (pdf), cumulative (cdf)
– Conditional probability, Bayes rule; e.g., P(C ≥ 2 | M = 0) for 3 coin tosses, where C is the count of heads and M = 1 iff all coins match
– Independence of random variables: are C and M independent?
– Puzzle: choose which of two envelopes contains the higher number, when you are allowed to peek at one of them
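The conditional probability in the coin example can be checked by brute-force enumeration of the eight equally likely outcomes. A minimal sketch (my own check, not part of the slides):

```python
# Brute-force check of P(C >= 2 | M = 0) for three fair coins (illustrative sketch):
# C = number of heads, M = 1 iff all three coins match.
from itertools import product

outcomes = list(product([0, 1], repeat=3))                    # 8 equally likely tosses
given_m0 = [o for o in outcomes if len(set(o)) > 1]           # condition on M = 0 (not all match)
p = sum(1 for o in given_m0 if sum(o) >= 2) / len(given_m0)   # P(C >= 2 | M = 0)
print(p)                                                      # 3/6 = 0.5
```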

Background on Probability & Statistics
– Common distributions: Bernoulli, Uniform, Binomial, Gaussian (Normal), Poisson
– Expected value, variance, standard deviation

Approaches to classification
– Discriminant functions: learn the boundary between classes
– Infer conditional class probabilities: choose the most probable class
What kind of classifier is logistic regression?

Discriminant functions can be arbitrary functions of x, such as: nearest neighbor, decision trees, linear functions, nonlinear functions. Sometimes, transform the data and then learn a linear function.

High-dimensional data: gene expression, face images, handwritten digits [example images shown on the slide].

Why feature reduction?
– Most machine learning and data mining techniques may not be effective for high-dimensional data (curse of dimensionality): query accuracy and efficiency degrade rapidly as the dimension increases.
– The intrinsic dimension may be small. For example, the number of genes responsible for a certain type of disease may be small.

Why feature reduction?
– Visualization: projection of high-dimensional data onto 2D or 3D
– Data compression: efficient storage and retrieval
– Noise removal: positive effect on query accuracy

Applications of feature reduction: face recognition, handwritten digit recognition, text mining, image retrieval, microarray data analysis, protein classification.

Feature reduction algorithms
– Unsupervised: Latent Semantic Indexing (LSI, truncated SVD), Independent Component Analysis (ICA), Principal Component Analysis (PCA)
– Supervised: Linear Discriminant Analysis (LDA)

Principal Component Analysis (PCA): summarization of data with many variables by a smaller set of derived (synthetic, composite) variables. PCA is based on SVD, so we look at SVD first.

Singular Value Decomposition (SVD). Intuition: find the axis that shows the greatest variation, and project all points onto this axis [figure: data in the (f1, f2) plane with the new axes e1, e2].

SVD: mathematical formulation
Let A be an m × n real matrix of m n-dimensional points.
SVD decomposition: $A = U \Sigma V^T$
– U (m × m) is orthogonal: $U^T U = I$
– V (n × n) is orthogonal: $V^T V = I$
– $\Sigma$ (m × n) has r positive non-zero singular values in descending order on its diagonal
Columns of U are the orthogonal eigenvectors of $AA^T$ (called the left singular vectors of A):
– $AA^T = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma \Sigma^T U^T = U \Sigma^2 U^T$
Columns of V are the orthogonal eigenvectors of $A^T A$ (called the right singular vectors of A):
– $A^T A = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^T \Sigma V^T = V \Sigma^2 V^T$
$\Sigma$ contains the square roots of the eigenvalues of $AA^T$ (or $A^T A$); these are called the singular values (positive real).
r is the rank of A, $AA^T$, and $A^T A$. U defines the column space of A, V the row space.
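As a quick numerical sanity check of these properties, here is a minimal NumPy sketch; the matrix size and random data are illustrative assumptions, not anything from the lecture:

```python
# Minimal sketch: compute the SVD and verify the properties listed on the slide.
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
A = rng.normal(size=(m, n))                  # m points in n dimensions, one per row

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # s holds the singular values, descending

Sigma = np.zeros((m, n))                     # rebuild the m x n Sigma
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ Sigma @ Vt))        # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(m)))       # U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(n)))     # V is orthogonal
print(np.allclose(s**2,                      # sigma_i^2 = eigenvalues of A^T A
                  np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]))
```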

SVD example: a small numerical matrix A is factored as $A = U \Sigma V^T$ [the slides show the numerical factors]; the first right singular vector $v_1$ is the axis of greatest variation, and the leading singular value measures the variance ('spread') on the $v_1$ axis.

Dimensionality reduction: set the smallest singular values to zero; the corresponding columns of U and rows of $V^T$ can then be dropped, leaving an approximation $A \approx U \Sigma V^T$ with smaller factors [the slides step through this on the worked example].

Dimensionality reduction, 'spectral decomposition' of the matrix: $A = U \Sigma V^T = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$, a sum of r terms, each the product of an m × 1 column of U, a singular value, and a 1 × n row of $V^T$.

Approximation / dimensionality reduction: keep only the first few terms of the sum (how many?), assuming $\sigma_1 \ge \sigma_2 \ge \dots$

A heuristic: keep 80-90% of the 'energy' ($= \sum_i \sigma_i^2$), assuming $\sigma_1 \ge \sigma_2 \ge \dots$
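A short sketch of the energy heuristic; the demo matrix, the 90% threshold, and the helper name choose_k are my own illustrative choices:

```python
# Pick the smallest k whose leading singular values carry the requested share of sum(sigma_i^2).
import numpy as np

def choose_k(singular_values, energy=0.90):
    """Smallest k such that the first k singular values keep `energy` of the total energy."""
    power = singular_values**2
    cumulative = np.cumsum(power) / power.sum()
    return int(np.searchsorted(cumulative, energy) + 1)

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 20)) @ rng.normal(size=(20, 20))   # arbitrary demo matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = choose_k(s, energy=0.90)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # keep only the first k terms of the sum
print(k, np.linalg.norm(A - A_k) / np.linalg.norm(A))       # relative approximation error
```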

Dimensionality reduction
– Matrix V in the SVD decomposition ($A = U \Sigma V^T$) is used to transform the data.
– $AV$ (= $U\Sigma$) defines the transformed dataset; for a new data element x, $xV$ defines the transformed data.
– Keeping the first k (k < n) dimensions amounts to keeping only the first k columns of V.
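A brief sketch of this projection step; the shapes, the value of k, and the random data are assumptions for illustration:

```python
# Transform data with the top-k right singular vectors.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 10))               # 100 data points, 10 features
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
V_k = Vt[:k, :].T                            # first k columns of V (n x k)
A_transformed = A @ V_k                      # transformed dataset, equals U_k * sigma_k
print(np.allclose(A_transformed, U[:, :k] * s[:k]))

x_new = rng.normal(size=(1, 10))             # a new data element
x_transformed = x_new @ V_k                  # its k-dimensional representation
print(x_transformed.shape)                   # (1, 3)
```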

Optimality of SVD
Let $A = U \Sigma V^T = \sum_i \sigma_i u_i v_i^T$. The Frobenius norm of an m × n matrix M is $\|M\|_F = \sqrt{\sum_{i,j} M[i,j]^2}$, so $\|A\|_F = \sqrt{\sum_i \sigma_i^2}$. Let $A_k$ be the above summation truncated to the k largest singular values.
Theorem [Eckart and Young]: among all m × n matrices B of rank at most k, $\|A - A_k\|_F \le \|A - B\|_F$ and $\|A - A_k\|_2 \le \|A - B\|_2$.
'Residual' variation is information in A that is not retained. Balancing act between:
– clarity of representation, ease of understanding
– oversimplification: loss of important or relevant information
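One consequence of the spectral decomposition is that the Frobenius error of the rank-k truncation equals the square root of the sum of the discarded squared singular values. A quick numerical check (the random matrix and k are arbitrary):

```python
# Check: Frobenius residual of the rank-k truncation = sqrt(sum of discarded sigma_i^2).
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 12))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 4
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))   # residual = discarded "energy"
```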

Principal Components Analysis (PCA)
– Center the dataset by subtracting the mean of each variable; let matrix A be the result.
– Compute the covariance matrix $A^T A$ (up to a constant scale factor).
– Project the dataset along a subset of the eigenvectors of $A^T A$; matrix V in the SVD decomposition contains these.
– Also known as the Karhunen-Loève (K-L) transform.
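A minimal sketch of this recipe, using the SVD of the centered data instead of forming $A^T A$ explicitly; the function name pca_fit_transform, the shapes, and k are my own choices:

```python
# Center the data, then take the top-k right singular vectors as principal axes.
import numpy as np

def pca_fit_transform(X, k):
    """Center X (m samples x n features) and project onto the top-k principal axes."""
    mean = X.mean(axis=0)
    A = X - mean                                      # centered data
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    components = Vt[:k, :]                            # rows are eigenvectors of A^T A
    return A @ components.T, components, mean

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated demo data
Z, components, mean = pca_fit_transform(X, k=2)
print(Z.shape)                                        # (200, 2)
```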

Principal Component Analysis (PCA) takes a data matrix of m objects by n variables, which may be correlated, and summarizes it by uncorrelated axes (principal components, or principal axes) that are linear combinations of the original n variables. The first k components display as much as possible of the variation among objects.

2D example of PCA [scatter plot of two correlated variables shown on the slide].

The configuration is centered: each variable is adjusted to a mean of zero (by subtracting the mean from each value).

Principal components are computed:
– PC 1 has the highest possible variance (9.88)
– PC 2 has a variance of 3.03
– PC 1 and PC 2 have zero covariance

Each principal axis is a linear combination of the original two variables [slide shows the PC 1 and PC 2 axes overlaid on the centered data].
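A small sketch of the same 2D computation on made-up data (the slide's specific variances, 9.88 and 3.03, come from its own dataset, which is not reproduced here); it shows the PC variances in descending order and an essentially zero covariance between PC 1 and PC 2:

```python
# 2D PCA on synthetic correlated data: check variances and zero PC covariance.
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal(mean=[3.0, -1.0],
                            cov=[[9.0, 4.0], [4.0, 4.0]], size=500)

A = X - X.mean(axis=0)                        # center each variable at zero
U, s, Vt = np.linalg.svd(A, full_matrices=False)
scores = A @ Vt.T                             # coordinates on PC 1 and PC 2

print(np.var(scores, axis=0, ddof=1))         # PC variances, largest first
print(np.cov(scores.T)[0, 1])                 # covariance of PC 1 and PC 2: ~0
```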

Feature reduction algorithms
– Unsupervised: Latent Semantic Indexing (LSI, truncated SVD), Independent Component Analysis (ICA), Principal Component Analysis (PCA)
– Supervised: Linear Discriminant Analysis (LDA)

Course outline
– Introduction (Ch. 1)
– Concept learning (Ch. 2)
– Decision trees (Ch. 3)
– Ensemble learning
– Neural Networks (Ch. 4)
– Linear classifiers
– Support Vector Machines
– Bayesian Learning (Ch. 6)
– Bayesian Networks
– Clustering
– Computational learning theory

Midterm analysis
– Grade distribution
– Solution to the ANN problem
– Makeup problem on Wednesday: 20 minutes, 15 points, bring a calculator

Fisher's linear discriminant
– A simple linear discriminant function is a projection of the data down to 1-D, so choose the projection that gives the best separation of the classes. What do we mean by "best separation"?
– An obvious direction to choose is the direction of the line joining the class means. But if the main direction of variance within each class is not orthogonal to this line, this will not give good separation (see the next figure).
– Fisher's method chooses the direction that maximizes the ratio of between-class variance to within-class variance. This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions).

Fisher's linear discriminant [figure]: when projected onto the line joining the class means, the classes are not well separated. Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.

Fisher's linear discriminant (derivation)
Find the best direction w for accurate classification. A measure of the separation between the projected points is the difference of the sample means. If $m_i$ is the d-dimensional sample mean of class $D_i$, given by $m_i = \frac{1}{n_i}\sum_{x \in D_i} x$, and $\tilde{m}_i$ is the sample mean of the projected points $Y_i$, given by $\tilde{m}_i = \frac{1}{n_i}\sum_{y \in Y_i} y = w^T m_i$, then the difference of the projected sample means is $|\tilde{m}_1 - \tilde{m}_2| = |w^T (m_1 - m_2)|$.

Fisher's linear discriminant (derivation)
Define the scatter for the projection: $\tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2$. Choose w in order to maximize $J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$. Define scatter matrices $S_i$ (i = 1, 2) and $S_W$ by $S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T$ and $S_W = S_1 + S_2$; $S_W$ is called the total within-class scatter.

Fisher's linear discriminant (derivation)
We obtain $\tilde{s}_i^2 = \sum_{x \in D_i} (w^T x - w^T m_i)^2 = w^T S_i w$, and hence $\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$.

Fisher's linear discriminant (derivation)
Similarly, $(\tilde{m}_1 - \tilde{m}_2)^2 = w^T S_B w$, where $S_B = (m_1 - m_2)(m_1 - m_2)^T$ is the between-class scatter. In terms of $S_B$ and $S_W$, J(w) can be written as $J(w) = \frac{w^T S_B w}{w^T S_W w}$.


Fisher's linear discriminant (derivation)
A vector w that maximizes J(w) must satisfy $S_B w = \lambda S_W w$, a generalized eigenvalue problem. In the case that $S_W$ is nonsingular, $w = S_W^{-1}(m_1 - m_2)$ (up to scale).
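A compact sketch of this closed-form solution on synthetic two-class data; the Gaussian parameters are my own illustration, not the course's example:

```python
# Fisher's direction w = S_w^{-1} (m1 - m2) on synthetic two-class data.
import numpy as np

rng = np.random.default_rng(6)
cov = [[2.0, 1.5], [1.5, 2.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=100)   # class 1
X2 = rng.multivariate_normal([3.0, 2.0], cov, size=100)   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)                  # scatter matrix of class 1
S2 = (X2 - m2).T @ (X2 - m2)                  # scatter matrix of class 2
S_w = S1 + S2                                 # total within-class scatter

w = np.linalg.solve(S_w, m1 - m2)             # Fisher direction (defined up to scale)
w /= np.linalg.norm(w)

# Projected 1-D class means, now well separated relative to the within-class spread
print((X1 @ w).mean(), (X2 @ w).mean())
```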

Linear discriminant advantages:
– Simple: O(d) space/computation
– Knowledge extraction: weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)

Non-linear models
– Quadratic discriminant: $g(x) = x^T W x + w^T x + w_0$
– Higher-order (product) terms, e.g. $z_1 = x_1$, $z_2 = x_2$, $z_3 = x_1^2$, $z_4 = x_2^2$, $z_5 = x_1 x_2$
– Map from x to z using nonlinear basis functions and use a linear discriminant in z-space
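A sketch of the "map x to z, then learn a linear discriminant in z-space" idea; the quadratic feature map and the least-squares fit are illustrative assumptions, not the course's prescribed classifier:

```python
# Quadratic feature map z(x), then a linear discriminant in z-space
# (fit here by plain least squares on +/-1 labels, an illustrative choice).
import numpy as np

def quadratic_features(X):
    """Map (x1, x2) -> (1, x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

rng = np.random.default_rng(7)
X = rng.uniform(-2.0, 2.0, size=(300, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.5, 1.0, -1.0)   # circular boundary: not linear in x

Z = quadratic_features(X)                     # z-space
w, *_ = np.linalg.lstsq(Z, y, rcond=None)     # linear discriminant in z-space
print(np.mean(np.sign(Z @ w) == y))           # accuracy of the linear-in-z classifier
```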

Linear model, two classes: $g(x) = w^T x + w_0$; choose $C_1$ if $g(x) > 0$ and $C_2$ otherwise.

Geometry of classification
– w is orthogonal to the decision surface; $w_0 = b$.
– D = distance of the decision surface from the origin. For any point x on the decision surface, $D = w^T x / \|w\| = -b / \|w\|$.
– d(x) = distance of x from the decision surface. Write $x = x_p + d(x)\, w/\|w\|$, where $x_p$ is the projection of x onto the surface. Then $w^T x + b = w^T x_p + d(x)\, w^T w/\|w\| + b$, i.e. $g(x) = (w^T x_p + b) + d(x)\|w\| = d(x)\|w\|$, so $d(x) = g(x)/\|w\| = w^T x/\|w\| - D$.
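A tiny numeric sketch of these formulas; the weight vector and points are arbitrary:

```python
# Numeric check of the geometry above: d(x) = g(x) / ||w||, D = -b / ||w||.
import numpy as np

w = np.array([3.0, 4.0])                      # normal to the decision surface
b = -5.0                                      # offset (w_0 = b)

def signed_distance(x, w, b):
    return (w @ x + b) / np.linalg.norm(w)    # d(x) = g(x) / ||w||

print(-b / np.linalg.norm(w))                            # D = 1.0
print(signed_distance(np.array([3.0, 4.0]), w, b))       # (9 + 16 - 5) / 5 = 4.0
print(signed_distance(np.array([0.6, 0.8]), w, b))       # point on the surface: 0.0
```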