Features
Dimensionality reduction:
- more features => more information, higher accuracy
- more features => more complex extraction
- more features => more complex classifier training
The curse of dimensionality
Solution: reduce the number of features

Feature selection: choose a subset of the original features
Feature reduction: transform the original feature set into a lower-dimensional one

Types of algorithms
Feature ranking (scoring individual features):
- identifies relevant features
- does not assess feature redundancy
Subset search (scoring subsets of features):
- identifies a minimal feature subset
- implicitly assesses feature redundancy
- 2^d possible subsets of d features

Feature selection: Filter
- separates feature selection from classifier learning
- relies on general characteristics of the data (information, distance, dependence, consistency)
- no bias toward any particular learning algorithm; fast

Feature ranking
Score each feature individually and select the best ones.
Advantages: efficiency, easy implementation
Disadvantages: hard to choose a suitable threshold; relationships between features are ignored

Selecting suitable features
Forward1: take the N features with the highest scores
Forward2:
1. select the feature with the highest score
2. recompute the scores of the remaining features
3. repeat until N features are selected

Feature selection
Backward1: remove the N features with the lowest scores from the feature set
Backward2:
1. remove the feature with the lowest score from the feature set
2. recompute the scores of the remaining features
3. repeat until N features are removed
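
A rough sketch of the Forward2 procedure above (Backward2 is symmetric: repeatedly drop the lowest-scoring feature and rescore). It assumes a user-supplied score(j, selected, X, y) function; that name and signature are illustrative, not prescribed by the slides.

```python
def forward2(X, y, score, n_select):
    """Greedily pick n_select features, rescoring the remaining features
    after every pick (so redundancy with already-chosen features can be
    taken into account). X is an (n_samples, n_features) NumPy array."""
    remaining = list(range(X.shape[1]))
    selected = []
    while len(selected) < n_select and remaining:
        # Re-evaluate each remaining feature given what is already selected.
        best = max(remaining, key=lambda j: score(j, selected, X, y))
        selected.append(best)
        remaining.remove(best)
    return selected
```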

Evaluation measures
Feature goodness measures
Filter:
- consistency
- between-class distance
- statistical dependence
- information-theoretic measures
Wrapper:
- predictive power of the selected feature set (recognition quality on test data)
- cross-validation

Consistency
A feature subset must separate the classes as consistently as the full feature set.
Inconsistency: objects with identical feature values belong to different classes.
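
One possible way to quantify this (my own sketch; the slide does not fix a formula): group records by their feature-value pattern and count how many records in each group disagree with that group's majority class.

```python
from collections import Counter, defaultdict

def inconsistency_count(X_rows, y):
    """Number of records that disagree with the majority class among records
    sharing the same feature-value pattern (0 = perfectly consistent)."""
    groups = defaultdict(list)
    for pattern, label in zip(X_rows, y):
        groups[tuple(pattern)].append(label)
    return sum(len(labels) - max(Counter(labels).values())
               for labels in groups.values())

# Example: two identical patterns with different classes -> inconsistency 1.
print(inconsistency_count([(1, 0), (1, 0), (0, 1)], ["A", "B", "A"]))
```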

Statistical dependence
Correlation coefficient
Dependence → statistical redundancy in the source data
Uncorrelatedness ≠ independence (the two coincide only when X and Y are jointly normally distributed)
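
A small numeric illustration (my own toy example, not from the slides) of why uncorrelatedness does not imply independence: Y = X^2 is completely determined by X, yet for a symmetric X the Pearson correlation coefficient is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)      # symmetric around 0
y = x ** 2                        # completely dependent on x

# Pearson correlation coefficient is ~0 even though y is a function of x.
r = np.corrcoef(x, y)[0, 1]
print(f"corr(X, X^2) = {r:.3f}")  # close to 0.0
```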

Information-theoretic measures
Binary entropy: H(p) = -p*log2(p) - (1-p)*log2(1-p)

Entropy (using log base 2)
X = College Major, Y = Likes "XBOX"

X        Y
-------  ---
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes

H(X) = 1.5
H(Y) = 1

Specific conditional entropy
(same College Major / XBOX table as above)
H(Y|X=v) = entropy of Y computed only over the records where X = v
H(Y|X=Math) = 1
H(Y|X=History) = 0
H(Y|X=CS) = 0

Conditional entropy
(same table as above)
H(Y|X) = average specific conditional entropy of Y = Σ_j P(X=v_j) H(Y|X=v_j)

v_j       P(X=v_j)   H(Y|X=v_j)
Math      0.5        1
History   0.25       0
CS        0.25       0

H(Y|X) = 0.5

Mutual information
(same table as above)
By how many bits do the requirements for transmitting Y drop if both the sender and the receiver already know X?
I(Y;X) = H(Y) - H(Y|X)
H(Y) = 1, H(Y|X) = 0.5, so I(Y;X) = 0.5
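
A short sketch that reproduces the numbers from the College Major / XBOX table above: entropy, conditional entropy, and mutual information, all in bits.

```python
from collections import Counter
from math import log2

data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(values):
    """Shannon entropy in bits of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

X = [x for x, _ in data]
Y = [y for _, y in data]

H_X = entropy(X)                                   # 1.5
H_Y = entropy(Y)                                   # 1.0
# Conditional entropy H(Y|X) = sum over v of P(X=v) * H(Y | X=v)
H_Y_given_X = sum(
    (X.count(v) / len(X)) * entropy([y for x, y in data if x == v])
    for v in set(X)
)                                                  # 0.5
I_XY = H_Y - H_Y_given_X                           # 0.5 (mutual information)
print(H_X, H_Y, H_Y_given_X, I_XY)
```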

Feature selection: Wrapper
- relies on a predetermined classification algorithm
- uses predictive accuracy as the goodness measure
- high accuracy, but computationally expensive

Wrapper
The learning algorithm is treated as a black box that computes the objective function OF(s).
Search strategies:
- exhaustive search: 2^d possible subsets
- greedy search: common and effective
- genetic algorithms
- ...

Searching for the optimal subset
Forward selection:
  initialize s = {}
  do: add the feature to s that improves OF(s) most
  while OF(s) can be improved
Backward elimination:
  initialize s = {1, 2, ..., n}
  do: remove the feature from s that improves OF(s) most
  while OF(s) can be improved
Backward elimination tends to find better models, but fitting the large sets at the beginning of the search is too expensive.
Both can be too greedy.
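
A compact sketch of both greedy wrapper searches, assuming an externally supplied objective function OF(s), for example the cross-validated accuracy of the chosen classifier restricted to the feature subset s (the name OF mirrors the slides; everything else is illustrative).

```python
def forward_selection(n_features, OF):
    """Greedy forward selection driven by an objective function OF(subset)."""
    s = set()
    current = OF(s)
    while True:
        candidates = [(OF(s | {j}), j) for j in range(n_features) if j not in s]
        if not candidates:
            break
        best_val, best_j = max(candidates)
        if best_val <= current:       # no further improvement possible
            break
        s.add(best_j)
        current = best_val
    return s

def backward_elimination(n_features, OF):
    """Greedy backward elimination: start from all features, drop while OF improves."""
    s = set(range(n_features))
    current = OF(s)
    while True:
        candidates = [(OF(s - {j}), j) for j in s]
        if not candidates:
            break
        best_val, best_j = max(candidates)
        if best_val <= current:
            break
        s.remove(best_j)
        current = best_val
    return s
```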

Evaluating a subset
We are not ultimately interested in training error; we are interested in test error (error on new data).
We can estimate test error by pretending we have not seen some of our data: keep some data aside as a validation set.
If we do not use it in training, then it is a fair test of our model.

K-fold cross-validation
- split the data into K groups
- use each group in turn as the validation (test) set and learn on the remaining groups
- report the average error as OF
(The original slides illustrated this with seven folds X1 ... X7, highlighting a different test fold on each slide.)
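
A minimal K-fold cross-validation sketch in plain NumPy, assuming user-supplied fit(X, y) and predict(model, X) callables (these names are illustrative); K = 7 matches the seven folds in the original figure.

```python
import numpy as np

def k_fold_cv_error(X, y, fit, predict, k=7):
    """Average validation error over k folds."""
    n = len(y)
    idx = np.random.default_rng(0).permutation(n)   # shuffle, then split
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])             # learn on k-1 folds
        errors.append(np.mean(predict(model, X[val]) != y[val]))
    return float(np.mean(errors))
```

The returned average error can serve directly as the objective function OF(s) of the wrapper search above, evaluated on the data restricted to the feature subset s.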

Feature Reduction Algorithms
Unsupervised (minimize the information loss):
- Latent Semantic Indexing (LSI): truncated SVD
- Independent Component Analysis (ICA)
- Principal Component Analysis (PCA)
- Manifold learning algorithms (nonlinear; a manifold is a topological space which is locally Euclidean)
Supervised (maximize the class discrimination):
- Linear Discriminant Analysis (LDA)
- Canonical Correlation Analysis (CCA)
- Partial Least Squares (PLS)

Principal Component Analysis (PCA)
- Karhunen-Loeve or K-L method
- PCA finds the best "subspace" that captures as much data variance as possible
- based on the eigen-decomposition of the data covariance matrix
- very simple!
- data can be represented as a linear combination of features

PCA rotates the coordinate system so that the first axis points in the direction of greatest variance and each further axis is orthogonal to the previous ones, pointing in the direction of greatest remaining variance.
The result is a new orthonormal basis.
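
A small NumPy sketch of this eigen-decomposition view of PCA: center the data, form the covariance matrix, and sort the eigenvectors by decreasing eigenvalue to obtain the new orthonormal basis.

```python
import numpy as np

def pca_eig(X, n_components):
    """PCA via eigen-decomposition of the covariance matrix.
    X is an (n_samples, d) array."""
    Xc = X - X.mean(axis=0)                     # center the data
    cov = np.cov(Xc, rowvar=False)              # (d x d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1]           # largest variance first
    W = eigvecs[:, order[:n_components]]        # new orthonormal basis (d x k)
    return Xc @ W, W                            # projected data and basis
```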

PCA is very nice when the initial dimension is not too big.
What about very high-dimensional data, e.g. images (d ~ 10^4)?
Problem: the covariance matrix Σ has size d x d; for d = 10^4 that is 10^8 entries.
Singular Value Decomposition (SVD) to the rescue!

SVD
Singular values and singular vectors of a matrix, for real matrices:
A = U S V^T, where U and V are orthogonal and S is diagonal with non-negative singular values on its diagonal.

Relationship between PCA and SVD: SVD can be used instead of PCA.
For the centered data matrix X (n x d) we have X = U S V^T, so the sample covariance (1/(n-1)) X^T X = V (S^2/(n-1)) V^T; the right singular vectors of X are therefore the principal directions.
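
A sketch of PCA computed through the SVD of the centered data matrix; the right singular vectors play the role of the covariance eigenvectors, so the d x d covariance matrix is never formed.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA computed through the SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    # Xc = U @ diag(S) @ Vt; the rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T                 # same directions as eigvecs of cov
    explained_var = (S[:n_components] ** 2) / (len(X) - 1)
    return Xc @ W, W, explained_var
```

For n much smaller than d (e.g. images), this avoids ever materializing the 10^8-entry covariance matrix mentioned above.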

ICA (Independent Components Analysis)
Relaxes the constraint of orthogonality but keeps the linearity; thus it can be more flexible than PCA in finding patterns.

PCA is not always an optimal dimensionality-reduction procedure for classification purposes.
PCA is based on the sample covariance, which characterizes the scatter of the entire data set irrespective of class membership.
The projection axes chosen by PCA might not provide good discrimination power.

Linear Discriminant Analysis (LDA)
What is the goal of LDA?
- perform dimensionality reduction "while preserving as much of the class discriminatory information as possible"
- seek directions along which the classes are best separated
- take into consideration the scatter within classes as well as the scatter between classes

Fisher's linear discriminant analysis
- a supervised method
- uses information about the classification classes (class labels)

Variance of the projected features
Fisher's criterion compares the between-class scatter of the projected data with its within-class scatter, and the projection direction is chosen to maximize their ratio.
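
For the two-class case, the direction maximizing Fisher's criterion has the standard closed form w proportional to Sw^{-1}(m1 - m0); the slides' own equations did not survive the transcript, so this sketch only illustrates that standard result.

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Two-class Fisher discriminant direction w ~ Sw^{-1} (m1 - m0).
    X is (n_samples, d); y holds labels 0 and 1."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)        # direction of best separation
    return w / np.linalg.norm(w)
```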

PCA is first applied to the data set to reduce its dimensionality.
LDA is then applied in the reduced space to find the most discriminative directions.

Case Study: PCA versus LDA
A. Martinez, A. Kak, "PCA versus LDA", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, 2001.
Is LDA always better than PCA?
There has been a tendency in the computer vision community to prefer LDA over PCA, mainly because LDA deals directly with discrimination between classes while PCA does not pay attention to the underlying class structure.
Main results of this study:
(1) When the training set is small, PCA can outperform LDA.
(2) When the number of samples is large and representative for each class, LDA outperforms PCA.

Disadvantages of LDA?
- LDA is a parametric method, since it assumes unimodal Gaussian likelihoods.
- If the distributions are significantly non-Gaussian, the LDA projections will not be able to preserve any complex structure of the data that may be needed for classification.
- LDA will fail when the discriminatory information is not in the mean but rather in the variance.

Deficiencies of Linear Methods
Data may not be best summarized by a linear combination of features.