1
Object Oriented Data Analysis, Last Time
Classical Discrimination (aka Classification)
– FLD & GLR very attractive
– MD never better, sometimes worse
HDLSS Discrimination
– FLD & GLR fall apart
– MD much better
Maximal Data Piling
– HDLSS space is a strange place
2
Kernel Embedding
Aizerman, Braverman and Rozonoer (1964)
Motivating idea: extend the scope of linear discrimination by adding nonlinear components to the data (embedding in a higher dimensional space)
Better use of the name: nonlinear discrimination?
3
Kernel Embedding
Stronger effects for higher order polynomial embedding: e.g. for cubic embedding, linear separation can give 4 parts (or fewer)
4
Kernel Embedding
General view: for the original data matrix, add rows (i.e. embed in a higher dimensional space), then slice with a hyperplane.
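Since the slide's displayed matrices were images that did not survive extraction, here is a minimal reconstruction of the idea for a quadratic embedding (the notation x_1, ..., x_n and the choice of degree are illustrative):

```latex
% Columns of the original data matrix are the data vectors x_1, ..., x_n.
% Embedding adds rows, e.g. componentwise squares for a quadratic embedding:
\[
X = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix}
\;\longmapsto\;
X_{\mathrm{emb}} = \begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
x_1^2 & x_2^2 & \cdots & x_n^2
\end{pmatrix}
\]
% A hyperplane slice in the higher dimensional (embedded) space maps back
% to a curved (here quadratic) class boundary in the original space.
```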
5
Kernel Embedding
Embedded Fisher Linear Discrimination: choose Class 1 when the FLD rule, applied in the embedded space, favors it
– the image of the class boundaries in the original space is nonlinear
– allows more complicated class regions
Can also do Gaussian Likelihood Ratio (or others)
Compute the image by classifying points from the original space
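The decision rule itself was a formula image on the slide; the standard two-class Fisher rule in the embedded space, with notation chosen here for illustration, reads:

```latex
% Embedded Fisher Linear Discrimination: with embedded observations
% z = \Phi(x), embedded class means \bar{z}^{(1)}, \bar{z}^{(2)} and pooled
% within-class covariance \hat{\Sigma}_w, choose Class 1 when
\[
\bigl( z - \tfrac{1}{2}(\bar{z}^{(1)} + \bar{z}^{(2)}) \bigr)^{\top}
\hat{\Sigma}_w^{-1} \bigl( \bar{z}^{(1)} - \bar{z}^{(2)} \bigr) \;>\; 0 .
\]
% The preimage of this hyperplane boundary is a nonlinear curve (or surface)
% in the original data space.
```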
6
Kernel Embedding
Visualization for toy examples: have linear discrimination in the embedded space; study its effect in the original data space via the implied nonlinear regions.
Approach:
– use a test set in the original space (dense, equally spaced grid)
– apply the embedded discrimination rule
– color using the result (see the sketch below)
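A minimal sketch of this visualization recipe (polynomial embedding followed by Fisher Linear Discrimination on a dense grid; the two-cloud toy data, degree, and grid resolution here are illustrative assumptions, not the slides' exact settings):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Toy "parallel clouds"-style data: two Gaussian classes in 2-d.
n = 100
X = np.vstack([rng.normal([-1, 0], 0.5, (n, 2)),
               rng.normal([+1, 0], 0.5, (n, 2))])
y = np.repeat([0, 1], n)

# Embedded FLD: cubic polynomial embedding, then Fisher Linear Discrimination.
clf = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                    LinearDiscriminantAnalysis())
clf.fit(X, y)

# Dense, equally spaced grid in the ORIGINAL space ("test set"),
# classified with the embedded rule and colored by the result.
xx, yy = np.meshgrid(np.linspace(-3, 3, 300), np.linspace(-3, 3, 300))
grid = np.column_stack([xx.ravel(), yy.ravel()])
regions = clf.predict(grid).reshape(xx.shape)

plt.contourf(xx, yy, regions, alpha=0.3)   # implied nonlinear class regions
plt.scatter(X[:, 0], X[:, 1], c=y, s=10)   # original data on top
plt.title("Implied class regions: cubic embedding + FLD")
plt.show()
```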
7
Kernel Embedding Polynomial Embedding, Toy Example 1: Parallel Clouds
8
Kernel Embedding
Polynomial Embedding, Toy Example 1: Parallel Clouds
– PC1: always bad (finds only the embedded direction of greatest variance)
– FLD: stays good
– GLR: OK discrimination at the data, but overfitting problems
9
Kernel Embedding Polynomial Embedding, Toy Example 2: Split X
10
Kernel Embedding
Polynomial Embedding, Toy Example 2: Split X
– FLD: rapidly improves with higher degree
– GLR: always good, but never an ellipse around the blues …
11
Kernel Embedding Polynomial Embedding, Toy Example 3: Donut
12
Kernel Embedding
Polynomial Embedding, Toy Example 3: Donut
– FLD: poor fit for low degree, then good; no overfitting
– GLR: best with no embedding; square shape from overfitting?
13
Kernel Embedding
Drawbacks to polynomial embedding:
– too many extra terms create spurious structure, i.e. "overfitting"
– HDLSS problems typically get worse
14
Kernel Embedding
Hot topic variation: "Kernel Machines"
Idea: replace polynomials by other nonlinear functions
– e.g. 1: sigmoid functions from neural nets
– e.g. 2: radial basis functions (Gaussian kernels)
Related to kernel density estimation (recall: smoothed histogram)
15
Kernel Embedding
Radial Basis Functions
Note: there are several ways to embed:
– Naïve embedding (equally spaced grid)
– Explicit embedding (evaluate at data)
– Implicit embedding (inner product based)
(everybody currently does the latter)
16
Kernel Embedding
Naïve embedding, radial basis functions:
– at some "grid points", for a "bandwidth" (i.e. standard deviation), consider the (grid-size dimensional) radial basis functions
– replace the data matrix with their evaluations at the data
17
Kernel Embedding
Naïve embedding, radial basis functions, for discrimination: work in the radial basis space, with each new data vector represented by its vector of radial basis function evaluations.
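The displayed formulas for the naïve radial basis embedding were lost in extraction; a reconstruction consistent with the text (the grid-point notation, bandwidth symbol, and Gaussian form are assumptions) is:

```latex
% Grid points g_1, ..., g_k and bandwidth (standard deviation) \sigma.
% Each vector x (data column or new observation) is represented by the
% k-dimensional vector of radial basis function evaluations:
\[
\Phi(x) \;=\;
\bigl( \varphi_\sigma(x - g_1),\; \varphi_\sigma(x - g_2),\; \ldots,\;
       \varphi_\sigma(x - g_k) \bigr)^{\top},
\qquad
\varphi_\sigma(u) \;=\; \exp\!\left( -\frac{\|u\|^2}{2\sigma^2} \right).
\]
% The data matrix columns x_1, ..., x_n are replaced by \Phi(x_1), ..., \Phi(x_n),
% and discrimination (e.g. FLD) is carried out on these new columns.
```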
18
Kernel Embedding
Naïve embedding, Toy Example 1 (Parallel Clouds): good at the data, poor outside
19
Kernel Embedding
Naïve embedding, Toy Example 2 (Split X): OK at the data, strange outside
20
Kernel Embedding
Naïve embedding, Toy Example 3 (Donut): mostly good, slight mistake for one kernel
21
Kernel Embedding
Naïve embedding, radial basis functions, toy examples, main lessons:
– generally good in regions with data
– unpredictable where data are sparse
22
Kernel Embedding
Toy Example 4: Checkerboard. Very challenging!
Linear method? Polynomial embedding?
23
Kernel Embedding
Toy Example 4: Checkerboard, Polynomial Embedding:
– very poor for linear
– slightly better for higher degrees
– overall very poor: polynomials don't have the needed flexibility
24
Kernel Embedding
Toy Example 4: Checkerboard. Radial basis embedding + FLD is excellent!
25
Kernel Embedding
Drawbacks to naïve embedding: an equally spaced grid is too big in high dimension, not computationally tractable (g^d grid points).
Approach: evaluate only at the data points, not on a full grid, but where the data live. (A sketch follows below.)
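A minimal sketch of the "evaluate only at data points" idea (explicit embedding with Gaussian kernels centered at the training data; the function name, bandwidth value, and toy sizes are illustrative assumptions):

```python
import numpy as np

def explicit_rbf_embedding(X_new, X_train, sigma=1.0):
    """Explicit embedding: one Gaussian radial basis feature per TRAINING point.

    Instead of a full grid with g**d points, the feature dimension is only n
    (the number of training points), so it stays tractable in high d.
    """
    # Pairwise squared distances between rows of X_new and rows of X_train.
    sq_dists = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))   # shape (n_new, n_train)

# Example: embed the training data itself, then feed the n x n feature
# matrix to any linear discriminator (FLD, linear SVM, ...).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 100))                # HDLSS-style: d = 100 > n = 50
features = explicit_rbf_embedding(X_train, X_train)
print(features.shape)                                # (50, 50), not g**100
```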
26
Kernel Embedding
Other types of embedding: explicit and implicit. Will be studied soon, after an introduction to Support Vector Machines …
27
Kernel Embedding
There are generalizations of this idea to other types of analysis, and some clever computational ideas, e.g. "kernel based, nonlinear Principal Components Analysis".
Ref: Schölkopf, Smola and Müller (1998)
28
Support Vector Machines
Motivation: find a linear method that "works well" for embedded data.
Note: embedded data are very non-Gaussian, which suggests the value of a really new approach.
29
Support Vector Machines
Classical references: Vapnik (1982); Boser, Guyon & Vapnik (1992); Vapnik (1995)
Excellent web resource: http://www.kernel-machines.org/
30
Support Vector Machines
Recommended tutorial: Burges (1998)
Recommended monographs: Cristianini & Shawe-Taylor (2000); Schölkopf & Smola (2002)
31
Support Vector Machines
Graphical view, using a toy example:
– find the separating plane that maximizes the distances from the data to the plane, in particular the smallest such distance
– the closest data points are called support vectors
– the gap between them is called the margin
32
SVMs, Optimization Viewpoint
Formulate an optimization problem based on:
– data (feature) vectors and class labels
– normal vector and location (determines the intercept)
– residuals (right side) and residuals (wrong side)
Solve (a convex problem) by quadratic programming
33
SVMs, Optimization Viewpoint
Lagrange multipliers: minimize the primal formulation (separable case), where the α_i are Lagrange multipliers; maximize the dual Lagrangian version; then get the classification function. (The slide's displays are reconstructed below.)
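The slide's formula displays were images; the standard separable-case Lagrangian, its dual, and the resulting classifier (a reconstruction with generic notation, not necessarily the slide's exact layout) are:

```latex
% Primal Lagrangian (separable case), Lagrange multipliers \alpha_i \ge 0:
\[
\min_{w,\,b} \;\max_{\alpha \ge 0}\;
\tfrac{1}{2}\|w\|^2 \;-\; \sum_{i=1}^{n} \alpha_i
\bigl[\, y_i ( x_i^{\top} w + b ) - 1 \,\bigr]
\]
% Dual Lagrangian version (depends on the data only through inner products):
\[
\max_{\alpha \ge 0,\ \sum_i \alpha_i y_i = 0}\;
\sum_{i=1}^{n} \alpha_i \;-\;
\tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
\alpha_i \alpha_j \, y_i y_j \, x_i^{\top} x_j
\]
% Classification function for a new vector x:
\[
f(x) \;=\; \operatorname{sign}\Bigl( \textstyle\sum_{i=1}^{n} \alpha_i y_i \, x_i^{\top} x + b \Bigr)
\]
```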
34
SVMs, Computation
Major computational point: the classifier depends on the data only through inner products!
– thus it is enough to store only the inner products
– creates big savings in optimization, especially for HDLSS data
– but also creates variations in kernel embedding (interpretation?!?)
– this is almost always done in practice
35
SVMs, Computation & Embedding
For an "embedding map", explicit embedding: maximize the dual with the embedded vectors plugged in, and get the corresponding classification function.
– straightforward application of embedding
– but loses the inner product advantage
36
SVMs, Computation & Embedding
Implicit embedding: maximize the dual with inner products replaced by kernel evaluations, and get the corresponding classification function (sketched below).
– still defined only via inner products
– retains the optimization advantage, thus used very commonly
– comparison to explicit embedding? Which is "better"???
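A sketch of the implicit version: every inner product in the dual and in the classifier is replaced by a kernel evaluation (standard kernel trick notation, assumed here since the slide's formulas are missing):

```latex
% Implicit embedding: replace x_i^T x_j by K(x_i, x_j) = <\Phi(x_i), \Phi(x_j)>.
\[
\max_{\alpha \ge 0,\ \sum_i \alpha_i y_i = 0}\;
\sum_{i=1}^{n} \alpha_i \;-\;
\tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\qquad
f(x) \;=\; \operatorname{sign}\Bigl( \textstyle\sum_{i=1}^{n} \alpha_i y_i \, K(x_i, x) + b \Bigr)
\]
% The embedding map \Phi never needs to be computed explicitly, only the
% kernel K, which preserves the inner-product-only optimization.
```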
37
SVMs & Robustness
Usually not severely affected by outliers, but a possible weakness: can have very influential points.
Toy example: only 2 points drive the SVM.
Notes:
– huge range of chosen hyperplanes
– but all are "pretty good discriminators"
– only happens when the whole range is OK??? Good or bad?
38
SVMs & Robustness
Effect of violators (toy example): depends on the distance to the plane
– weak for violators nearby, strong as they move away
– can have a major impact on the plane
– also depends on the tuning parameter C
39
SVMs, Computation
Caution: available algorithms are not created equal.
Toy example: Gunn's Matlab code vs. Todd's Matlab code.
Serious errors in Gunn's version: it does not find the real optimum …
40
SVMs, Tuning Parameter
Recall the regularization parameter C: controls the penalty for violation (i.e. lying on the wrong side of the plane), appears in the slack variables, and affects the performance of the SVM.
Toy example: d = 50, spherical Gaussian data
41
SVMs, Tuning Parameter
Toy example: d = 50, spherical Gaussian data; x-axis: optimal direction, other axis: SVM direction.
Small C:
– where is the margin?
– small angle to optimal (generalizable)
Large C:
– more data piling
– larger angle (less generalizable)
– bigger gap (but maybe not better???)
In between: very small range (see the sketch below)
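A minimal sketch of this kind of experiment (linear SVM on d = 50 spherical Gaussian data, scanning C and measuring the angle between the SVM direction and the known optimal direction; the sample sizes, mean shift, and C grid are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC   # or SVC(kernel="linear")

rng = np.random.default_rng(0)
d, n = 50, 25                       # HDLSS-style: dimension exceeds per-class n
mu = np.zeros(d); mu[0] = 2.0       # classes differ only in the first coordinate
X = np.vstack([rng.normal(-mu, 1.0, (n, d)), rng.normal(+mu, 1.0, (n, d))])
y = np.repeat([-1, 1], n)

opt_dir = mu / np.linalg.norm(mu)   # known optimal (Bayes) direction

for C in [1e-3, 1e-1, 1e1, 1e3]:
    w = LinearSVC(C=C, max_iter=100000).fit(X, y).coef_.ravel()
    w /= np.linalg.norm(w)
    angle = np.degrees(np.arccos(np.clip(abs(w @ opt_dir), -1, 1)))
    print(f"C = {C:g}: angle to optimal direction = {angle:.1f} degrees")
```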
42
SVMs, Tuning Parameter
Toy example: d = 50, spherical Gaussian data. Careful look at small C: put MD on the horizontal axis.
– shows SVM and MD are the same for small C (mathematics behind this?)
– they separate for large C
– no data piling for MD
43
Distance Weighted Discrimination
An improvement of SVM for HDLSS data.
Toy example (similar to the earlier movie)
44
Distance Weighted Discrimination
Toy example: Maximal Data Piling direction
– perfect separation
– gross overfitting
– large angle
– poor generalizability
45
Distance Weighted Discrimination
Toy example: Support Vector Machine direction
– bigger gap, smaller angle, better generalizability
– feels the support vectors too strongly???
– ugly subpopulations? Improvement?
46
Distance Weighted Discrimination
Toy example: Distance Weighted Discrimination direction
– addresses these issues: smaller angle, better generalizability, nice subpopulations
– replaces the minimum distance by the average distance
47
Distance Weighted Discrimination
Based on an optimization problem; more precisely, works in an appropriate penalty for violations.
Optimization method: Second Order Cone Programming
– "still convex", a generalization of quadratic programming
– allows fast solution
– can use available fast software (SDPT3, Michael Todd et al.)
(A reconstruction of the optimization problem follows.)
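For reference, the usual DWD optimization problem (in the spirit of Marron, Todd & Ahn's formulation; reconstructed here since the slide's display is missing, so treat the exact penalty notation as an assumption):

```latex
% Distance Weighted Discrimination: instead of maximizing the minimum
% distance (SVM margin), minimize the sum of reciprocal distances,
% with slack variables \xi_i and penalty C for violations:
\[
\min_{w,\,b,\,\xi} \;\; \sum_{i=1}^{n} \frac{1}{r_i} \;+\; C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
r_i = y_i ( x_i^{\top} w + b ) + \xi_i \;\ge\; 0, \quad
\xi_i \ge 0, \quad \|w\| \le 1 .
\]
% This is a second order cone program: still convex, solvable with fast
% interior point software.
```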
48
Distance Weighted Discrimination
d = 2 visualization: pushes the plane away from the data; all points have some influence.
49
DWD Batch and Source Adjustment
Recall from the class meeting of 9/6/05: for Perou's Stanford Breast Cancer Data; analysis in Benito et al. (2004), Bioinformatics, https://genome.unc.edu/pubsup/dwd/
Use DWD as a useful direction vector to:
– adjust for source effects (different sources of mRNA)
– adjust for batch effects (arrays fabricated at different times)
(A sketch of the adjustment step follows.)
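A minimal sketch of the adjustment step itself (find a direction separating the batches, then rigidly shift each batch along that direction so the projected means agree; a linear SVM direction stands in for DWD purely for illustration, since a DWD fit needs an SOCP solver not shown here):

```python
import numpy as np
from sklearn.svm import LinearSVC

def batch_adjust(X, batch_labels):
    """Shift each batch along a separating direction so projected means match.

    X: (n_samples, n_genes); batch_labels: array of 0/1 batch membership.
    A linear SVM direction is used below as a stand-in for the DWD direction.
    """
    w = LinearSVC(C=1.0, max_iter=100000).fit(X, batch_labels).coef_.ravel()
    w /= np.linalg.norm(w)            # unit direction vector
    proj = X @ w                      # projection of each array onto w
    X_adj = X.copy()
    for b in np.unique(batch_labels):
        idx = batch_labels == b
        # Rigid shift: remove this batch's mean projection along w.
        X_adj[idx] = X_adj[idx] - proj[idx].mean() * w
    return X_adj
```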
50
DWD Adjustment: Biological Class Colors & Symbols
51
DWD Adjustment: Source Colors
52
DWD Adjustment: Source Adjusted, PCA View
53
DWD Adjustment: Source Adjusted, Class Colored
54
DWD Adjustment: Source & Batch Adjusted, Adjusted PCA
55
Why not adjust using SVM?
Major problem: the projected distributional shape. Triangular distributions (oppositely skewed) do not allow a sensible rigid shift.
56
Why not adjust using SVM?
Nicely fixed by DWD: the projected distributions are near Gaussian, so it is sensible to shift.
57
Why not adjust by means?
DWD is complicated: value added? Xuxin Liu example…
– the key is the sizes of the biological subtypes
– a differing ratio trips up the mean
– but DWD is more robust (although still not perfect)
58
Twiddle the ratios of subtypes (link to movie)
59
DWD in Face Recognition, I
Face images as data (with M. Benito & D. Peña)
– registered using landmarks
– male vs. female difference? Discrimination rule?
60
DWD in Face Recognition, II
DWD direction: good separation; the images "make sense"; garbage at the ends? (extrapolation effects?)
61
DWD in Face Recognition, III
Interesting summary: jump between the means (in the DWD direction); clear separation of maleness vs. femaleness.
62
DWD in Face Recognition, IV
Fun comparison: jump between the means (in the SVM direction); also distinguishes maleness vs. femaleness, but not as well as DWD.
63
DWD in Face Recognition, V
Analysis of the difference: project onto the normals.
– SVM has a "small gap" (feels noise artifacts?)
– DWD is "more informative" (feels real structure?)
64
DWD in Face Recognition, VI
Current work: focus on "drivers" (regions of interest)
– relation to discrimination? Which is "best"?
– lessons for human perception?
66
Fix links on face movies
67
Next Topics
– DWD outcomes, from SAMSI below
– DWD simulations, from SAMSI below
– Windup from FDA04-22-02.doc: general conclusion, validation
– Also SVMoverviewSAMSI09-06-03.doc
68
Multi-Class SVMs
Lee, Y., Lin, Y. and Wahba, G. (2002) "Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data", U. Wisconsin TR 1064.
So far we only have the "implicit" version; the "direction based" variation is unknown.