
1 Object Oriented Data Analysis, Last Time. Classical Discrimination (aka Classification): FLD (Fisher Linear Discrimination) & GLR (Gaussian Likelihood Ratio) very attractive; MD (Mean Difference) never better, sometimes worse. HDLSS (High Dimension, Low Sample Size) Discrimination: FLD & GLR fall apart; MD much better. Maximal Data Piling: HDLSS space is a strange place.

2 Kernel Embedding. Aizerman, Braverman and Rozonoer (1964). Motivating idea: extend the scope of linear discrimination by adding nonlinear components to the data (embedding in a higher-dimensional space). Better use of name: nonlinear discrimination?

3 Kernel Embedding. Stronger effects for higher-order polynomial embedding: e.g. for cubic, linear separation can give 4 parts (or fewer).

4 Kernel Embedding. General view: for the original data matrix, add rows (nonlinear functions of the original entries, such as squares and cross products), i.e. embed in a higher-dimensional space, then slice with a hyperplane.

5 Kernel Embedding. Embedded Fisher Linear Discrimination: choose Class 1 for any new point when the FLD rule holds in the embedded space. The image of the class boundaries in the original space is nonlinear, which allows more complicated class regions. Can also do Gaussian Likelihood Ratio (or others). Compute the image by classifying points from the original space.

6 Kernel Embedding. Visualization for toy examples: have a linear discrimination rule in the embedded space; study its effect in the original data space via the implied nonlinear regions. Approach: use a test set in the original space (dense, equally spaced grid), apply the embedded discrimination rule, and color using the result.
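
The visualization approach above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code behind the slides: the helper names (`poly_embed`, `fld_fit`, `fld_predict`), the donut-like toy data, the degree-2 embedding and the grid range are all made up for the example, and cases are stored as rows rather than columns.

```python
import numpy as np

def poly_embed(X, degree=2):
    """Map 2-d points (n x 2 array) to all monomials x1^a * x2^b with 1 <= a + b <= degree."""
    cols = [X[:, 0] ** a * X[:, 1] ** (p - a)
            for p in range(1, degree + 1) for a in range(p + 1)]
    return np.column_stack(cols)

def fld_fit(Z1, Z2):
    """Fisher Linear Discrimination on embedded data: returns (direction, midpoint)."""
    m1, m2 = Z1.mean(axis=0), Z2.mean(axis=0)
    R = np.vstack([Z1 - m1, Z2 - m2])
    Sw = R.T @ R / (len(R) - 2)            # pooled within-class covariance
    w = np.linalg.pinv(Sw) @ (m1 - m2)     # pseudo-inverse guards against a singular Sw
    return w, (m1 + m2) / 2

def fld_predict(Z, w, mid):
    """Return 1 where the linear rule chooses Class 1, else 2."""
    return np.where((Z - mid) @ w >= 0, 1, 2)

rng = np.random.default_rng(0)
# Assumed toy data: Class 1 near the origin, Class 2 on a surrounding ring.
X1 = rng.normal(scale=0.3, size=(100, 2))
theta = rng.uniform(0, 2 * np.pi, 100)
X2 = (np.column_stack([2 * np.cos(theta), 2 * np.sin(theta)])
      + rng.normal(scale=0.2, size=(100, 2)))

# Linear rule in the embedded space.
w, mid = fld_fit(poly_embed(X1), poly_embed(X2))

# Dense, equally spaced test grid in the ORIGINAL space, classified by the embedded rule.
g = np.linspace(-3, 3, 61)
G = np.array([[a, b] for a in g for b in g])
labels = fld_predict(poly_embed(G), w, mid).reshape(61, 61)

# Training accuracy of the embedded rule on the toy data.
acc = np.mean(np.concatenate([fld_predict(poly_embed(X1), w, mid) == 1,
                              fld_predict(poly_embed(X2), w, mid) == 2]))
```

Coloring the grid by `labels` (for example with an image plot) shows the nonlinear class regions that the linear rule implies in the original space.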

7 Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds.

8 Kernel Embedding. Polynomial Embedding, Toy Example 1: Parallel Clouds. PC 1: always bad (finds “embedded greatest variation” only). FLD: stays good. GLR: OK discrimination at the data, but overfitting problems.

9 Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X.

10 Kernel Embedding. Polynomial Embedding, Toy Example 2: Split X. FLD: rapidly improves with higher degree. GLR: always good, but never an ellipse around the blues …

11 Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut.

12 Kernel Embedding. Polynomial Embedding, Toy Example 3: Donut. FLD: poor fit for low degree, then good, with no overfit. GLR: best with no embedding; square shape from overfitting?

13 Kernel Embedding. Drawbacks to polynomial embedding: too many extra terms create spurious structure, i.e. have “overfitting”; HDLSS problems typically get worse.

14 Kernel Embedding. Hot topic variation: “Kernel Machines”. Idea: replace polynomials by other nonlinear functions, e.g. 1: sigmoid functions from neural nets; e.g. 2: radial basis functions (Gaussian kernels). Related to “kernel density estimation” (recall: smoothed histogram).

15 Kernel Embedding. Radial Basis Functions. Note: there are several ways to embed: Naïve Embedding (equally spaced grid); Explicit Embedding (evaluate at data); Implicit Embedding (inner product based). (Everybody currently does the last of these.)

16 Kernel Embedding. Naïve Embedding, Radial Basis Functions: at some “grid points”, and for a “bandwidth” (i.e. standard deviation), consider the (d-dimensional) radial basis functions centered at the grid points. Replace the data matrix with the matrix of these functions evaluated at each data vector.

17 Kernel Embedding. Naïve Embedding, Radial Basis Functions: for discrimination, work in radial basis space, with a new data vector represented by the vector of its radial basis function values.
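
The naïve embedding described above can be sketched as follows. The function name `rbf_embed`, the 5 x 5 grid, the bandwidth and the sample points are assumptions made for illustration, and the bumps are left unnormalized Gaussians; here cases are rows, so the embedded matrix is n x k rather than k x n.

```python
import numpy as np

def rbf_embed(X, centers, sigma):
    """Evaluate a Gaussian bump centered at each row of `centers` at each row of X.

    Returns an array of shape (n_points, n_centers)."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2 * sigma ** 2))

# Equally spaced grid points in 2-d (assumed range and spacing).
g = np.linspace(-2, 2, 5)
centers = np.array([[a, b] for a in g for b in g])   # 25 grid points

# A few data vectors and their radial basis representation.
X = np.array([[0.0, 0.0], [1.0, -1.0], [0.3, 0.7]])
Phi = rbf_embed(X, centers, sigma=1.0)               # replaces the data matrix

# A new data vector is represented the same way before applying the linear rule.
x0 = np.array([[0.5, 0.5]])
phi0 = rbf_embed(x0, centers, sigma=1.0)
```

Any linear rule (FLD, for instance) can then be fitted to `Phi` and applied to `phi0`.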

18 Kernel Embedding. Naïve Embedding, Toy Example 1: Parallel Clouds. Good at the data, poor outside.

19 Kernel Embedding. Naïve Embedding, Toy Example 2: Split X. OK at the data, strange outside.

20 Kernel Embedding. Naïve Embedding, Toy Example 3: Donut. Mostly good; slight mistake for one kernel.

21 Kernel Embedding. Naïve Embedding, Radial Basis Functions, toy examples, main lessons: generally good in regions with data, unpredictable where data are sparse.

22 Kernel Embedding. Toy Example 4: Checkerboard. Very challenging! Linear method? Polynomial embedding?

23 Kernel Embedding. Toy Example 4: Checkerboard. Polynomial Embedding: very poor for linear, slightly better for higher degrees, overall very poor. Polynomials don't have the needed flexibility.

24 Kernel Embedding. Toy Example 4: Checkerboard. Radial Basis Embedding + FLD is excellent!

25 Kernel Embedding. Drawbacks to naïve embedding: the equally spaced grid is too big in high d, so it is not computationally tractable (g^d grid points for g points per axis). Approach: evaluate only at the data points, not on the full grid, but where the data live.
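
To make the size argument concrete, the sketch below counts grid points for a few dimensions and then evaluates Gaussian radial basis functions only at the data points. The choices g = 10, n = 20, d = 50 and the bandwidth are assumptions for illustration, as is the helper name `rbf_at_data`.

```python
import numpy as np

# Number of grid points for g = 10 points per axis: grows as g^d.
g = 10
grid_sizes = {d: g ** d for d in (2, 10, 50)}   # 100, 10**10, 10**50

def rbf_at_data(X, sigma):
    """Gaussian radial basis functions centered at the data, evaluated at the data.

    Returns an n x n array, whatever the dimension d."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))     # n = 20 cases in d = 50 (an HDLSS-style shape)
K = rbf_at_data(X, sigma=5.0)     # 20 x 20 instead of 20 x 10**50
```

The embedded representation now has n columns, so its size is driven by the sample size rather than by the dimension.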

26 Kernel Embedding. Other types of embedding: Explicit, Implicit. Will be studied soon, after the introduction to Support Vector Machines …

27 Kernel Embedding. There are generalizations of this idea to other types of analysis, and some clever computational ideas. E.g. “Kernel based, nonlinear Principal Components Analysis”. Ref: Schölkopf, Smola and Müller (1998).

28 Support Vector Machines. Motivation: find a linear method that “works well” for embedded data. Note: embedded data are very non-Gaussian, which suggests the value of a really new approach.

29 Support Vector Machines. Classical references: Vapnik (1982); Boser, Guyon & Vapnik (1992); Vapnik (1995). Excellent Web Resource:

30 Support Vector Machines. Recommended tutorial: Burges (1998). Recommended monographs: Cristianini & Shawe-Taylor (2000); Schölkopf & Smola (2002).

31 Support Vector Machines. Graphical view, using a toy example: find a separating plane that maximizes the distances from the data to the plane, in particular the smallest distance. The data points closest to the plane are called support vectors; the gap between the classes is called the margin.
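
The quantities in this graphical view can be computed directly for any candidate plane. The sketch below only evaluates a hand-picked plane on four made-up points; it does not search for the maximizing plane, and the function name `distances_to_plane` is hypothetical.

```python
import numpy as np

def distances_to_plane(X, y, w, b):
    """Signed distance of each row of X to the plane {x : w.x + b = 0}.

    Positive means the point lies on the side matching its label y (+1 or -1)."""
    return y * (X @ w + b) / np.linalg.norm(w)

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-4.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0    # candidate plane chosen by hand, not optimized

dist = distances_to_plane(X, y, w, b)
smallest = dist.min()                                  # the quantity the SVM maximizes
closest = np.flatnonzero(np.isclose(dist, smallest))   # points playing the support vector role
gap = dist[y == 1].min() + dist[y == -1].min()         # width of the gap between the classes
```

An SVM solver would vary `w` and `b` to make `smallest` as large as possible.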

32 SVMs, Optimization Viewpoint. Formulate the optimization problem based on: data (feature) vectors; class labels; normal vector; location (determines intercept); residuals (right side); residuals (wrong side). Solve (convex problem) by quadratic programming.
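
For orientation, one common textbook way to write this quadratic program is given below. The notation is an assumption, not necessarily that of the slides: data vectors x_i, class labels y_i in {+1, -1}, normal vector w, location b, slack variables ξ_i for points with residuals on the wrong side, and a tuning parameter C.

```latex
\begin{align*}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\,\|w\|^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & y_i\,(w^{t} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,n .
\end{align*}
```

The objective is quadratic and the constraints are linear, which is why quadratic programming applies.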

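33 SVMs, Optimization Viewpoint. Lagrange multipliers primal formulation (separable case): minimize the primal Lagrangian, which introduces one Lagrange multiplier per data constraint. Dual Lagrangian version: maximize the dual Lagrangian over the multipliers. The solution gives the classification function.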
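
The formulas on this slide did not survive in the transcript. As a hedged reconstruction, the standard separable-case expressions (in the notation of tutorials such as Burges (1998), with assumed symbols x_i, y_i, w, b and multipliers α_i ≥ 0) are:

```latex
\begin{align*}
L_P &= \tfrac{1}{2}\,\|w\|^{2} - \sum_{i=1}^{n} \alpha_i \left[\, y_i\,(w^{t} x_i + b) - 1 \,\right],
      \qquad \alpha_i \ge 0, \\
L_D &= \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
      \alpha_i\, \alpha_j\, y_i\, y_j\, x_i^{t} x_j,
      \qquad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i\, y_i = 0, \\
f(x) &= \sum_{i=1}^{n} \alpha_i\, y_i\, x_i^{t} x + b .
\end{align*}
```

L_P is minimized over w and b, L_D is maximized over the α_i, and a new point x is assigned to a class by the sign of f(x).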

34 SVMs, Computation. Major computational point: the classifier depends on the data only through inner products! Thus it is enough to store only the inner products, which creates big savings in optimization, especially for HDLSS data. But it also creates variations in kernel embedding (interpretation?!?). This is almost always done in practice.
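
A small numerical check of this point, under assumptions: the multipliers `alpha` below are random placeholders rather than the output of an optimizer, and the sizes n = 10, d = 1000 are chosen only to mimic an HDLSS setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 1000                        # few cases, many dimensions (HDLSS-style)
X = rng.normal(size=(n, d))            # rows are data vectors
y = np.where(np.arange(n) < n // 2, 1.0, -1.0)

# All pairwise inner products: n x n, far smaller than the n x d data when d >> n.
gram = X @ X.T

alpha = rng.uniform(size=n)            # hypothetical multipliers, NOT from an optimizer
b = 0.0
x_new = rng.normal(size=d)

# Classification function written only with inner products <x_i, x_new>.
f_inner = (alpha * y) @ (X @ x_new) + b

# The same value via an explicit normal vector w = sum_i alpha_i y_i x_i.
w = (alpha * y) @ X
f_direct = w @ x_new + b
```

The two values agree, so the optimization and the classifier never need the d-dimensional coordinates themselves, only the inner products.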

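35 SVMs, Computation & Embedding. For an “Embedding Map”, explicit embedding: maximize the dual Lagrangian with each data vector replaced by its embedded image, and get the classification function in the same way. This is a straightforward application of embedding, but it loses the inner product advantage.

36 SVMs, Computation & Embedding. Implicit embedding: maximize the dual Lagrangian with a nonlinear kernel function replacing the inner products of the data vectors, and get the classification function in the same way. Still defined only via inner products; retains the optimization advantage; thus used very commonly. Comparison to explicit embedding? Which is “better”???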
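
One case where the two views coincide can be checked numerically. The sketch assumes a degree-2 polynomial kernel in two dimensions; the map `phi` and the function `poly_kernel` are illustrative names, and the slides may have used a different embedding map.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 embedding of a 2-d point into 6 dimensions."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1 ** 2, r2 * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Implicit version: a nonlinear function of the original inner product."""
    return (1.0 + x @ z) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])

explicit = phi(x) @ phi(z)     # inner product computed in the embedded space
implicit = poly_kernel(x, z)   # same quantity, computed from x.z in the original space
```

For this kernel the implicit computation reproduces the explicit inner product without ever forming the embedded vectors. For other kernels, such as the Gaussian, the matching embedding is not finite-dimensional, which is one reason the comparison raised on the slide is not straightforward.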

37 SVMs & Robustness. Usually not severely affected by outliers, but a possible weakness: can have very influential points. Toy example: only 2 points drive the SVM. Notes: huge range of chosen hyperplanes, but all are “pretty good discriminators”. Only happens when the whole range is OK??? Good or bad?

38 SVMs & Robustness. Effect of violators (toy example): depends on the distance to the plane; weak for violators nearby, strong as they move away. Can have a major impact on the plane. Also depends on the tuning parameter C.

39 SVMs, Computation. Caution: available algorithms are not created equal. Toy example: Gunn's Matlab code vs. Todd's Matlab code. Serious errors in Gunn's version: it does not find the real optimum …

40 SVMs, Tuning Parameter. Recall the regularization parameter C: controls the penalty for violation, i.e. lying on the wrong side of the plane; appears in the slack variables; affects the performance of the SVM. Toy example: d = 50, spherical Gaussian data.

41 SVMs, Tuning Parameter. Toy example: d = 50, spherical Gaussian data. X-axis: optimal direction; other axis: SVM direction. Small C: where is the margin? Small angle to optimal (generalizable). Large C: more data piling; larger angle (less generalizable); bigger gap (but maybe not better???). Between: very small range.

42 SVMs, Tuning Parameter. Toy example: d = 50, spherical Gaussian data. Careful look at small C: put MD on the horizontal axis. Shows SVM and MD are the same for small C (mathematics behind this?). They separate for large C (no data piling for MD).

43 Distance Weighted Discrimination. Improvement of SVM for HDLSS data. Toy example (similar to earlier movie).

44 Distance Weighted Discrimination. Toy example: Maximal Data Piling direction: perfect separation, gross overfitting, large angle, poor generalizability.

45 Distance Weighted Discrimination. Toy example: Support Vector Machine direction: bigger gap, smaller angle, better generalizability. Feels support vectors too strongly??? Ugly subpopulations? Improvement?

46 Distance Weighted Discrimination. Toy example: Distance Weighted Discrimination: addresses these issues; smaller angle, better generalizability, nice subpopulations; replaces minimum distance by average distance.

47 Distance Weighted Discrimination. Based on an optimization problem on the distances from the data to the plane. More precisely: work in an appropriate penalty for violations. Optimization method: Second Order Cone Programming, a “still convex” generalization of quadratic programming. Allows a fast greedy solution. Can use available fast software (SDPT3, Michael Todd, et al).
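
The optimization problem itself is missing from the transcript. As a hedged reconstruction, the form in which it is usually stated in the DWD literature is given below, with assumed symbols: data vectors x_i, labels y_i in {+1, -1}, normal vector w, location b, residuals r_i, violation penalties ξ_i and tuning parameter C.

```latex
\begin{align*}
\min_{w,\,b,\,\xi}\quad & \sum_{i=1}^{n} \frac{1}{r_i} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & r_i = y_i\,(w^{t} x_i + b) + \xi_i, \qquad r_i \ge 0, \qquad
                         \xi_i \ge 0, \qquad \|w\|^{2} \le 1 .
\end{align*}
```

Because every r_i enters the sum of reciprocals, every data point has some influence on the direction, in contrast to the SVM, where only the support vectors do.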

48 Distance Weighted Discrimination. 2-d visualization: pushes the plane away from the data; all points have some influence.

49 UNC, Stat & OR. DWD Batch and Source Adjustment. Recall from Class Meeting, 9/6/05: for Perou's Stanford Breast Cancer Data, analysis in Benito, et al (2004) Bioinformatics. Use DWD as a useful direction vector to: adjust for source effects (different sources of mRNA); adjust for batch effects (arrays fabricated at different times).
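
The adjustment step can be sketched as a rigid shift of each batch along a given direction. This is an illustration under assumptions, not the code used for the analysis above: the function name `shift_along_direction` is hypothetical, the data are simulated, and a mean-difference direction stands in for the DWD direction, which would come from a DWD solver.

```python
import numpy as np

def shift_along_direction(X, batch, w):
    """Shift each batch rigidly along direction w so its projected mean becomes 0.

    X: n x d data (rows are cases); batch: length-n labels; w: direction vector."""
    w = w / np.linalg.norm(w)
    X_adj = X.astype(float).copy()
    for label in np.unique(batch):
        rows = batch == label
        proj_mean = (X_adj[rows] @ w).mean()   # mean of the projections onto w
        X_adj[rows] -= proj_mean * w           # move the whole batch along w only
    return X_adj

rng = np.random.default_rng(0)
d = 20
offset = np.zeros(d)
offset[0] = 3.0                                # simulated batch effect in one coordinate
X = np.vstack([rng.normal(size=(15, d)) + offset,
               rng.normal(size=(25, d)) - offset])
batch = np.array([0] * 15 + [1] * 25)

# Stand-in direction: difference of batch means (the slides use the DWD direction).
w = X[batch == 0].mean(axis=0) - X[batch == 1].mean(axis=0)
X_adj = shift_along_direction(X, batch, w)
```

Only the component along `w` changes, so structure orthogonal to the direction, including within-batch shape, is left untouched.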

50 DWD Adjustment: Biological Class Colors & Symbols.

51 DWD Adjustment: Source Colors.

52 DWD Adjustment: Source Adjusted, PCA view.

53 DWD Adjustment: Source Adjusted, Class Colored.

54 DWD Adjustment: Source & Batch Adjusted, Adjusted PCA.

55 Why not adjust using SVM? Major problem: projected distributional shape. Triangular distributions (opposite skewed) do not allow a sensible rigid shift.

56 Why not adjust using SVM? Nicely fixed by DWD: projected distributions are near Gaussian, so it is sensible to shift.

57 Why not adjust by means? DWD is complicated: value added? Xuxin Liu example … Key is the sizes of the biological subtypes: a differing ratio trips up the mean, but DWD is more robust (although still not perfect).

58 Twiddle ratios of subtypes. (Link to movie.)

59 DWD in Face Recognition, I. Face images as data (with M. Benito & D. Peña), registered using landmarks. Male-Female difference? Discrimination rule?

60 DWD in Face Recognition, II. DWD direction: good separation; images “make sense”. Garbage at ends? (extrapolation effects?)

61 DWD in Face Recognition, III. Interesting summary: jump between means (in DWD direction). Clear separation of Maleness vs. Femaleness.

62 DWD in Face Recognition, IV. Fun comparison: jump between means (in SVM direction). Also distinguishes Maleness vs. Femaleness, but not as well as DWD.

63 DWD in Face Recognition, V. Analysis of difference: project onto normals. SVM has “small gap” (feels noise artifacts?); DWD is “more informative” (feels real structure?).

64 DWD in Face Recognition, VI. Current work: focus on “drivers” (regions of interest). Relation to discrimination? Which is “best”? Lessons for human perception?


68 Multi-Class SVMs. Lee, Y., Lin, Y. and Wahba, G. (2002) "Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data", U. Wisc. TR 1064. So far only have the “implicit” version; a “direction based” variation is unknown.

