Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January.

Similar presentations


Presentation on theme: "1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January."— Presentation transcript:

1 1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January 19, 2016

2 2 UNC, Stat & OR Interdisciplinary Relationship How does: Statistics Relate to: Mathematics? (probability, optimization, geometry, …)

3 3 UNC, Stat & OR Statistics - Mathematics Relationship Mathematical Statistics: Validation of existing methods Asymptotics (n  ∞) & Taylor expansion Comparison of existing methods (requires hard math, but really “accounting”???)

4 4 UNC, Stat & OR Statistics - Mathematics Relationship Suggested New Relationship: Put Mathematics to work to Generate New Statistical Ideas/Approaches (publishable in the Ann. Stat.???)

5 5 UNC, Stat & OR Personal Opinions on Mathematical Statistics What is Mathematical Statistics? Validation of existing methods Asymptotics (n  ∞) & Taylor expansion Comparison of existing methods (requires hard math, but really “accounting”???)

6 6 UNC, Stat & OR Personal Opinions on Mathematical Statistics What could Mathematical Statistics be? Basis for invention of new methods Complicated data  mathematical ideas Do we value creativity? Since we don’t do this, others do… (where are the $$$s???)

7 7 UNC, Stat & OR Personal Opinions on Mathematical Statistics Since we don’t do this, others do… Pattern Recognition Artificial Intelligence Neural Nets Data Mining Machine Learning ???

8 8 UNC, Stat & OR Personal Opinions on Mathematical Statistics Possible Litmus Test: Creative Statistics  Clinical Trials Viewpoint: Worst Imaginable Idea  Mathematical Statistics Viewpoint: ???

9 9 UNC, Stat & OR Object Oriented Data Analysis, I What is the “atom” of a statistical analysis? 1 st Course: Numbers Multivariate Analysis Course : Vectors Functional Data Analysis: Curves More generally: Data Objects

10 10 UNC, Stat & OR Functional Data Analysis, I Curves as Data Objects Important Duality: Curve Space  Point Cloud Space Illustrate with Travis Gaydos Graphics 2 dim’al curves (easy to visualize)

11 11 UNC, Stat & OR Functional Data Analysis, Toy EG I

12 12 UNC, Stat & OR Functional Data Analysis, Toy EG II

13 13 UNC, Stat & OR Functional Data Analysis, Toy EG III

14 14 UNC, Stat & OR Functional Data Analysis, Toy EG IV

15 15 UNC, Stat & OR Functional Data Analysis, Toy EG V

16 16 UNC, Stat & OR Functional Data Analysis, Toy EG VI

17 17 UNC, Stat & OR Functional Data Analysis, Toy EG VII

18 18 UNC, Stat & OR Functional Data Analysis, Toy EG VIII

19 19 UNC, Stat & OR Functional Data Analysis, Toy EG IX

20 20 UNC, Stat & OR Functional Data Analysis, Toy EG X

21 21 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 1

22 22 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 1

23 23 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 2

24 24 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 2

25 25 UNC, Stat & OR Object Oriented Data Analysis, I What is the “atom” of a statistical analysis? 1 st Course: Numbers Multivariate Analysis Course : Vectors Functional Data Analysis: Curves More generally: Data Objects

26 26 UNC, Stat & OR Object Oriented Data Analysis, II Examples: Medical Image Analysis Images as Data Objects? Shape Representations as Objects Micro-arrays for Gene Expression Just multivariate analysis?

27 27 UNC, Stat & OR Object Oriented Data Analysis, III Typical Goals: Understanding population variation Visualization Principal Component Analysis + Discrimination (a.k.a. Classification) Time Series of Data Objects

28 28 UNC, Stat & OR Object Oriented Data Analysis, IV Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS) Dimension d >> sample size n “Multivariate Analysis” nearly useless Can’t “normalize the data” Land of Opportunity for Statisticians Need for “creative statisticians”

29 29 UNC, Stat & OR Object Oriented Data Analysis, V Major Statistical Challenge, II: Data may live in non-Euclidean space Lie Group / Symmet’c Spaces (manifold data) Trees/Graphs as data objects Interesting Issues: What is “the mean” (pop’n center)? How do we quantify “pop’n variation”?

30 30 UNC, Stat & OR Statistics in Image Analysis, I First Generation Problems: Denoising Segmentation Registration (all about single images)

31 31 UNC, Stat & OR Statistics in Image Analysis, II Second Generation Problems: Populations of Images Understanding Population Variation Discrimination (a.k.a. Classification) Complex Data Structures (& Spaces) HDLSS Statistics

32 32 UNC, Stat & OR HDLSS Statistics in Imaging Why HDLSS (High Dim, Low Sample Size)? Complex 3-d Objects Hard to Represent Often need d = 100’s of parameters Complex 3-d Objects Costly to Segment Often have n = 10’s cases

33 33 UNC, Stat & OR Medical Imaging – A Challenging Example Male Pelvis Bladder – Prostate – Rectum How do they move over time (days)? Critical to Radiation Treatment (cancer) Work with 3-d CT Very Challenging to Segment Find boundary of each object? Represent each Object?

34 34 UNC, Stat & OR Male Pelvis – Raw Data One CT Slice (in 3d image) Coccyx (Tail Bone) Rectum Bladder

35 35 UNC, Stat & OR Male Pelvis – Raw Data Bladder: manual segmentation Slice by slice Reassembled

36 36 UNC, Stat & OR Male Pelvis – Raw Data Bladder: Slices: Reassembled in 3d How to represent? Thanks: Ja-Yeon Jeong

37 37 UNC, Stat & OR Object Representation Landmarks (hard to find) Boundary Rep’ns (no correspondence) Medial representations Find “skeleton” Discretize as “atoms” called M-reps

38 38 UNC, Stat & OR 3-d m-reps Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong) Medial Atoms provide “skeleton” Implied Boundary from “spokes”  “surface”

39 39 UNC, Stat & OR 3-d m-reps M-rep model fitting Easy, when starting from binary (blue) But very expensive (30 – 40 minutes technician’s time) Want automatic approach Challenging, because of poor contrast, noise, … Need to borrow information across training sample Use Bayes approach: prior & likelihood  posterior ~Conjugate Gaussians, but there are issues: Major HLDSS challenges Manifold aspect of data

40 40 UNC, Stat & OR Illuminating Viewpoint Object Space  Feature Space Focus here on collection of data objects Here conceptualize population structure via “point clouds”

41 41 UNC, Stat & OR Personal HDLSS Viewpoint: Data Images (cases) are “Points” In Feature Space Features are Axes Data set is “Point Clouds” Use Proj’ns to visualize

42 42 UNC, Stat & OR Personal HDLSS Viewpoint: PCA Rotated Axes Often Insightful One set of Dir’ns Others Useful, too

43 43 UNC, Stat & OR Cornea Data, I Images as data ~42 Cornea Images Outer surface of eye Heat map of curvature (in radial direction) Hard to understand “population structure”

44 44 UNC, Stat & OR Cornea Data, II PC 1 Starts at Pop’n Mean Overall Curvature Vertical Astigmatism Correlated! Gaussian Projections Visualization: Can’t Overlay (so use movie)

45 45 UNC, Stat & OR Cornea Data, III PC 2 Horrible Outlier! (present in data) But look only in center: Steep at top -- bottom Want Robust PCA For HDLSS data ???

46 46 UNC, Stat & OR Cornea Data, IV Robust PC 2 No outlier impact See top – bottom variation Projections now Gaussian

47 47 UNC, Stat & OR PCA for m-reps, I Major issue: m-reps live in (locations, radius and angles) E.g. “average” of: = ??? Natural Data Structure is: Lie Groups ~ Symmetric spaces (smooth, curved manifolds)

48 48 UNC, Stat & OR PCA for m-reps, II PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces) T. Fletcher: Principal Geodesic Analysis Idea: replace “linear summary of data” With “geodesic summary of data”…

49 49 UNC, Stat & OR PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

50 50 UNC, Stat & OR PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

51 51 UNC, Stat & OR PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

52 52 UNC, Stat & OR HDLSS Classification (i.e. Discrimination) Background: Two Class (Binary) version: Using “training data” from Class +1, and from Class -1 Develop a “rule” for assigning new data to a Class Canonical Example: Disease Diagnosis New Patients are “Healthy” or “Ill” Determined based on measurements

53 53 UNC, Stat & OR HDLSS Classification (Cont.) Ineffective Methods: Fisher Linear Discrimination Gaussian Likelihood Ratio Less Useful Methods: Nearest Neighbors Neural Nets (“black boxes”, no “directions” or intuition)

54 54 UNC, Stat & OR HDLSS Classification (Cont.) Currently Fashionable Methods: Support Vector Machines Trees Based Approaches New High Tech Method Distance Weighted Discrimination (DWD) Specially designed for HDLSS data Avoids “data piling” problem of SVM Solves more suitable optimization problem

55 55 UNC, Stat & OR HDLSS Classification (Cont.) Currently Fashionable Methods: Trees Based Approaches Support Vector Machines:

56 56 UNC, Stat & OR Kernel Embedding Idea Aizerman, Braverman, Rozoner (1964) Make data linearly separable by embedding in higher dimensional space

57 57 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions

58 58 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions

59 59 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions

60 60 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions

61 61 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions Distributional Assumptions in Embedded Space? ǁ ˅ Support Vector Machine

62 62 UNC, Stat & OR HDLSS Classification (Cont.) Comparison of Linear Methods (toy data): Optimal Direction Excellent, but need dir’n in dim = 50 Maximal Data Piling (J. Y. Ahn, D. Peña) Great separation, but generalizability??? Support Vector Machine More separation, gen’ity, but some data piling? Distance Weighted Discrimination Avoids data piling, good gen’ity, Gaussians?

63 63 UNC, Stat & OR Distance Weighted Discrimination Maximal Data Piling

64 64 UNC, Stat & OR Distance Weighted Discrimination Based on Optimization Problem: More precisely work in appropriate penalty for violations Optimization Method (Michael Todd): Second Order Cone Programming Still Convex gen’tion of quadratic prog’ing Fast greedy solution Can use existing software

65 65 UNC, Stat & OR Simulation Comparison E.G. Above Gaussians: Wide array of dim’s SVM Subst’ly worse MD – Bayes Optimal DWD close to MD

66 66 UNC, Stat & OR Simulation Comparison E.G. Outlier Mixture: Disaster for MD SVM & DWD much more solid Dir’ns are “robust” SVM & DWD similar

67 67 UNC, Stat & OR Simulation Comparison E.G. Wobble Mixture: Disaster for MD SVM less good DWD slightly better Note: All methods come together for larger d ???

68 68 UNC, Stat & OR DWD Bias Adjustment for Microarrays Microarray data: Simult. Measur’ts of “gene expression” Intrinsically HDLSS Dimension d ~ 1,000s – 10,000s Sample Sizes n ~ 10s – 100s My view: Each array is “point in cloud”

69 69 UNC, Stat & OR DWD Batch and Source Adjustment For Perou ’ s Stanford Breast Cancer Data Analysis in Benito, et al (2004) Bioinformatics https://genome.unc.edu/pubsup/dwd/ Adjust for Source Effects Different sources of mRNA Adjust for Batch Effects Arrays fabricated at different times

70 70 UNC, Stat & OR DWD Adj: Raw Breast Cancer data

71 71 UNC, Stat & OR DWD Adj: Source Colors

72 72 UNC, Stat & OR DWD Adj: Batch Colors

73 73 UNC, Stat & OR DWD Adj: Biological Class Colors

74 74 UNC, Stat & OR DWD Adj: Biological Class Colors & Symbols

75 75 UNC, Stat & OR DWD Adj: Biological Class Symbols

76 76 UNC, Stat & OR DWD Adj: Source Colors

77 77 UNC, Stat & OR DWD Adj: PC 1-2 & DWD direction

78 78 UNC, Stat & OR DWD Adj: DWD Source Adjustment

79 79 UNC, Stat & OR DWD Adj: Source Adj’d, PCA view

80 80 UNC, Stat & OR DWD Adj: Source Adj’d, Class Colored

81 81 UNC, Stat & OR DWD Adj: Source Adj’d, Batch Colored

82 82 UNC, Stat & OR DWD Adj: Source Adj’d, 5 PCs

83 83 UNC, Stat & OR DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD

84 84 UNC, Stat & OR DWD Adj: S. & B1,2 vs. 3 Adjusted

85 85 UNC, Stat & OR DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs

86 86 UNC, Stat & OR DWD Adj: S. & B Adj’d, B1 vs. 2 DWD

87 87 UNC, Stat & OR DWD Adj: S. & B Adj’d, B1 vs. 2 Adj’d

88 88 UNC, Stat & OR DWD Adj: S. & B Adj’d, 5 PC view

89 89 UNC, Stat & OR DWD Adj: S. & B Adj’d, 4 PC view

90 90 UNC, Stat & OR DWD Adj: S. & B Adj’d, Class Colors

91 91 UNC, Stat & OR DWD Adj: S. & B Adj’d, Adj’d PCA

92 92 UNC, Stat & OR DWD Bias Adjustment for Microarrays Effective for Batch and Source Adj. Also works for cross-platform Adj. E.g. cDNA & Affy Despite literature claiming contrary “Gene by Gene” vs. “Multivariate” views Funded as part of caBIG “Cancer BioInformatics Grid” “Data Combination Effort” of NCI

93 93 UNC, Stat & OR Interesting Benchmark Data Set NCI 60 Cell Lines Interesting benchmark, since same cells Data Web available: http://discover.nci.nih.gov/datasetsNature2000.jsp Both cDNA and Affymetrix Platforms 8 Major cancer subtypes Use DWD now for visualization

94 94 UNC, Stat & OR NCI 60: Fully Adjusted Data, Leukemia Cluster LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR

95 95 UNC, Stat & OR NCI 60: Views using DWD Dir’ns (focus on biology)

96 96 UNC, Stat & OR Why not adjust by means? DWD is complicated: value added? Xuxin Liu example… Key is sizes of biological subtypes Differing ratio trips up mean But DWD more robust (although still not perfect)

97 97 UNC, Stat & OR Twiddle ratios of subtypes

98 98 UNC, Stat & OR DWD in Face Recognition, I Face Images as Data (with M. Benito & D. Peña) Registered using landmarks Male – Female Difference? Discrimination Rule?

99 99 UNC, Stat & OR DWD in Face Recognition, II DWD Direction Good separation Images “make sense” Garbage at ends? (extrapolation effects?)

100 100 UNC, Stat & OR DWD in Face Recognition, III Interesting summary: Jump between means (in DWD direction) Clear separation of Maleness vs. Femaleness

101 101 UNC, Stat & OR DWD in Face Recognition, IV Fun Comparison: Jump between means (in SVM direction) Also distinguishes Maleness vs. Femaleness But not as well as DWD

102 102 UNC, Stat & OR DWD in Face Recognition, V Analysis of difference: Project onto normals SVM has “small gap” (feels noise artifacts?) DWD “more informative” (feels real structure?)

103 103 UNC, Stat & OR DWD in Face Recognition, VI Current Work: Focus on “drivers”: (regions of interest) Relation to Discr’n? Which is “best”? Lessons for human perception?

104 104 UNC, Stat & OR Time Series of Curves Chemical Spectra, evolving over time (with J. Wendelberger & E. Kober) Mortality curves changing in time (with Andres Alonzo)

105 105 UNC, Stat & OR Discrimination for m-reps Classification for Lie Groups – Symm. Spaces S. K. Sen, S. Joshi & M. Foskey What is “separating plane” (for SVM-DWD)?

106 106 UNC, Stat & OR Blood vessel tree data Marron’s brain:  Segmented from MRA  Reconstruct trees  in 3d  Rotate to view

107 107 UNC, Stat & OR Blood vessel tree data Marron’s brain:  Segmented from MRA  Reconstruct trees  in 3d  Rotate to view

108 108 UNC, Stat & OR Blood vessel tree data Marron’s brain:  Segmented from MRA  Reconstruct trees  in 3d  Rotate to view

109 109 UNC, Stat & OR Blood vessel tree data Marron’s brain:  Segmented from MRA  Reconstruct trees  in 3d  Rotate to view

110 110 UNC, Stat & OR Marron’s brain:  Segmented from MRA  Reconstruct trees  in 3d  Rotate to view Blood vessel tree data

111 111 UNC, Stat & OR Blood vessel tree data Marron’s brain:  Segmented from MRA  Reconstruct trees  in 3d  Rotate to view

112 112 UNC, Stat & OR Blood vessel tree data Now look over many people (data objects) Structure of population (understand variation?) PCA in strongly non-Euclidean Space???,...,,

113 113 UNC, Stat & OR Blood vessel tree data Possible focus of analysis: Connectivity structure only (topology) Location, size, orientation of segments Structure within each vessel segment,...,,

114 114 UNC, Stat & OR Blood vessel tree data Present Focus: Topology only  Already challenging  Later address others  Then add attributes  To tree nodes  And extend analysis

115 115 UNC, Stat & OR Blood vessel tree data The tree team:  Very Interdsciplinary  Neurosurgery:  Bullitt, Ladha  Statistics:  Wang, Marron  Optimization:  Aydin, Pataki

116 116 UNC, Stat & OR Blood vessel tree data Recall from above: Marron’s brain:  Focus on back  Connectivity (topology) only

117 117 UNC, Stat & OR Blood vessel tree data Present Focus:  Topology only  Raw data as trees  Marron’s reduced tree  Back tree only

118 118 UNC, Stat & OR Blood vessel tree data Topology only E.g. Back Trees Full Population Study as movie Understand variation?

119 119 UNC, Stat & OR Strongly Non-Euclidean Spaces Statistics on Population of Tree-Structured Data Objects? Mean??? Analog of PCA??? Strongly non-Euclidean, since: Space of trees not a linear space Not even approximately linear (no tangent plane)

120 120 UNC, Stat & OR Mildly Non-Euclidean Spaces Useful View of Manifold Data: Tangent Space Center: Frech é t Mean Reason for terminology “ mildly non Euclidean ”

121 121 UNC, Stat & OR Strongly Non-Euclidean Spaces Mean of Population of Tree-Structured Data Objects? Natural approach: Fr é chet mean Requires a metric (distance) on tree space

122 122 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space? Recall Conventional PCA: Directions that explain structure in data Data are points in point cloud 1-d and 2-d projections allow insights about population structure

123 123 UNC, Stat & OR Illust’n of PCA View: PC1 Projections

124 124 UNC, Stat & OR Illust’n of PCA View: Projections on PC1,2 plane

125 125 UNC, Stat & OR Source Batch Adj: PC 1-3 & DWD direction

126 126 UNC, Stat & OR Source Batch Adj: DWD Source Adjustment

127 127 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space? Key Idea (Jim Ramsay): Replace 1-d subspace that best approximates data By 1-d representation that best approximates data Wang and Marron (2007) define notion of Treeline (in structure space)

128 128 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Best 1-d representation of data Basic idea: From some starting tree Grow only in 1 “direction”

129 129 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Best 1-d representation of data Problem: Hard to compute In particular: to solve optimization problem Wang and Marron (2007) Maximum 4 vessel trees Hard to tackle serious trees (e.g. blood vessel trees)

130 130 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Problem: Hard to compute Solution: Burcu Aydin & Gabor Pataki (linear time algorithm) (based on clever “reformulation” of problem)

131 131 UNC, Stat & OR PCA for blood vessel tree data PCA on Tree Space: Treelines Interesting to compare: Population of Left Trees Population of Right Trees Population of Back Trees And to study 1 st, 2 nd, 3 rd & 4 th treelines

132 132 UNC, Stat & OR PCA for blood vessel tree data Study “Directions” 1, 2, 3, 4 For sub- populations B, L, R (interpret later)

133 133 UNC, Stat & OR PCA for blood vessel tree data Notes on Treeline Directions: PC1 always to left BACK has most variation to right (PC2) LEFT has more varia’n to 2 nd level (PC2) RIGHT has more var’n to 1 st level (PC2) See these in the data?

134 134 UNC, Stat & OR PCA for blood vessel tree data Notes: PC1 – all left PC2: BACK - right LEFT 2 nd lev RIGHT 1 st lev See these??

135 135 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Next represent data as projections Define as closest point in tree line (same as Euclidean PCA) Have corresponding score (length of projection along line) And analog of residual (distance from data point to projection)

136 136 UNC, Stat & OR PCA for blood vessel tree data Individual (each PC separately) Scores Plot

137 137 UNC, Stat & OR PCA for blood vessel tree data Data Analytic Goals: Age, Gender See these? No…

138 138 UNC, Stat & OR PCA for blood vessel tree data Directly study age  PC scores PC1 + PC2 - Thickness Not Sig’t - Descendants Left Sig’t

139 139 UNC, Stat & OR Upcoming New Approach Replace Tree-Lines by Tree-Curves:

140 140 UNC, Stat & OR Upcoming New Approach Projections on Tree-Curves:

141 141 UNC, Stat & OR Preliminary Tree-Curve Results First Correlation Of Structure To Age! (Back Trees)

142 142 UNC, Stat & OR Preliminary Tree-Curve Results But does not appear everywhere (Left Trees) Finding locality!

143 143 UNC, Stat & OR HDLSS Asymptotics Why study asymptotics?

144 144 UNC, Stat & OR HDLSS Asymptotics Why study asymptotics?  An interesting (naïve) quote: “I don’t look at asymptotics, because I don’t have an infinite sample size”

145 145 UNC, Stat & OR HDLSS Asymptotics Why study asymptotics?  An interesting (naïve) quote: “I don’t look at asymptotics, because I don’t have an infinite sample size”  Suggested perspective: Asymptotics are a tool for finding simple structure underlying complex entities

146 146 UNC, Stat & OR HDLSS Asymptotics Which asymptotics?  n  ∞ (classical, very widely done)  d  ∞ ???  Sensible?  Follow typical “sampling process”?  Say anything, as noise level increases???

147 147 UNC, Stat & OR HDLSS Asymptotics Which asymptotics?  n  ∞ & d  ∞  n >> d: a few results around (still have classical info in data)  n ~ d: random matrices (Iain J., et al) (nothing classically estimable)  HDLSS asymptotics: n fixed, d  ∞

148 148 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d  ∞  Follow typical “sampling process”?

149 149 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d  ∞  Follow typical “sampling process”?  Microarrays: # genes bounded  Proteomics, SNPs, …  A moot point, from perspective: Asymptotics are a tool for finding simple structure underlying complex entities

150 150 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d  ∞  Say anything, as noise level increases???

151 151 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d  ∞  Say anything, as noise level increases??? Yes, there exists simple, perhaps surprising, underlying structure

152 152 UNC, Stat & OR HDLSS Asymptotics: Simple Paradoxes, I For dim’al “Standard Normal” dist’n: Euclidean Distance to Origin (as ): - Data lie roughly on surface of sphere of radius - Yet origin is point of “highest density”??? - Paradox resolved by: “density w. r. t. Lebesgue Measure”

153 153 UNC, Stat & OR HDLSS Asymptotics: Simple Paradoxes, II For dim’al “Standard Normal” dist’n: indep. of Euclidean Dist. between and (as ): Distance tends to non-random constant: Can extend to Where do they all go??? (we can only perceive 3 dim’ns)

154 154 UNC, Stat & OR HDLSS Asymptotics: Simple Paradoxes, III For dim’al “Standard Normal” dist’n: indep. of High dim’al Angles (as ): - -“Everything is orthogonal”??? - Where do they all go??? (again our perceptual limitations) - Again 1st order structure is non-random

155 155 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, I Assume, let Study Subspace Generated by Data a. Hyperplane through 0, of dimension b. Points are “nearly equidistant to 0”, & dist c. Within plane, can “rotate towards Unit Simplex” d. All Gaussian data sets are“near Unit Simplex Vertices”!!! “Randomness” appears only in rotation of simplex With P. Hall & A. Neeman

156 156 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, II Assume, let Study Hyperplane Generated by Data a. dimensional hyperplane b. Points are pairwise equidistant, dist c. Points lie at vertices of “regular hedron” d. Again “randomness in data” is only in rotation e. Surprisingly rigid structure in data?

157 157 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, III Simulation View: shows “rigidity after rotation”

158 158 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, III Straightforward Generalizations: non-Gaussian data: only need moments non-independent: use “mixing conditions” (with P. Hall & A. Neeman) Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi) Mixing Condition on Stand’d & Permuted Var’s (with S. Jung) All based on simple “Laws of Large Numbers”

159 159 UNC, Stat & OR 2 nd Paper on HDLSS Asymptotics Ahn, Marron, Muller & Chi (2007) Biometrika  Assume 2 nd Moments (and Gaussian)  Assume no eigenvalues too large in sense: For assume i.e. (min possible) (much weaker than previous mixing conditions…)

160 160 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, IV Explanation of Observed (Simulation) Behavior: “everything similar for very high d” 2 popn’s are 2 simplices (i.e. regular n-hedrons) All are same distance from the other class i.e. everything is a support vector i.e. all sensible directions show “data piling” so “sensible methods are all nearly the same” Including 1 - NN

161 161 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, V Further Consequences of Geometric Representation 1. Inefficiency of DWD for uneven sample size (motivates “weighted version”, work in progress) 2. DWD more “stable” than SVM (based on “deeper limiting distributions”) (reflects intuitive idea “feeling sampling variation”) (something like “mean vs. median”) 3. 1-NN rule inefficiency is quantified.

162 162 UNC, Stat & OR HDLSS Math. Stat. of PCA, I Consistency & Strong Inconsistency: Spike Covariance Model (Johnstone & Paul) For Eigenvalues: 1 st Eigenvector: How good are empirical versions, as estimates?

163 163 UNC, Stat & OR HDLSS Math. Stat. of PCA, II Consistency (big enough spike): For, Strong Inconsistency (spike not big enough): For,

164 164 UNC, Stat & OR HDLSS Math. Stat. of PCA, III Consistency of eigenvalues?  Eigenvalues Inconsistent  But known distribution  Unless as well

165 165 UNC, Stat & OR HDLSS Work in Progress, I Batch Adjustment: Xuxin Liu Recall Intuition from above: Key is sizes of biological subtypes Differing ratio trips up mean But DWD more robust Mathematics behind this?

166 166 UNC, Stat & OR Liu: Twiddle ratios of subtypes

167 167 UNC, Stat & OR HDLSS Data Combo Mathematics Xuxin Liu Dissertation Results:  Simple Unbalanced Cluster Model  Growing at rate as  Answers depend on Visualization of setting….

168 168 UNC, Stat & OR HDLSS Data Combo Mathematics

169 169 UNC, Stat & OR HDLSS Data Combo Mathematics

170 170 UNC, Stat & OR HDLSS Data Combo Mathematics Asymptotic Results (as ):  For, DWD Consistent Angle(DWD,Truth)  For, DWD Strongly Inconsistent Angle(DWD,Truth)

171 171 UNC, Stat & OR HDLSS Data Combo Mathematics Asymptotic Results (as ):  For, PAM Inconsistent Angle(PAM,Truth)  For, DWD Strongly Inconsistent Angle(PAM,Truth)

172 172 UNC, Stat & OR HDLSS Data Combo Mathematics Value of, for sample size ratio : , only when  Otherwise for, PAM Inconsistent  Verifies intuitive idea in strong way

173 173 UNC, Stat & OR HDLSS Work in Progress, II Canonical Correlations: Myung Hee Lee  Results similar to those for those for PCA  Singular values inconsistent  But directions converge under a much milder spike assumption.

174 174 UNC, Stat & OR HDLSS Work in Progress, III Conditions for Geo. Rep’n & PCA Consist.: John Kent example: Can only say: not deterministic Conclude: need some flavor of mixing

175 175 UNC, Stat & OR HDLSS Work in Progress, III Conditions for Geo. Rep’n & PCA Consist.: Conclude: need some flavor of mixing Challenge: Classical mixing conditions require notion of time ordering Not always clear, e.g. microarrays

176 176 UNC, Stat & OR HDLSS Work in Progress, III Conditions for Geo. Rep’n & PCA Consist.: Sungkyu Jung Condition: where Define: Assume: Ǝ a permutation, So that is ρ-mixing

177 177 UNC, Stat & OR HDLSS Deep Open Problem In PCA Consistency:  Strong Inconsistency - spike  Consistency - spike What happens at boundary ( )???

178 178 UNC, Stat & OR The Future of HDLSS Asymptotics? 1. Address your favorite statistical problem… 2. HDLSS versions of classical optimality results? 3. Continguity Approach (~Random Matrices) 4. Rates of convergence? 5. Improved Discrimination Methods? It is early days…

179 179 UNC, Stat & OR The Future of Geometrical Representation? HDLSS version of “optimality” results? “Contiguity” approach? Params depend on d? Rates of Convergence? Improvements of DWD? (e.g. other functions of distance than inverse) It is still early days …

180 180 UNC, Stat & OR Some Carry Away Lessons Atoms of the Analysis: Object Oriented Viewpoint: Object Space  Feature Space DWD is attractive for HDLSS classification “Randomness” in HDLSS data is only in rotations (Modulo rotation, have constant simplex shape) How to put HDLSS asymptotics to work?

181 181 UNC, Stat & OR Object Oriented Data Analysis Potential Future Opportunity: OODA SAMSI Program 2010-2011 Interested in joining? Let’s talk


Download ppt "1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January."

Similar presentations


Ads by Google