Download presentation
Presentation is loading. Please wait.
Published byCameron Bennett Modified over 9 years ago
1
1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January 19, 2016
2
2 UNC, Stat & OR Interdisciplinary Relationship How does: Statistics Relate to: Mathematics? (probability, optimization, geometry, …)
3
3 UNC, Stat & OR Statistics - Mathematics Relationship Mathematical Statistics: Validation of existing methods Asymptotics (n ∞) & Taylor expansion Comparison of existing methods (requires hard math, but really “accounting”???)
4
4 UNC, Stat & OR Statistics - Mathematics Relationship Suggested New Relationship: Put Mathematics to work to Generate New Statistical Ideas/Approaches (publishable in the Ann. Stat.???)
5
5 UNC, Stat & OR Personal Opinions on Mathematical Statistics What is Mathematical Statistics? Validation of existing methods Asymptotics (n ∞) & Taylor expansion Comparison of existing methods (requires hard math, but really “accounting”???)
6
6 UNC, Stat & OR Personal Opinions on Mathematical Statistics What could Mathematical Statistics be? Basis for invention of new methods Complicated data mathematical ideas Do we value creativity? Since we don’t do this, others do… (where are the $$$s???)
7
7 UNC, Stat & OR Personal Opinions on Mathematical Statistics Since we don’t do this, others do… Pattern Recognition Artificial Intelligence Neural Nets Data Mining Machine Learning ???
8
8 UNC, Stat & OR Personal Opinions on Mathematical Statistics Possible Litmus Test: Creative Statistics Clinical Trials Viewpoint: Worst Imaginable Idea Mathematical Statistics Viewpoint: ???
9
9 UNC, Stat & OR Object Oriented Data Analysis, I What is the “atom” of a statistical analysis? 1 st Course: Numbers Multivariate Analysis Course : Vectors Functional Data Analysis: Curves More generally: Data Objects
10
10 UNC, Stat & OR Functional Data Analysis, I Curves as Data Objects Important Duality: Curve Space Point Cloud Space Illustrate with Travis Gaydos Graphics 2 dim’al curves (easy to visualize)
11
11 UNC, Stat & OR Functional Data Analysis, Toy EG I
12
12 UNC, Stat & OR Functional Data Analysis, Toy EG II
13
13 UNC, Stat & OR Functional Data Analysis, Toy EG III
14
14 UNC, Stat & OR Functional Data Analysis, Toy EG IV
15
15 UNC, Stat & OR Functional Data Analysis, Toy EG V
16
16 UNC, Stat & OR Functional Data Analysis, Toy EG VI
17
17 UNC, Stat & OR Functional Data Analysis, Toy EG VII
18
18 UNC, Stat & OR Functional Data Analysis, Toy EG VIII
19
19 UNC, Stat & OR Functional Data Analysis, Toy EG IX
20
20 UNC, Stat & OR Functional Data Analysis, Toy EG X
21
21 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 1
22
22 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 1
23
23 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 2
24
24 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 2
25
25 UNC, Stat & OR Object Oriented Data Analysis, I What is the “atom” of a statistical analysis? 1 st Course: Numbers Multivariate Analysis Course : Vectors Functional Data Analysis: Curves More generally: Data Objects
26
26 UNC, Stat & OR Object Oriented Data Analysis, II Examples: Medical Image Analysis Images as Data Objects? Shape Representations as Objects Micro-arrays for Gene Expression Just multivariate analysis?
27
27 UNC, Stat & OR Object Oriented Data Analysis, III Typical Goals: Understanding population variation Visualization Principal Component Analysis + Discrimination (a.k.a. Classification) Time Series of Data Objects
28
28 UNC, Stat & OR Object Oriented Data Analysis, IV Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS) Dimension d >> sample size n “Multivariate Analysis” nearly useless Can’t “normalize the data” Land of Opportunity for Statisticians Need for “creative statisticians”
29
29 UNC, Stat & OR Object Oriented Data Analysis, V Major Statistical Challenge, II: Data may live in non-Euclidean space Lie Group / Symmet’c Spaces (manifold data) Trees/Graphs as data objects Interesting Issues: What is “the mean” (pop’n center)? How do we quantify “pop’n variation”?
30
30 UNC, Stat & OR Statistics in Image Analysis, I First Generation Problems: Denoising Segmentation Registration (all about single images)
31
31 UNC, Stat & OR Statistics in Image Analysis, II Second Generation Problems: Populations of Images Understanding Population Variation Discrimination (a.k.a. Classification) Complex Data Structures (& Spaces) HDLSS Statistics
32
32 UNC, Stat & OR HDLSS Statistics in Imaging Why HDLSS (High Dim, Low Sample Size)? Complex 3-d Objects Hard to Represent Often need d = 100’s of parameters Complex 3-d Objects Costly to Segment Often have n = 10’s cases
33
33 UNC, Stat & OR Medical Imaging – A Challenging Example Male Pelvis Bladder – Prostate – Rectum How do they move over time (days)? Critical to Radiation Treatment (cancer) Work with 3-d CT Very Challenging to Segment Find boundary of each object? Represent each Object?
34
34 UNC, Stat & OR Male Pelvis – Raw Data One CT Slice (in 3d image) Coccyx (Tail Bone) Rectum Bladder
35
35 UNC, Stat & OR Male Pelvis – Raw Data Bladder: manual segmentation Slice by slice Reassembled
36
36 UNC, Stat & OR Male Pelvis – Raw Data Bladder: Slices: Reassembled in 3d How to represent? Thanks: Ja-Yeon Jeong
37
37 UNC, Stat & OR Object Representation Landmarks (hard to find) Boundary Rep’ns (no correspondence) Medial representations Find “skeleton” Discretize as “atoms” called M-reps
38
38 UNC, Stat & OR 3-d m-reps Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong) Medial Atoms provide “skeleton” Implied Boundary from “spokes” “surface”
39
39 UNC, Stat & OR 3-d m-reps M-rep model fitting Easy, when starting from binary (blue) But very expensive (30 – 40 minutes technician’s time) Want automatic approach Challenging, because of poor contrast, noise, … Need to borrow information across training sample Use Bayes approach: prior & likelihood posterior ~Conjugate Gaussians, but there are issues: Major HLDSS challenges Manifold aspect of data
40
40 UNC, Stat & OR Illuminating Viewpoint Object Space Feature Space Focus here on collection of data objects Here conceptualize population structure via “point clouds”
41
41 UNC, Stat & OR Personal HDLSS Viewpoint: Data Images (cases) are “Points” In Feature Space Features are Axes Data set is “Point Clouds” Use Proj’ns to visualize
42
42 UNC, Stat & OR Personal HDLSS Viewpoint: PCA Rotated Axes Often Insightful One set of Dir’ns Others Useful, too
43
43 UNC, Stat & OR Cornea Data, I Images as data ~42 Cornea Images Outer surface of eye Heat map of curvature (in radial direction) Hard to understand “population structure”
44
44 UNC, Stat & OR Cornea Data, II PC 1 Starts at Pop’n Mean Overall Curvature Vertical Astigmatism Correlated! Gaussian Projections Visualization: Can’t Overlay (so use movie)
45
45 UNC, Stat & OR Cornea Data, III PC 2 Horrible Outlier! (present in data) But look only in center: Steep at top -- bottom Want Robust PCA For HDLSS data ???
46
46 UNC, Stat & OR Cornea Data, IV Robust PC 2 No outlier impact See top – bottom variation Projections now Gaussian
47
47 UNC, Stat & OR PCA for m-reps, I Major issue: m-reps live in (locations, radius and angles) E.g. “average” of: = ??? Natural Data Structure is: Lie Groups ~ Symmetric spaces (smooth, curved manifolds)
48
48 UNC, Stat & OR PCA for m-reps, II PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces) T. Fletcher: Principal Geodesic Analysis Idea: replace “linear summary of data” With “geodesic summary of data”…
49
49 UNC, Stat & OR PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)
50
50 UNC, Stat & OR PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)
51
51 UNC, Stat & OR PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)
52
52 UNC, Stat & OR HDLSS Classification (i.e. Discrimination) Background: Two Class (Binary) version: Using “training data” from Class +1, and from Class -1 Develop a “rule” for assigning new data to a Class Canonical Example: Disease Diagnosis New Patients are “Healthy” or “Ill” Determined based on measurements
53
53 UNC, Stat & OR HDLSS Classification (Cont.) Ineffective Methods: Fisher Linear Discrimination Gaussian Likelihood Ratio Less Useful Methods: Nearest Neighbors Neural Nets (“black boxes”, no “directions” or intuition)
54
54 UNC, Stat & OR HDLSS Classification (Cont.) Currently Fashionable Methods: Support Vector Machines Trees Based Approaches New High Tech Method Distance Weighted Discrimination (DWD) Specially designed for HDLSS data Avoids “data piling” problem of SVM Solves more suitable optimization problem
55
55 UNC, Stat & OR HDLSS Classification (Cont.) Currently Fashionable Methods: Trees Based Approaches Support Vector Machines:
56
56 UNC, Stat & OR Kernel Embedding Idea Aizerman, Braverman, Rozoner (1964) Make data linearly separable by embedding in higher dimensional space
57
57 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions
58
58 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions
59
59 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions
60
60 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions
61
61 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions Distributional Assumptions in Embedded Space? ǁ ˅ Support Vector Machine
62
62 UNC, Stat & OR HDLSS Classification (Cont.) Comparison of Linear Methods (toy data): Optimal Direction Excellent, but need dir’n in dim = 50 Maximal Data Piling (J. Y. Ahn, D. Peña) Great separation, but generalizability??? Support Vector Machine More separation, gen’ity, but some data piling? Distance Weighted Discrimination Avoids data piling, good gen’ity, Gaussians?
63
63 UNC, Stat & OR Distance Weighted Discrimination Maximal Data Piling
64
64 UNC, Stat & OR Distance Weighted Discrimination Based on Optimization Problem: More precisely work in appropriate penalty for violations Optimization Method (Michael Todd): Second Order Cone Programming Still Convex gen’tion of quadratic prog’ing Fast greedy solution Can use existing software
65
65 UNC, Stat & OR Simulation Comparison E.G. Above Gaussians: Wide array of dim’s SVM Subst’ly worse MD – Bayes Optimal DWD close to MD
66
66 UNC, Stat & OR Simulation Comparison E.G. Outlier Mixture: Disaster for MD SVM & DWD much more solid Dir’ns are “robust” SVM & DWD similar
67
67 UNC, Stat & OR Simulation Comparison E.G. Wobble Mixture: Disaster for MD SVM less good DWD slightly better Note: All methods come together for larger d ???
68
68 UNC, Stat & OR DWD Bias Adjustment for Microarrays Microarray data: Simult. Measur’ts of “gene expression” Intrinsically HDLSS Dimension d ~ 1,000s – 10,000s Sample Sizes n ~ 10s – 100s My view: Each array is “point in cloud”
69
69 UNC, Stat & OR DWD Batch and Source Adjustment For Perou ’ s Stanford Breast Cancer Data Analysis in Benito, et al (2004) Bioinformatics https://genome.unc.edu/pubsup/dwd/ Adjust for Source Effects Different sources of mRNA Adjust for Batch Effects Arrays fabricated at different times
70
70 UNC, Stat & OR DWD Adj: Raw Breast Cancer data
71
71 UNC, Stat & OR DWD Adj: Source Colors
72
72 UNC, Stat & OR DWD Adj: Batch Colors
73
73 UNC, Stat & OR DWD Adj: Biological Class Colors
74
74 UNC, Stat & OR DWD Adj: Biological Class Colors & Symbols
75
75 UNC, Stat & OR DWD Adj: Biological Class Symbols
76
76 UNC, Stat & OR DWD Adj: Source Colors
77
77 UNC, Stat & OR DWD Adj: PC 1-2 & DWD direction
78
78 UNC, Stat & OR DWD Adj: DWD Source Adjustment
79
79 UNC, Stat & OR DWD Adj: Source Adj’d, PCA view
80
80 UNC, Stat & OR DWD Adj: Source Adj’d, Class Colored
81
81 UNC, Stat & OR DWD Adj: Source Adj’d, Batch Colored
82
82 UNC, Stat & OR DWD Adj: Source Adj’d, 5 PCs
83
83 UNC, Stat & OR DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD
84
84 UNC, Stat & OR DWD Adj: S. & B1,2 vs. 3 Adjusted
85
85 UNC, Stat & OR DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs
86
86 UNC, Stat & OR DWD Adj: S. & B Adj’d, B1 vs. 2 DWD
87
87 UNC, Stat & OR DWD Adj: S. & B Adj’d, B1 vs. 2 Adj’d
88
88 UNC, Stat & OR DWD Adj: S. & B Adj’d, 5 PC view
89
89 UNC, Stat & OR DWD Adj: S. & B Adj’d, 4 PC view
90
90 UNC, Stat & OR DWD Adj: S. & B Adj’d, Class Colors
91
91 UNC, Stat & OR DWD Adj: S. & B Adj’d, Adj’d PCA
92
92 UNC, Stat & OR DWD Bias Adjustment for Microarrays Effective for Batch and Source Adj. Also works for cross-platform Adj. E.g. cDNA & Affy Despite literature claiming contrary “Gene by Gene” vs. “Multivariate” views Funded as part of caBIG “Cancer BioInformatics Grid” “Data Combination Effort” of NCI
93
93 UNC, Stat & OR Interesting Benchmark Data Set NCI 60 Cell Lines Interesting benchmark, since same cells Data Web available: http://discover.nci.nih.gov/datasetsNature2000.jsp Both cDNA and Affymetrix Platforms 8 Major cancer subtypes Use DWD now for visualization
94
94 UNC, Stat & OR NCI 60: Fully Adjusted Data, Leukemia Cluster LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR
95
95 UNC, Stat & OR NCI 60: Views using DWD Dir’ns (focus on biology)
96
96 UNC, Stat & OR Why not adjust by means? DWD is complicated: value added? Xuxin Liu example… Key is sizes of biological subtypes Differing ratio trips up mean But DWD more robust (although still not perfect)
97
97 UNC, Stat & OR Twiddle ratios of subtypes
98
98 UNC, Stat & OR DWD in Face Recognition, I Face Images as Data (with M. Benito & D. Peña) Registered using landmarks Male – Female Difference? Discrimination Rule?
99
99 UNC, Stat & OR DWD in Face Recognition, II DWD Direction Good separation Images “make sense” Garbage at ends? (extrapolation effects?)
100
100 UNC, Stat & OR DWD in Face Recognition, III Interesting summary: Jump between means (in DWD direction) Clear separation of Maleness vs. Femaleness
101
101 UNC, Stat & OR DWD in Face Recognition, IV Fun Comparison: Jump between means (in SVM direction) Also distinguishes Maleness vs. Femaleness But not as well as DWD
102
102 UNC, Stat & OR DWD in Face Recognition, V Analysis of difference: Project onto normals SVM has “small gap” (feels noise artifacts?) DWD “more informative” (feels real structure?)
103
103 UNC, Stat & OR DWD in Face Recognition, VI Current Work: Focus on “drivers”: (regions of interest) Relation to Discr’n? Which is “best”? Lessons for human perception?
104
104 UNC, Stat & OR Time Series of Curves Chemical Spectra, evolving over time (with J. Wendelberger & E. Kober) Mortality curves changing in time (with Andres Alonzo)
105
105 UNC, Stat & OR Discrimination for m-reps Classification for Lie Groups – Symm. Spaces S. K. Sen, S. Joshi & M. Foskey What is “separating plane” (for SVM-DWD)?
106
106 UNC, Stat & OR Blood vessel tree data Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view
107
107 UNC, Stat & OR Blood vessel tree data Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view
108
108 UNC, Stat & OR Blood vessel tree data Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view
109
109 UNC, Stat & OR Blood vessel tree data Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view
110
110 UNC, Stat & OR Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view Blood vessel tree data
111
111 UNC, Stat & OR Blood vessel tree data Marron’s brain: Segmented from MRA Reconstruct trees in 3d Rotate to view
112
112 UNC, Stat & OR Blood vessel tree data Now look over many people (data objects) Structure of population (understand variation?) PCA in strongly non-Euclidean Space???,...,,
113
113 UNC, Stat & OR Blood vessel tree data Possible focus of analysis: Connectivity structure only (topology) Location, size, orientation of segments Structure within each vessel segment,...,,
114
114 UNC, Stat & OR Blood vessel tree data Present Focus: Topology only Already challenging Later address others Then add attributes To tree nodes And extend analysis
115
115 UNC, Stat & OR Blood vessel tree data The tree team: Very Interdsciplinary Neurosurgery: Bullitt, Ladha Statistics: Wang, Marron Optimization: Aydin, Pataki
116
116 UNC, Stat & OR Blood vessel tree data Recall from above: Marron’s brain: Focus on back Connectivity (topology) only
117
117 UNC, Stat & OR Blood vessel tree data Present Focus: Topology only Raw data as trees Marron’s reduced tree Back tree only
118
118 UNC, Stat & OR Blood vessel tree data Topology only E.g. Back Trees Full Population Study as movie Understand variation?
119
119 UNC, Stat & OR Strongly Non-Euclidean Spaces Statistics on Population of Tree-Structured Data Objects? Mean??? Analog of PCA??? Strongly non-Euclidean, since: Space of trees not a linear space Not even approximately linear (no tangent plane)
120
120 UNC, Stat & OR Mildly Non-Euclidean Spaces Useful View of Manifold Data: Tangent Space Center: Frech é t Mean Reason for terminology “ mildly non Euclidean ”
121
121 UNC, Stat & OR Strongly Non-Euclidean Spaces Mean of Population of Tree-Structured Data Objects? Natural approach: Fr é chet mean Requires a metric (distance) on tree space
122
122 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space? Recall Conventional PCA: Directions that explain structure in data Data are points in point cloud 1-d and 2-d projections allow insights about population structure
123
123 UNC, Stat & OR Illust’n of PCA View: PC1 Projections
124
124 UNC, Stat & OR Illust’n of PCA View: Projections on PC1,2 plane
125
125 UNC, Stat & OR Source Batch Adj: PC 1-3 & DWD direction
126
126 UNC, Stat & OR Source Batch Adj: DWD Source Adjustment
127
127 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space? Key Idea (Jim Ramsay): Replace 1-d subspace that best approximates data By 1-d representation that best approximates data Wang and Marron (2007) define notion of Treeline (in structure space)
128
128 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Best 1-d representation of data Basic idea: From some starting tree Grow only in 1 “direction”
129
129 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Best 1-d representation of data Problem: Hard to compute In particular: to solve optimization problem Wang and Marron (2007) Maximum 4 vessel trees Hard to tackle serious trees (e.g. blood vessel trees)
130
130 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Problem: Hard to compute Solution: Burcu Aydin & Gabor Pataki (linear time algorithm) (based on clever “reformulation” of problem)
131
131 UNC, Stat & OR PCA for blood vessel tree data PCA on Tree Space: Treelines Interesting to compare: Population of Left Trees Population of Right Trees Population of Back Trees And to study 1 st, 2 nd, 3 rd & 4 th treelines
132
132 UNC, Stat & OR PCA for blood vessel tree data Study “Directions” 1, 2, 3, 4 For sub- populations B, L, R (interpret later)
133
133 UNC, Stat & OR PCA for blood vessel tree data Notes on Treeline Directions: PC1 always to left BACK has most variation to right (PC2) LEFT has more varia’n to 2 nd level (PC2) RIGHT has more var’n to 1 st level (PC2) See these in the data?
134
134 UNC, Stat & OR PCA for blood vessel tree data Notes: PC1 – all left PC2: BACK - right LEFT 2 nd lev RIGHT 1 st lev See these??
135
135 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Next represent data as projections Define as closest point in tree line (same as Euclidean PCA) Have corresponding score (length of projection along line) And analog of residual (distance from data point to projection)
136
136 UNC, Stat & OR PCA for blood vessel tree data Individual (each PC separately) Scores Plot
137
137 UNC, Stat & OR PCA for blood vessel tree data Data Analytic Goals: Age, Gender See these? No…
138
138 UNC, Stat & OR PCA for blood vessel tree data Directly study age PC scores PC1 + PC2 - Thickness Not Sig’t - Descendants Left Sig’t
139
139 UNC, Stat & OR Upcoming New Approach Replace Tree-Lines by Tree-Curves:
140
140 UNC, Stat & OR Upcoming New Approach Projections on Tree-Curves:
141
141 UNC, Stat & OR Preliminary Tree-Curve Results First Correlation Of Structure To Age! (Back Trees)
142
142 UNC, Stat & OR Preliminary Tree-Curve Results But does not appear everywhere (Left Trees) Finding locality!
143
143 UNC, Stat & OR HDLSS Asymptotics Why study asymptotics?
144
144 UNC, Stat & OR HDLSS Asymptotics Why study asymptotics? An interesting (naïve) quote: “I don’t look at asymptotics, because I don’t have an infinite sample size”
145
145 UNC, Stat & OR HDLSS Asymptotics Why study asymptotics? An interesting (naïve) quote: “I don’t look at asymptotics, because I don’t have an infinite sample size” Suggested perspective: Asymptotics are a tool for finding simple structure underlying complex entities
146
146 UNC, Stat & OR HDLSS Asymptotics Which asymptotics? n ∞ (classical, very widely done) d ∞ ??? Sensible? Follow typical “sampling process”? Say anything, as noise level increases???
147
147 UNC, Stat & OR HDLSS Asymptotics Which asymptotics? n ∞ & d ∞ n >> d: a few results around (still have classical info in data) n ~ d: random matrices (Iain J., et al) (nothing classically estimable) HDLSS asymptotics: n fixed, d ∞
148
148 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d ∞ Follow typical “sampling process”?
149
149 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d ∞ Follow typical “sampling process”? Microarrays: # genes bounded Proteomics, SNPs, … A moot point, from perspective: Asymptotics are a tool for finding simple structure underlying complex entities
150
150 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d ∞ Say anything, as noise level increases???
151
151 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d ∞ Say anything, as noise level increases??? Yes, there exists simple, perhaps surprising, underlying structure
152
152 UNC, Stat & OR HDLSS Asymptotics: Simple Paradoxes, I For dim’al “Standard Normal” dist’n: Euclidean Distance to Origin (as ): - Data lie roughly on surface of sphere of radius - Yet origin is point of “highest density”??? - Paradox resolved by: “density w. r. t. Lebesgue Measure”
153
153 UNC, Stat & OR HDLSS Asymptotics: Simple Paradoxes, II For dim’al “Standard Normal” dist’n: indep. of Euclidean Dist. between and (as ): Distance tends to non-random constant: Can extend to Where do they all go??? (we can only perceive 3 dim’ns)
154
154 UNC, Stat & OR HDLSS Asymptotics: Simple Paradoxes, III For dim’al “Standard Normal” dist’n: indep. of High dim’al Angles (as ): - -“Everything is orthogonal”??? - Where do they all go??? (again our perceptual limitations) - Again 1st order structure is non-random
155
155 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, I Assume, let Study Subspace Generated by Data a. Hyperplane through 0, of dimension b. Points are “nearly equidistant to 0”, & dist c. Within plane, can “rotate towards Unit Simplex” d. All Gaussian data sets are“near Unit Simplex Vertices”!!! “Randomness” appears only in rotation of simplex With P. Hall & A. Neeman
156
156 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, II Assume, let Study Hyperplane Generated by Data a. dimensional hyperplane b. Points are pairwise equidistant, dist c. Points lie at vertices of “regular hedron” d. Again “randomness in data” is only in rotation e. Surprisingly rigid structure in data?
157
157 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, III Simulation View: shows “rigidity after rotation”
158
158 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, III Straightforward Generalizations: non-Gaussian data: only need moments non-independent: use “mixing conditions” (with P. Hall & A. Neeman) Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi) Mixing Condition on Stand’d & Permuted Var’s (with S. Jung) All based on simple “Laws of Large Numbers”
159
159 UNC, Stat & OR 2 nd Paper on HDLSS Asymptotics Ahn, Marron, Muller & Chi (2007) Biometrika Assume 2 nd Moments (and Gaussian) Assume no eigenvalues too large in sense: For assume i.e. (min possible) (much weaker than previous mixing conditions…)
160
160 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, IV Explanation of Observed (Simulation) Behavior: “everything similar for very high d” 2 popn’s are 2 simplices (i.e. regular n-hedrons) All are same distance from the other class i.e. everything is a support vector i.e. all sensible directions show “data piling” so “sensible methods are all nearly the same” Including 1 - NN
161
161 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, V Further Consequences of Geometric Representation 1. Inefficiency of DWD for uneven sample size (motivates “weighted version”, work in progress) 2. DWD more “stable” than SVM (based on “deeper limiting distributions”) (reflects intuitive idea “feeling sampling variation”) (something like “mean vs. median”) 3. 1-NN rule inefficiency is quantified.
162
162 UNC, Stat & OR HDLSS Math. Stat. of PCA, I Consistency & Strong Inconsistency: Spike Covariance Model (Johnstone & Paul) For Eigenvalues: 1 st Eigenvector: How good are empirical versions, as estimates?
163
163 UNC, Stat & OR HDLSS Math. Stat. of PCA, II Consistency (big enough spike): For, Strong Inconsistency (spike not big enough): For,
164
164 UNC, Stat & OR HDLSS Math. Stat. of PCA, III Consistency of eigenvalues? Eigenvalues Inconsistent But known distribution Unless as well
165
165 UNC, Stat & OR HDLSS Work in Progress, I Batch Adjustment: Xuxin Liu Recall Intuition from above: Key is sizes of biological subtypes Differing ratio trips up mean But DWD more robust Mathematics behind this?
166
166 UNC, Stat & OR Liu: Twiddle ratios of subtypes
167
167 UNC, Stat & OR HDLSS Data Combo Mathematics Xuxin Liu Dissertation Results: Simple Unbalanced Cluster Model Growing at rate as Answers depend on Visualization of setting….
168
168 UNC, Stat & OR HDLSS Data Combo Mathematics
169
169 UNC, Stat & OR HDLSS Data Combo Mathematics
170
170 UNC, Stat & OR HDLSS Data Combo Mathematics Asymptotic Results (as ): For, DWD Consistent Angle(DWD,Truth) For, DWD Strongly Inconsistent Angle(DWD,Truth)
171
171 UNC, Stat & OR HDLSS Data Combo Mathematics Asymptotic Results (as ): For, PAM Inconsistent Angle(PAM,Truth) For, DWD Strongly Inconsistent Angle(PAM,Truth)
172
172 UNC, Stat & OR HDLSS Data Combo Mathematics Value of, for sample size ratio : , only when Otherwise for, PAM Inconsistent Verifies intuitive idea in strong way
173
173 UNC, Stat & OR HDLSS Work in Progress, II Canonical Correlations: Myung Hee Lee Results similar to those for those for PCA Singular values inconsistent But directions converge under a much milder spike assumption.
174
174 UNC, Stat & OR HDLSS Work in Progress, III Conditions for Geo. Rep’n & PCA Consist.: John Kent example: Can only say: not deterministic Conclude: need some flavor of mixing
175
175 UNC, Stat & OR HDLSS Work in Progress, III Conditions for Geo. Rep’n & PCA Consist.: Conclude: need some flavor of mixing Challenge: Classical mixing conditions require notion of time ordering Not always clear, e.g. microarrays
176
176 UNC, Stat & OR HDLSS Work in Progress, III Conditions for Geo. Rep’n & PCA Consist.: Sungkyu Jung Condition: where Define: Assume: Ǝ a permutation, So that is ρ-mixing
177
177 UNC, Stat & OR HDLSS Deep Open Problem In PCA Consistency: Strong Inconsistency - spike Consistency - spike What happens at boundary ( )???
178
178 UNC, Stat & OR The Future of HDLSS Asymptotics? 1. Address your favorite statistical problem… 2. HDLSS versions of classical optimality results? 3. Continguity Approach (~Random Matrices) 4. Rates of convergence? 5. Improved Discrimination Methods? It is early days…
179
179 UNC, Stat & OR The Future of Geometrical Representation? HDLSS version of “optimality” results? “Contiguity” approach? Params depend on d? Rates of Convergence? Improvements of DWD? (e.g. other functions of distance than inverse) It is still early days …
180
180 UNC, Stat & OR Some Carry Away Lessons Atoms of the Analysis: Object Oriented Viewpoint: Object Space Feature Space DWD is attractive for HDLSS classification “Randomness” in HDLSS data is only in rotations (Modulo rotation, have constant simplex shape) How to put HDLSS asymptotics to work?
181
181 UNC, Stat & OR Object Oriented Data Analysis Potential Future Opportunity: OODA SAMSI Program 2010-2011 Interested in joining? Let’s talk
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.