1 UNC, Stat & OR ??? Place ??? Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina January 19, 2016

2 UNC, Stat & OR Interdisciplinary Relationship How does: Statistics Relate to: Mathematics? (probability, optimization, geometry, …)

3 UNC, Stat & OR Statistics - Mathematics Relationship Mathematical Statistics: Validation of existing methods Asymptotics (n → ∞) & Taylor expansion Comparison of existing methods (requires hard math, but really “accounting”???)

4 UNC, Stat & OR Statistics - Mathematics Relationship Suggested New Relationship: Put Mathematics to work to Generate New Statistical Ideas/Approaches (publishable in the Ann. Stat.???)

5 UNC, Stat & OR Personal Opinions on Mathematical Statistics What is Mathematical Statistics? Validation of existing methods Asymptotics (n → ∞) & Taylor expansion Comparison of existing methods (requires hard math, but really “accounting”???)

6 UNC, Stat & OR Personal Opinions on Mathematical Statistics What could Mathematical Statistics be? Basis for invention of new methods Complicated data → mathematical ideas Do we value creativity? Since we don’t do this, others do… (where are the $$$s???)

7 UNC, Stat & OR Personal Opinions on Mathematical Statistics Since we don’t do this, others do… Pattern Recognition Artificial Intelligence Neural Nets Data Mining Machine Learning ???

8 UNC, Stat & OR Personal Opinions on Mathematical Statistics Possible Litmus Test: Creative Statistics; Clinical Trials Viewpoint: Worst Imaginable Idea; Mathematical Statistics Viewpoint: ???

9 UNC, Stat & OR Object Oriented Data Analysis, I What is the “atom” of a statistical analysis? 1st Course: Numbers Multivariate Analysis Course: Vectors Functional Data Analysis: Curves More generally: Data Objects

10 UNC, Stat & OR Functional Data Analysis, I Curves as Data Objects Important Duality: Curve Space ↔ Point Cloud Space Illustrate with Travis Gaydos Graphics 2 dim’al curves (easy to visualize)

11 UNC, Stat & OR Functional Data Analysis, Toy EG I

12 UNC, Stat & OR Functional Data Analysis, Toy EG II

13 UNC, Stat & OR Functional Data Analysis, Toy EG III

14 UNC, Stat & OR Functional Data Analysis, Toy EG IV

15 UNC, Stat & OR Functional Data Analysis, Toy EG V

16 UNC, Stat & OR Functional Data Analysis, Toy EG VI

17 UNC, Stat & OR Functional Data Analysis, Toy EG VII

18 UNC, Stat & OR Functional Data Analysis, Toy EG VIII

19 UNC, Stat & OR Functional Data Analysis, Toy EG IX

20 UNC, Stat & OR Functional Data Analysis, Toy EG X

21 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 1

22 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 1

23 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 2

24 UNC, Stat & OR Functional Data Analysis, 10-d Toy EG 2

25 UNC, Stat & OR Object Oriented Data Analysis, I What is the “atom” of a statistical analysis? 1st Course: Numbers Multivariate Analysis Course: Vectors Functional Data Analysis: Curves More generally: Data Objects

26 UNC, Stat & OR Object Oriented Data Analysis, II Examples: Medical Image Analysis Images as Data Objects? Shape Representations as Objects Micro-arrays for Gene Expression Just multivariate analysis?

27 UNC, Stat & OR Object Oriented Data Analysis, III Typical Goals: Understanding population variation Visualization Principal Component Analysis + Discrimination (a.k.a. Classification) Time Series of Data Objects

28 UNC, Stat & OR Object Oriented Data Analysis, IV Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS) Dimension d >> sample size n “Multivariate Analysis” nearly useless Can’t “normalize the data” Land of Opportunity for Statisticians Need for “creative statisticians”
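
One way to see why classical multivariate tools struggle when d >> n: the sample covariance matrix cannot have full rank, so it cannot be inverted to "normalize" (sphere) the data. The sketch below is a minimal illustration with simulated Gaussian data; the sample sizes, dimensions, and the helper name `covariance_rank` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def covariance_rank(n, d, seed=0):
    """Rank of the d x d sample covariance of n standard normal vectors in R^d."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))      # rows are cases, columns are features
    S = np.cov(X, rowvar=False)          # d x d sample covariance matrix
    return int(np.linalg.matrix_rank(S))

rank_hdlss = covariance_rank(n=10, d=50)      # HDLSS: rank is at most n - 1
rank_classical = covariance_rank(n=200, d=5)  # classical: full rank d
print(rank_hdlss, rank_classical)
```

In the HDLSS case the covariance estimate has at most n - 1 nonzero eigenvalues, so any method that needs its inverse requires some replacement (for example a pseudo-inverse).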

29 UNC, Stat & OR Object Oriented Data Analysis, V Major Statistical Challenge, II: Data may live in non-Euclidean space Lie Group / Symmet’c Spaces (manifold data) Trees/Graphs as data objects Interesting Issues: What is “the mean” (pop’n center)? How do we quantify “pop’n variation”?

30 UNC, Stat & OR Statistics in Image Analysis, I First Generation Problems: Denoising Segmentation Registration (all about single images)

31 UNC, Stat & OR Statistics in Image Analysis, II Second Generation Problems: Populations of Images Understanding Population Variation Discrimination (a.k.a. Classification) Complex Data Structures (& Spaces) HDLSS Statistics

32 UNC, Stat & OR HDLSS Statistics in Imaging Why HDLSS (High Dim, Low Sample Size)? Complex 3-d Objects Hard to Represent Often need d = 100’s of parameters Complex 3-d Objects Costly to Segment Often have n = 10’s cases

33 UNC, Stat & OR Medical Imaging – A Challenging Example Male Pelvis Bladder – Prostate – Rectum How do they move over time (days)? Critical to Radiation Treatment (cancer) Work with 3-d CT Very Challenging to Segment Find boundary of each object? Represent each Object?

34 UNC, Stat & OR Male Pelvis – Raw Data One CT Slice (in 3d image) Coccyx (Tail Bone) Rectum Bladder

35 UNC, Stat & OR Male Pelvis – Raw Data Bladder: manual segmentation Slice by slice Reassembled

36 UNC, Stat & OR Male Pelvis – Raw Data Bladder: Slices: Reassembled in 3d How to represent? Thanks: Ja-Yeon Jeong

37 UNC, Stat & OR Object Representation Landmarks (hard to find) Boundary Rep’ns (no correspondence) Medial representations Find “skeleton” Discretize as “atoms” called M-reps

38 UNC, Stat & OR 3-d m-reps Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong) Medial Atoms provide “skeleton” Implied Boundary from “spokes” → “surface”

39 UNC, Stat & OR 3-d m-reps M-rep model fitting Easy, when starting from binary (blue) But very expensive (30 – 40 minutes technician’s time) Want automatic approach Challenging, because of poor contrast, noise, … Need to borrow information across training sample Use Bayes approach: prior & likelihood → posterior ~Conjugate Gaussians, but there are issues: Major HDLSS challenges Manifold aspect of data

40 UNC, Stat & OR Illuminating Viewpoint Object Space ↔ Feature Space Focus here on collection of data objects Here conceptualize population structure via “point clouds”

41 UNC, Stat & OR Personal HDLSS Viewpoint: Data Images (cases) are “Points” In Feature Space Features are Axes Data set is “Point Clouds” Use Proj’ns to visualize

42 UNC, Stat & OR Personal HDLSS Viewpoint: PCA Rotated Axes Often Insightful One set of Dir’ns Others Useful, too
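
The "rotated axes" view of PCA can be sketched in a few lines: center the point cloud, take a singular value decomposition, and read off the new axes and the projections (scores) onto them. This is a generic textbook sketch on a hypothetical 2-d toy cloud, not the code behind the slides; the function name `pca` and the toy data are assumptions.

```python
import numpy as np

def pca(X):
    """Return the mean, PC directions (as rows), and scores of an n x d data matrix."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center the point cloud
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                              # coordinates in the rotated axes
    return mean, Vt, scores

rng = np.random.default_rng(1)
t = rng.standard_normal(200)
# Hypothetical toy cloud, elongated along the 45-degree line
X = np.column_stack([3 * t, 3 * t]) + 0.3 * rng.standard_normal((200, 2))

mean, directions, scores = pca(X)
pc1 = directions[0]                                 # first rotated axis
print(pc1, scores.var(axis=0))
```

The first direction lines up with the long axis of the cloud, and the scores on different axes are uncorrelated, which is what makes 1-d and 2-d projections on PC directions useful for visualization.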

43 UNC, Stat & OR Cornea Data, I Images as data ~42 Cornea Images Outer surface of eye Heat map of curvature (in radial direction) Hard to understand “population structure”

44 UNC, Stat & OR Cornea Data, II PC 1 Starts at Pop’n Mean Overall Curvature Vertical Astigmatism Correlated! Gaussian Projections Visualization: Can’t Overlay (so use movie)

45 UNC, Stat & OR Cornea Data, III PC 2 Horrible Outlier! (present in data) But look only in center: Steep at top -- bottom Want Robust PCA For HDLSS data ???

46 UNC, Stat & OR Cornea Data, IV Robust PC 2 No outlier impact See top – bottom variation Projections now Gaussian

47 UNC, Stat & OR PCA for m-reps, I Major issue: m-reps live in (locations, radius and angles) E.g. “average” of: = ??? Natural Data Structure is: Lie Groups ~ Symmetric spaces (smooth, curved manifolds)
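
The difficulty with "averaging" angles can be illustrated numerically. The specific angles on the slide did not survive in the transcript, so the values below are hypothetical; the sketch compares the ordinary arithmetic mean with the circular mean (average the unit vectors, then convert back to an angle), which respects the curved geometry of the circle.

```python
import numpy as np

def arithmetic_mean_deg(angles_deg):
    """Ordinary average, treating angles as plain numbers."""
    return float(np.mean(angles_deg))

def circular_mean_deg(angles_deg):
    """Mean direction: average the unit vectors, then convert back to an angle in [0, 360)."""
    theta = np.deg2rad(angles_deg)
    mean_angle = np.arctan2(np.sin(theta).mean(), np.cos(theta).mean())
    return float(np.rad2deg(mean_angle) % 360.0)

angles = [2.0, 3.0, 358.0, 359.0]     # hypothetical angles clustered around 0 degrees
naive = arithmetic_mean_deg(angles)   # lands on the opposite side of the circle
circ = circular_mean_deg(angles)      # stays near 0 degrees
print(naive, circ)
```

The arithmetic mean points almost exactly away from the data, which is the kind of failure that motivates treating such data as living on a manifold.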

48 UNC, Stat & OR PCA for m-reps, II PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces) T. Fletcher: Principal Geodesic Analysis Idea: replace “linear summary of data” With “geodesic summary of data”…

49–51 UNC, Stat & OR PGA for m-reps, Bladder-Prostate-Rectum Bladder – Prostate – Rectum, 1 person, 17 days PG 1 PG 2 PG 3 (analysis by Ja Yeon Jeong)

52 UNC, Stat & OR HDLSS Classification (i.e. Discrimination) Background: Two Class (Binary) version: Using “training data” from Class +1, and from Class -1 Develop a “rule” for assigning new data to a Class Canonical Example: Disease Diagnosis New Patients are “Healthy” or “Ill” Determined based on measurements

53 UNC, Stat & OR HDLSS Classification (Cont.) Ineffective Methods: Fisher Linear Discrimination Gaussian Likelihood Ratio Less Useful Methods: Nearest Neighbors Neural Nets (“black boxes”, no “directions” or intuition)

54 UNC, Stat & OR HDLSS Classification (Cont.) Currently Fashionable Methods: Support Vector Machines Trees Based Approaches New High Tech Method Distance Weighted Discrimination (DWD) Specially designed for HDLSS data Avoids “data piling” problem of SVM Solves more suitable optimization problem

55 UNC, Stat & OR HDLSS Classification (Cont.) Currently Fashionable Methods: Trees Based Approaches Support Vector Machines:

56 UNC, Stat & OR Kernel Embedding Idea Aizerman, Braverman, Rozoner (1964) Make data linearly separable by embedding in higher dimensional space

57–60 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions

61 UNC, Stat & OR Kernel Embedding Idea Linearly separable by embedding in higher dimensions Distributional Assumptions in Embedded Space? ⇒ Support Vector Machine
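
A small sketch of the embedding idea, assuming a hypothetical "donut" layout (one class on an inner ring, the other on an outer ring): no line separates the classes in the plane, but after adding the squared radius as a third coordinate a plane does. The data, the embedding map, and the separating plane z = 4 are all illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """n noisy points near a circle of the given radius."""
    angle = rng.uniform(0, 2 * np.pi, n)
    r = radius + 0.1 * rng.standard_normal(n)
    return np.column_stack([r * np.cos(angle), r * np.sin(angle)])

inner, outer = ring(1.0, 100), ring(3.0, 100)   # two classes, not linearly separable in 2-d

def embed(X):
    """Polynomial embedding (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([X, (X ** 2).sum(axis=1)])

# In the embedded 3-d space the plane z = 4 is a linear separating rule.
w, b = np.array([0.0, 0.0, 1.0]), -4.0
inner_side = embed(inner) @ w + b    # negative for the inner ring
outer_side = embed(outer) @ w + b    # positive for the outer ring
print(inner_side.max(), outer_side.min())
```

Any linear method (FLD, SVM, DWD) applied in the embedded space then yields a non-linear rule in the original space.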

62 UNC, Stat & OR HDLSS Classification (Cont.) Comparison of Linear Methods (toy data): Optimal Direction Excellent, but need dir’n in dim = 50 Maximal Data Piling (J. Y. Ahn, D. Peña) Great separation, but generalizability??? Support Vector Machine More separation, gen’ity, but some data piling? Distance Weighted Discrimination Avoids data piling, good gen’ity, Gaussians?
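
The "great separation" of Maximal Data Piling can be reproduced on toy data. The sketch below assumes the usual form of the MDP direction, the pseudo-inverse of the global sample covariance applied to the difference of class means; the simulated data, sizes, and the function name `mdp_direction` are illustrative assumptions. When d > n, every training point of a class projects to the same value.

```python
import numpy as np

def mdp_direction(X, y):
    """Maximal data piling direction for an n x d matrix X (d > n) and labels y in {+1, -1}."""
    mean_diff = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    Xc = X - X.mean(axis=0)                           # center at the global mean
    total_cov = Xc.T @ Xc / (len(X) - 1)              # global covariance, singular when d > n
    w = np.linalg.pinv(total_cov, rcond=1e-10) @ mean_diff   # pseudo-inverse replaces the inverse
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
n_per_class, d = 10, 50
X = rng.standard_normal((2 * n_per_class, d))
y = np.repeat([1, -1], n_per_class)
X[y == 1, 0] += 2.0                                   # hypothetical shift of class +1

proj = X @ mdp_direction(X, y)
spread_plus, spread_minus = proj[y == 1].std(), proj[y == -1].std()
gap = proj[y == 1].mean() - proj[y == -1].mean()
print(spread_plus, spread_minus, gap)
```

Both within-class spreads are zero up to rounding while the gap between classes stays positive. That perfect piling on training data is why generalizability is in question.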

63 UNC, Stat & OR Distance Weighted Discrimination Maximal Data Piling

64 UNC, Stat & OR Distance Weighted Discrimination Based on Optimization Problem: More precisely work in appropriate penalty for violations Optimization Method (Michael Todd): Second Order Cone Programming Still Convex gen’tion of quadratic prog’ing Fast greedy solution Can use existing software
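
The optimization problem itself did not survive in the transcript. As a hedged reconstruction, the DWD problem is usually stated in the literature in the following form, where the notation (training pairs $(x_i, y_i)$ with $y_i \in \{+1, -1\}$, direction $w$, offset $b$, violations $\xi_i$, penalty constant $C$) is assumed here rather than recovered from the slide:

```latex
\[
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \sum_{i=1}^{n} \frac{1}{r_i} \;+\; C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & r_i = y_i\,(x_i^{\top} w + b) + \xi_i, \qquad r_i \ge 0, \qquad \xi_i \ge 0, \qquad \|w\| \le 1 .
\end{aligned}
\]
```

Here $r_i$ is the (perturbed) signed distance of case $i$ from the separating hyperplane. Because every case contributes through $1/r_i$, all data points influence the direction, in contrast to the SVM where only the support vectors do. The $C \sum \xi_i$ term is the penalty for violations, and the norm constraint on $w$ is what makes this a second order cone program.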

65 UNC, Stat & OR Simulation Comparison E.G. Above Gaussians: Wide array of dim’s SVM Subst’ly worse MD – Bayes Optimal DWD close to MD

66 UNC, Stat & OR Simulation Comparison E.G. Outlier Mixture: Disaster for MD SVM & DWD much more solid Dir’ns are “robust” SVM & DWD similar

67 UNC, Stat & OR Simulation Comparison E.G. Wobble Mixture: Disaster for MD SVM less good DWD slightly better Note: All methods come together for larger d ???

68 UNC, Stat & OR DWD Bias Adjustment for Microarrays Microarray data: Simult. Measur’ts of “gene expression” Intrinsically HDLSS Dimension d ~ 1,000s – 10,000s Sample Sizes n ~ 10s – 100s My view: Each array is “point in cloud”

69 UNC, Stat & OR DWD Batch and Source Adjustment For Perou’s Stanford Breast Cancer Data Analysis in Benito et al. (2004) Bioinformatics Adjust for Source Effects Different sources of mRNA Adjust for Batch Effects Arrays fabricated at different times
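
The adjustment step can be sketched as follows: given a direction separating two batches, shift each batch along that direction so that its projected mean becomes zero. Computing the actual DWD direction requires a cone-programming solver, so this sketch substitutes the simple mean-difference direction as a stand-in; that substitution, the simulated data, and the function name `adjust_along_direction` are assumptions for illustration only.

```python
import numpy as np

def adjust_along_direction(X, batch, direction):
    """Shift each batch along `direction` so that its projected mean becomes zero."""
    u = direction / np.linalg.norm(direction)
    X_adj = X.astype(float).copy()
    for label in np.unique(batch):
        rows = batch == label
        mean_proj = (X[rows] @ u).mean()      # batch mean of the projections onto u
        X_adj[rows] -= mean_proj * u          # move the batch back along u
    return X_adj

rng = np.random.default_rng(0)
n, d = 15, 200
X = rng.standard_normal((2 * n, d))
batch = np.repeat([0, 1], n)
X[batch == 1] += 0.5                          # hypothetical batch effect

# Stand-in for the DWD direction (assumption): difference of batch means
direction = X[batch == 1].mean(axis=0) - X[batch == 0].mean(axis=0)
u = direction / np.linalg.norm(direction)
X_adj = adjust_along_direction(X, batch, direction)

gap_before = (X[batch == 1] @ u).mean() - (X[batch == 0] @ u).mean()
gap_after = (X_adj[batch == 1] @ u).mean() - (X_adj[batch == 0] @ u).mean()
print(gap_before, gap_after)
```

Only the component along the chosen direction changes; variation orthogonal to it, which is where the biology is hoped to live, is left untouched.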

70 UNC, Stat & OR DWD Adj: Raw Breast Cancer data

71 UNC, Stat & OR DWD Adj: Source Colors

72 UNC, Stat & OR DWD Adj: Batch Colors

73 UNC, Stat & OR DWD Adj: Biological Class Colors

74 UNC, Stat & OR DWD Adj: Biological Class Colors & Symbols

75 UNC, Stat & OR DWD Adj: Biological Class Symbols

76 UNC, Stat & OR DWD Adj: Source Colors

77 UNC, Stat & OR DWD Adj: PC 1-2 & DWD direction

78 UNC, Stat & OR DWD Adj: DWD Source Adjustment

79 UNC, Stat & OR DWD Adj: Source Adj’d, PCA view

80 UNC, Stat & OR DWD Adj: Source Adj’d, Class Colored

81 UNC, Stat & OR DWD Adj: Source Adj’d, Batch Colored

82 UNC, Stat & OR DWD Adj: Source Adj’d, 5 PCs

83 UNC, Stat & OR DWD Adj: S. Adj’d, Batch 1,2 vs. 3 DWD

84 UNC, Stat & OR DWD Adj: S. & B1,2 vs. 3 Adjusted

85 UNC, Stat & OR DWD Adj: S. & B1,2 vs. 3 Adj’d, 5 PCs

86 UNC, Stat & OR DWD Adj: S. & B Adj’d, B1 vs. 2 DWD

87 UNC, Stat & OR DWD Adj: S. & B Adj’d, B1 vs. 2 Adj’d

88 UNC, Stat & OR DWD Adj: S. & B Adj’d, 5 PC view

89 UNC, Stat & OR DWD Adj: S. & B Adj’d, 4 PC view

90 UNC, Stat & OR DWD Adj: S. & B Adj’d, Class Colors

91 UNC, Stat & OR DWD Adj: S. & B Adj’d, Adj’d PCA

92 UNC, Stat & OR DWD Bias Adjustment for Microarrays Effective for Batch and Source Adj. Also works for cross-platform Adj. E.g. cDNA & Affy Despite literature claiming contrary “Gene by Gene” vs. “Multivariate” views Funded as part of caBIG “Cancer BioInformatics Grid” “Data Combination Effort” of NCI

93 UNC, Stat & OR Interesting Benchmark Data Set NCI 60 Cell Lines Interesting benchmark, since same cells Data Web available: Both cDNA and Affymetrix Platforms 8 Major cancer subtypes Use DWD now for visualization

94 UNC, Stat & OR NCI 60: Fully Adjusted Data, Leukemia Cluster LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR

95 UNC, Stat & OR NCI 60: Views using DWD Dir’ns (focus on biology)

96 UNC, Stat & OR Why not adjust by means? DWD is complicated: value added? Xuxin Liu example… Key is sizes of biological subtypes Differing ratio trips up mean But DWD more robust (although still not perfect)

97 UNC, Stat & OR Twiddle ratios of subtypes

98 UNC, Stat & OR DWD in Face Recognition, I Face Images as Data (with M. Benito & D. Peña) Registered using landmarks Male – Female Difference? Discrimination Rule?

99 UNC, Stat & OR DWD in Face Recognition, II DWD Direction Good separation Images “make sense” Garbage at ends? (extrapolation effects?)

100 UNC, Stat & OR DWD in Face Recognition, III Interesting summary: Jump between means (in DWD direction) Clear separation of Maleness vs. Femaleness

101 UNC, Stat & OR DWD in Face Recognition, IV Fun Comparison: Jump between means (in SVM direction) Also distinguishes Maleness vs. Femaleness But not as well as DWD

102 UNC, Stat & OR DWD in Face Recognition, V Analysis of difference: Project onto normals SVM has “small gap” (feels noise artifacts?) DWD “more informative” (feels real structure?)

103 UNC, Stat & OR DWD in Face Recognition, VI Current Work: Focus on “drivers”: (regions of interest) Relation to Discr’n? Which is “best”? Lessons for human perception?

104 UNC, Stat & OR Time Series of Curves Chemical Spectra, evolving over time (with J. Wendelberger & E. Kober) Mortality curves changing in time (with Andres Alonzo)

105 UNC, Stat & OR Discrimination for m-reps Classification for Lie Groups – Symm. Spaces S. K. Sen, S. Joshi & M. Foskey What is “separating plane” (for SVM-DWD)?

106–111 UNC, Stat & OR Blood vessel tree data Marron’s brain: Segmented from MRA; Reconstruct trees in 3d; Rotate to view

112 UNC, Stat & OR Blood vessel tree data Now look over many people (data objects) Structure of population (understand variation?) PCA in strongly non-Euclidean Space???

113 UNC, Stat & OR Blood vessel tree data Possible focus of analysis: Connectivity structure only (topology) Location, size, orientation of segments Structure within each vessel segment

114 UNC, Stat & OR Blood vessel tree data Present Focus: Topology only; Already challenging; Later address others; Then add attributes to tree nodes; And extend analysis

115 UNC, Stat & OR Blood vessel tree data The tree team: Very Interdisciplinary; Neurosurgery: Bullitt, Ladha; Statistics: Wang, Marron; Optimization: Aydin, Pataki

116 UNC, Stat & OR Blood vessel tree data Recall from above: Marron’s brain: Focus on back; Connectivity (topology) only

117 UNC, Stat & OR Blood vessel tree data Present Focus: Topology only; Raw data as trees; Marron’s reduced tree; Back tree only

118 UNC, Stat & OR Blood vessel tree data Topology only E.g. Back Trees Full Population Study as movie Understand variation?

119 UNC, Stat & OR Strongly Non-Euclidean Spaces Statistics on Population of Tree-Structured Data Objects? Mean??? Analog of PCA??? Strongly non-Euclidean, since: Space of trees not a linear space Not even approximately linear (no tangent plane)

120 UNC, Stat & OR Mildly Non-Euclidean Spaces Useful View of Manifold Data: Tangent Space Center: Fréchet Mean Reason for terminology “mildly non-Euclidean”

121 UNC, Stat & OR Strongly Non-Euclidean Spaces Mean of Population of Tree-Structured Data Objects? Natural approach: Fréchet mean Requires a metric (distance) on tree space
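
A Fréchet mean only needs a distance: it is the point minimizing the sum of squared distances to the data. The sketch below makes two simplifying assumptions that are not from the slides: trees are encoded as sets of node labels (topology only), with the distance counting nodes present in exactly one tree, and the search is restricted to the sample trees themselves rather than all of tree space.

```python
def tree_distance(t1, t2):
    """Number of nodes present in exactly one of the two trees (symmetric difference)."""
    return len(t1 ^ t2)

def frechet_mean(candidates, sample, dist):
    """Candidate minimizing the sum of squared distances to the sample."""
    return min(candidates, key=lambda c: sum(dist(c, x) ** 2 for x in sample))

# Binary trees as sets of node labels: "" is the root, "L" and "R" its children, and so on.
sample = [
    frozenset({"", "L"}),
    frozenset({"", "L", "R"}),
    frozenset({"", "L", "R", "LL"}),
    frozenset({"", "L", "R", "RL"}),
]
center = frechet_mean(sample, sample, tree_distance)
print(sorted(center))
```

In this toy sample the tree with nodes root, L, and R is the most central one. No vector-space operations (addition, scaling) were needed, which is the point of the Fréchet approach.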

122 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space? Recall Conventional PCA: Directions that explain structure in data Data are points in point cloud 1-d and 2-d projections allow insights about population structure

123 UNC, Stat & OR Illust’n of PCA View: PC1 Projections

124 UNC, Stat & OR Illust’n of PCA View: Projections on PC1,2 plane

125 UNC, Stat & OR Source Batch Adj: PC 1-3 & DWD direction

126 UNC, Stat & OR Source Batch Adj: DWD Source Adjustment

127 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space? Key Idea (Jim Ramsay): Replace 1-d subspace that best approximates data By 1-d representation that best approximates data Wang and Marron (2007) define notion of Treeline (in structure space)

128 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Best 1-d representation of data Basic idea: From some starting tree Grow only in 1 “direction”

129 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Best 1-d representation of data Problem: Hard to compute In particular: to solve optimization problem Wang and Marron (2007) Maximum 4 vessel trees Hard to tackle serious trees (e.g. blood vessel trees)

130 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Problem: Hard to compute Solution: Burcu Aydin & Gabor Pataki (linear time algorithm) (based on clever “reformulation” of problem)

131 UNC, Stat & OR PCA for blood vessel tree data PCA on Tree Space: Treelines Interesting to compare: Population of Left Trees Population of Right Trees Population of Back Trees And to study 1st, 2nd, 3rd & 4th treelines

132 UNC, Stat & OR PCA for blood vessel tree data Study “Directions” 1, 2, 3, 4 For sub-populations B, L, R (interpret later)

133 UNC, Stat & OR PCA for blood vessel tree data Notes on Treeline Directions: PC1 always to left BACK has most variation to right (PC2) LEFT has more varia’n to 2nd level (PC2) RIGHT has more var’n to 1st level (PC2) See these in the data?

134 UNC, Stat & OR PCA for blood vessel tree data Notes: PC1 – all left PC2: BACK - right LEFT 2nd lev RIGHT 1st lev See these??

135 UNC, Stat & OR Strongly Non-Euclidean Spaces PCA on Tree Space: Treeline Next represent data as projections Define as closest point in tree line (same as Euclidean PCA) Have corresponding score (length of projection along line) And analog of residual (distance from data point to projection)
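
Projection, score, and residual on a treeline can be sketched with the same set encoding of trees (an assumption of this illustration, with distance equal to the number of nodes present in exactly one tree). A treeline is modeled here as a nested sequence of trees that starts from a given tree and adds one node at a time in a single direction; the specific treeline and data tree are hypothetical.

```python
def treeline(start, path):
    """Nested trees: start, start + path[0], start + path[0] + path[1], ..."""
    line = [frozenset(start)]
    for node in path:
        line.append(line[-1] | {node})
    return line

def project(tree, line):
    """Closest treeline member, its position on the line (score), and the leftover distance (residual)."""
    score = min(range(len(line)), key=lambda k: len(tree ^ line[k]))
    return line[score], score, len(tree ^ line[score])

# Hypothetical treeline growing to the left from the starting tree {root, L}
line = treeline({"", "L"}, ["LL", "LLL", "LLLL"])
tree = frozenset({"", "L", "R", "LL", "LLL"})

point, score, residual = project(tree, line)
print(sorted(point), score, residual)
```

The data tree projects to the third member of the line (score 2), and the single node "R" that the treeline cannot explain shows up as a residual of 1.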

136 UNC, Stat & OR PCA for blood vessel tree data Individual (each PC separately) Scores Plot

137 UNC, Stat & OR PCA for blood vessel tree data Data Analytic Goals: Age, Gender See these? No…

138 UNC, Stat & OR PCA for blood vessel tree data Directly study age ↔ PC scores PC1 + PC2 - Thickness Not Sig’t - Descendants Left Sig’t

139 UNC, Stat & OR Upcoming New Approach Replace Tree-Lines by Tree-Curves:

140 UNC, Stat & OR Upcoming New Approach Projections on Tree-Curves:

141 UNC, Stat & OR Preliminary Tree-Curve Results First Correlation Of Structure To Age! (Back Trees)

142 UNC, Stat & OR Preliminary Tree-Curve Results But does not appear everywhere (Left Trees) Finding locality!

143 UNC, Stat & OR HDLSS Asymptotics Why study asymptotics?

144 UNC, Stat & OR HDLSS Asymptotics Why study asymptotics? An interesting (naïve) quote: “I don’t look at asymptotics, because I don’t have an infinite sample size”

145 UNC, Stat & OR HDLSS Asymptotics Why study asymptotics? An interesting (naïve) quote: “I don’t look at asymptotics, because I don’t have an infinite sample size” Suggested perspective: Asymptotics are a tool for finding simple structure underlying complex entities

146 UNC, Stat & OR HDLSS Asymptotics Which asymptotics? n → ∞ (classical, very widely done); d → ∞ ??? Sensible? Follow typical “sampling process”? Say anything, as noise level increases???

147 UNC, Stat & OR HDLSS Asymptotics Which asymptotics? n → ∞ & d → ∞: n >> d: a few results around (still have classical info in data); n ~ d: random matrices (Iain J., et al) (nothing classically estimable); HDLSS asymptotics: n fixed, d → ∞

148 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d → ∞ Follow typical “sampling process”?

149 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d → ∞ Follow typical “sampling process”? Microarrays: # genes bounded; Proteomics, SNPs, …; A moot point, from perspective: Asymptotics are a tool for finding simple structure underlying complex entities

150 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d → ∞ Say anything, as noise level increases???

151 UNC, Stat & OR HDLSS Asymptotics HDLSS asymptotics: n fixed, d → ∞ Say anything, as noise level increases??? Yes, there exists simple, perhaps surprising, underlying structure

152 UNC, Stat & OR HDLSS Asymptotics: Simple Paradoxes, I For $d$ dim’al “Standard Normal” dist’n, $Z \sim N_d(0, I_d)$: Euclidean Distance to Origin (as $d \to \infty$): $\|Z\| = \sqrt{d} + O_p(1)$ - Data lie roughly on surface of sphere of radius $\sqrt{d}$ - Yet origin is point of “highest density”??? - Paradox resolved by: “density w. r. t. Lebesgue Measure”

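153 UNC, Stat & OR HDLSS Asymptotics: Simple Paradoxes, II For $d$ dim’al “Standard Normal” dist’n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$ Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$): Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$ Can extend to $Z_1, \ldots, Z_n$ Where do they all go??? (we can only perceive 3 dim’ns)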

154 UNC, Stat & OR HDLSS Asymptotics: Simple Paradoxes, III For $d$ dim’al “Standard Normal” dist’n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$ High dim’al Angles (as $d \to \infty$): - $\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$ - “Everything is orthogonal”??? - Where do they all go??? (again our perceptual limitations) - Again 1st order structure is non-random
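
The three paradoxes are easy to check by simulation. The sketch below draws two independent standard normal vectors in a high dimension (the value d = 10,000 and the seed are arbitrary choices) and compares the norm, the pairwise distance, and the angle with their non-random first-order values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
z1, z2 = rng.standard_normal(d), rng.standard_normal(d)

norm_ratio = np.linalg.norm(z1) / np.sqrt(d)            # near 1: radius is about sqrt(d)
dist_ratio = np.linalg.norm(z1 - z2) / np.sqrt(2 * d)   # near 1: distance is about sqrt(2d)
cos_angle = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))
angle_deg = float(np.degrees(np.arccos(cos_angle)))     # near 90 degrees
print(norm_ratio, dist_ratio, angle_deg)
```

Rerunning with different seeds changes the vectors completely but barely changes these three summaries, which is the sense in which the first-order structure is non-random.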

155 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, I Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$ Study Subspace Generated by Data a. Hyperplane through 0, of dimension $n$ b. Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$ c. Within plane, can “rotate towards Unit Simplex” d. All Gaussian data sets are “near Unit Simplex Vertices”!!! “Randomness” appears only in rotation of simplex With P. Hall & A. Neeman

156 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, II Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$ Study Hyperplane Generated by Data a. $n - 1$ dimensional hyperplane b. Points are pairwise equidistant, dist $\sim \sqrt{2d}$ c. Points lie at vertices of “regular $n$-hedron” d. Again “randomness in data” is only in rotation e. Surprisingly rigid structure in data?
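
The rigid simplex structure can be seen numerically through the scaled Gram matrix of the data: with n fixed and d large it is close to the identity, meaning the points have nearly equal length and are nearly orthogonal, so all pairwise distances nearly agree. The values n = 5 and d = 20,000 below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20_000
Z = rng.standard_normal((n, d))          # n Gaussian data vectors in dimension d

gram = Z @ Z.T / d                       # n x n, approaches the identity as d grows
diffs = Z[:, None, :] - Z[None, :, :]    # all pairwise differences
pair_dist = np.linalg.norm(diffs, axis=2)[np.triu_indices(n, k=1)] / np.sqrt(2 * d)
print(np.round(gram, 2))
print(np.round(pair_dist, 3))            # all close to 1
```

Up to a rotation, the data set therefore looks like the vertices of a regular simplex scaled by roughly the square root of d.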

157 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, III Simulation View: shows “rigidity after rotation”

158 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, III Straightforward Generalizations: • non-Gaussian data: only need moments • non-independent: use “mixing conditions” (with P. Hall & A. Neeman) • Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi) • Mixing Condition on Stand’d & Permuted Var’s (with S. Jung) All based on simple “Laws of Large Numbers”
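As a worked illustration of how a law of large numbers produces the geometric representation, consider the simplest case of independent standard normal entries. This is an assumption made here for concreteness, and the coordinate notation $Z_{1j}$, $Z_{2j}$ is introduced only for this sketch. Averaging over the $d$ coordinates gives:

```latex
\begin{align*}
\frac{1}{d}\,\|Z_1\|^2 &= \frac{1}{d}\sum_{j=1}^{d} Z_{1j}^2
  \;\xrightarrow{P}\; E\!\left[Z_{11}^2\right] = 1, \\
\frac{1}{d}\,\|Z_1 - Z_2\|^2 &= \frac{1}{d}\sum_{j=1}^{d} (Z_{1j} - Z_{2j})^2
  \;\xrightarrow{P}\; \operatorname{Var}(Z_{11} - Z_{21}) = 2, \\
\frac{1}{d}\,Z_1^{\top} Z_2 &= \frac{1}{d}\sum_{j=1}^{d} Z_{1j} Z_{2j}
  \;\xrightarrow{P}\; E\!\left[Z_{11} Z_{21}\right] = 0 .
\end{align*}
```

The first limit gives distance to the origin of order $\sqrt{d}$, the second gives pairwise distance of order $\sqrt{2d}$, and the third, combined with the first, gives a cosine tending to 0, that is, angles tending to 90 degrees.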

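159 UNC, Stat & OR 2nd Paper on HDLSS Asymptotics Ahn, Marron, Muller & Chi (2007) Biometrika • Assume 2nd Moments (and Gaussian) • Assume no eigenvalues too large in sense: For $\epsilon = \dfrac{\left(\sum_{j=1}^{d} \lambda_j\right)^2}{d \sum_{j=1}^{d} \lambda_j^2}$ assume $\epsilon^{-1} = o(d)$, i.e. $\epsilon \gg 1/d$ (min possible) (much weaker than previous mixing conditions…)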

160 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, IV Explanation of Observed (Simulation) Behavior: “everything similar for very high d” • 2 popn’s are 2 simplices (i.e. regular n-hedrons) • All are same distance from the other class • i.e. everything is a support vector • i.e. all sensible directions show “data piling” • so “sensible methods are all nearly the same” • Including 1-NN

161 UNC, Stat & OR HDLSS Asy’s: Geometrical Representation, V Further Consequences of Geometric Representation 1. Inefficiency of DWD for uneven sample size (motivates “weighted version”, work in progress) 2. DWD more “stable” than SVM (based on “deeper limiting distributions”) (reflects intuitive idea “feeling sampling variation”) (something like “mean vs. median”) 3. 1-NN rule inefficiency is quantified.

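162 UNC, Stat & OR HDLSS Math. Stat. of PCA, I Consistency & Strong Inconsistency: Spike Covariance Model (Johnstone & Paul) For Eigenvalues: $\lambda_1 = d^{\alpha}$, $\lambda_2 = \cdots = \lambda_d = 1$ 1st Eigenvector: $u_1$ How good are empirical versions, $\hat{\lambda}_1, \ldots, \hat{\lambda}_d$, $\hat{u}_1$, as estimates?

163 UNC, Stat & OR HDLSS Math. Stat. of PCA, II Consistency (big enough spike): For $\alpha > 1$, $\mathrm{Angle}(u_1, \hat{u}_1) \to 0$ Strong Inconsistency (spike not big enough): For $\alpha < 1$, $\mathrm{Angle}(u_1, \hat{u}_1) \to 90^\circ$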
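The consistency dichotomy can be explored numerically. The sketch below rests on several assumptions that are not spelled out in the extracted slides: the spike model is taken to have first eigenvalue $d^{\alpha}$ in the direction of the first coordinate axis and all other eigenvalues equal to 1, the data are Gaussian with known mean zero (so no centering is done), and the leading sample eigenvector is approximated by power iteration on the $n \times n$ Gram matrix. The function name `first_pc_angle` and all parameter values are illustrative.

```python
import math
import random


def first_pc_angle(d, alpha, n=10, n_iter=500, seed=0):
    """Angle (degrees) between the first sample eigenvector and the spike direction e_1."""
    rng = random.Random(seed)
    spike_sd = math.sqrt(d ** alpha)
    # Rows are observations: coordinate 0 has variance d**alpha, the rest variance 1.
    data = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]
        row[0] *= spike_sd
        data.append(row)
    # The n x n Gram matrix X X^T shares its nonzero eigenvalues with X^T X.
    gram = [[sum(a * b for a, b in zip(x, y)) for y in data] for x in data]
    # Power iteration approximates the leading eigenvector of the Gram matrix.
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(n_iter):
        w = [sum(g * vj for g, vj in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Map back to feature space: the sample eigenvector is proportional to X^T v.
    u = [sum(v[i] * data[i][j] for i in range(n)) for j in range(d)]
    u_norm = math.sqrt(sum(x * x for x in u))
    cosine = min(1.0, abs(u[0]) / u_norm)
    return math.degrees(math.acos(cosine))


for alpha in (0.3, 1.7):
    print(f"alpha={alpha}: angle to true direction = {first_pc_angle(2000, alpha):.1f} deg")
```

Working with the Gram matrix keeps the eigenproblem at size $n$ rather than $d$, which is the natural choice when $d \gg n$. At finite $d$ the angles only approximate the limiting values of 0 and 90 degrees.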

164 UNC, Stat & OR HDLSS Math. Stat. of PCA, III Consistency of eigenvalues? • Eigenvalues Inconsistent • But known distribution • Unless $n \to \infty$ as well

165 UNC, Stat & OR HDLSS Work in Progress, I Batch Adjustment: Xuxin Liu Recall Intuition from above: • Key is sizes of biological subtypes • Differing ratio trips up mean • But DWD more robust Mathematics behind this?

166 UNC, Stat & OR Liu: Twiddle ratios of subtypes

167 UNC, Stat & OR HDLSS Data Combo Mathematics Xuxin Liu Dissertation Results: • Simple Unbalanced Cluster Model • Growing at rate as $d \to \infty$ • Answers depend on Visualization of setting…

168 UNC, Stat & OR HDLSS Data Combo Mathematics

169 UNC, Stat & OR HDLSS Data Combo Mathematics

170 UNC, Stat & OR HDLSS Data Combo Mathematics Asymptotic Results (as $d \to \infty$): • For, DWD Consistent Angle(DWD,Truth) $\to 0$ • For, DWD Strongly Inconsistent Angle(DWD,Truth) $\to 90^\circ$

171 UNC, Stat & OR HDLSS Data Combo Mathematics Asymptotic Results (as $d \to \infty$): • For, PAM Inconsistent Angle(PAM,Truth) • For, PAM Strongly Inconsistent Angle(PAM,Truth) $\to 90^\circ$

172 UNC, Stat & OR HDLSS Data Combo Mathematics Value of, for sample size ratio : , only when • Otherwise for, PAM Inconsistent • Verifies intuitive idea in strong way

173 UNC, Stat & OR HDLSS Work in Progress, II Canonical Correlations: Myung Hee Lee • Results similar to those for PCA • Singular values inconsistent • But directions converge under a much milder spike assumption.

174 UNC, Stat & OR HDLSS Work in Progress, III Conditions for Geo. Rep’n & PCA Consist.: John Kent example: Can only say: not deterministic Conclude: need some flavor of mixing

175 UNC, Stat & OR HDLSS Work in Progress, III Conditions for Geo. Rep’n & PCA Consist.: Conclude: need some flavor of mixing Challenge: Classical mixing conditions require notion of time ordering Not always clear, e.g. microarrays

176 UNC, Stat & OR HDLSS Work in Progress, III Conditions for Geo. Rep’n & PCA Consist.: Sungkyu Jung Condition: where Define: Assume: ∃ a permutation, So that is ρ-mixing

177 UNC, Stat & OR HDLSS Deep Open Problem In PCA Consistency (first eigenvalue of order $d^{\alpha}$): • Strong Inconsistency - spike $\alpha < 1$ • Consistency - spike $\alpha > 1$ What happens at boundary ($\alpha = 1$)???

178 UNC, Stat & OR The Future of HDLSS Asymptotics? 1. Address your favorite statistical problem… 2. HDLSS versions of classical optimality results? 3. Contiguity Approach (~Random Matrices) 4. Rates of convergence? 5. Improved Discrimination Methods? It is early days…

179 UNC, Stat & OR The Future of Geometrical Representation? HDLSS version of “optimality” results? “Contiguity” approach? Params depend on d? Rates of Convergence? Improvements of DWD? (e.g. other functions of distance than inverse) It is still early days …

180 UNC, Stat & OR Some Carry Away Lessons Atoms of the Analysis: Object Oriented Viewpoint: Object Space ↔ Feature Space DWD is attractive for HDLSS classification “Randomness” in HDLSS data is only in rotations (Modulo rotation, have constant simplex shape) How to put HDLSS asymptotics to work?

181 UNC, Stat & OR Object Oriented Data Analysis Potential Future Opportunity: OODA SAMSI Program Interested in joining? Let’s talk