Statistics – O. R. 892 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina.


Administrative Info Details on Course Web Page Or: –Google: “Marron Courses” –Choose This Course Go Through These

Who are we? Varying Levels of Expertise –2nd Year Graduate Students –… –Faculty Level Researchers Various Backgrounds –Statistics –Computer Science –Imaging –Bioinformatics –Pharmacy –Others?

Course Expectations Grading Based on: “Participant Presentations” 5 – 10 minute talks By Enrolled Students Hopefully Others

Class Meeting Style When you don’t understand something Many others probably join you So please fire away with questions Discussion usually enlightening for others If needed, I’ll tell you to shut up (essentially never happens)

Object Oriented Data Analysis What is it? A Sound-Bite Explanation: What is the “atom of the statistical analysis”? 1st Course: Numbers Multivariate Analysis Course: Vectors Functional Data Analysis: Curves

Functional Data Analysis Active new field in statistics, see: Ramsay, J. O. & Silverman, B. W. (2005) Functional Data Analysis, 2nd Edition, Springer, N.Y. Ramsay, J. O. & Silverman, B. W. (2002) Applied Functional Data Analysis, Springer, N.Y. Ramsay, J. O. (2005) Functional Data Analysis Web Site,

Object Oriented Data Analysis What is it? A Sound-Bite Explanation: What is the “atom of the statistical analysis”? 1st Course: Numbers Multivariate Analysis Course: Vectors Functional Data Analysis: Curves More generally: Data Objects

Object Oriented Data Analysis Nomenclature Clash? Computer Science View: Object Oriented Programming: Programming that supports encapsulation, inheritance, and polymorphism (from Google: define object oriented programming, my favorite:

Object Oriented Data Analysis Some statistical history: John Chambers Idea (1960s - ): Object Oriented approach to statistical analysis Developed as software package S –Basis of S-plus (commercial product) –And of R (free-ware, current favorite of Chambers) Reference for more on this: Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S, Fourth Edition, Springer, N. Y., ISBN

Object Oriented Data Analysis Another take: J. O. Ramsay “Functional Data Objects” (closer to C. S. meaning) Personal Objection: “Functional” in mathematics is: “Function that operates on functions”

Object Oriented Data Analysis Current Motivation: –In Complicated Data Analyses –Fundamental (Non-Obvious) Question Is: “What Should We Take as Data Objects?” –Key to Focussing Needed Analyses

Object Oriented Data Analysis Reviewer for Annals of Applied Statistics: Why not just say: “Experimental Units”? –Useful for some situations –But misses different representations, e.g. log transformations …

Object Oriented Data Analysis Comment from Randy Eubank: This terminology: "Object Oriented Data Analysis" First appeared in Florida FDA Meeting:

Object Oriented Data Analysis References: Wang and Marron (2007) Marron and Alonso (2014)

Object Oriented Data Analysis What is Actually Done? Major Statistical Tasks: Understanding Population Structure Classification (i.e. Discrimination) Time Series of Data Objects “Vertical Integration” of Datatypes

Visualization How do we look at data? Start in Euclidean Space, Will later study other spaces

Notation

Visualization How do we look at Euclidean data? 1-d: histograms, etc. 2-d: scatterplots 3-d: spinning point clouds

Visualization How do we look at Euclidean data? Higher Dimensions? Workhorse Idea: Projections

Projection Important Point There are many “directions of interest” on which projection is useful An important set of directions: Principal Components
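
As a minimal sketch of the projection idea (the toy data, the function name `project_1d`, and the chosen directions are illustrative assumptions, not taken from the slides), a 1-d projection is the inner product of each data vector with a unit-length direction:

```python
import numpy as np

def project_1d(X, direction):
    """Return the 1-d projection coefficient of each row of X onto a direction."""
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)      # rescale to a unit-length direction vector
    return X @ u                   # one coefficient per data point

# Hypothetical toy data: 5 points in 3-d (rows are data objects).
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [2.0, 2.0, 1.0],
              [4.0, 0.0, 1.0],
              [3.0, 5.0, 5.0]])

x_proj = project_1d(X, [1, 0, 0])       # projection on the X-axis = first coordinate
diag_proj = project_1d(X, [1, 1, 1])    # projection on a diagonal direction
```

Projecting on a coordinate axis simply reads off that coordinate; other directions of interest only change the vector passed as `direction`.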

Illustration of Multivariate View: Raw Data

Illustration of Multivariate View: Highlight One

Illustration of Multivariate View: Gene 1 Express’n

Illustration of Multivariate View: Gene 2 Express’n

Illustration of Multivariate View: Gene 3 Express’n

Illust’n of Multivar. View: 1-d Projection, X-axis

Illust’n of Multivar. View: X-Projection, 1-d view

X Coordinates Are Projections

Illust’n of Multivar. View: X-Projection, 1-d view Y Coordinates Show Order in Data Set (or Random)

Illust’n of Multivar. View: X-Projection, 1-d view Smooth histogram = Kernel Density Estimate
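
The "smooth histogram" can be sketched as a Gaussian kernel density estimate. This is only an illustration: the Gaussian kernel, the bandwidth value, the simulated projections, and the name `gaussian_kde_1d` are assumptions, not details from the slides.

```python
import numpy as np

def gaussian_kde_1d(data, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `data` on `grid`."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every data point.
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale so the estimate integrates to 1.
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
# Hypothetical 1-d projections with two clusters.
projections = np.concatenate([rng.normal(-2, 0.5, 50), rng.normal(2, 0.5, 50)])
grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(projections, grid, bandwidth=0.4)
```

Each data point contributes one small bump; the estimate is their average, so clusters in the projections show up as separate modes of `density`.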

Illust’n of Multivar. View: 1-d Projection, Y-axis

Illust’n of Multivar. View: Y-Projection, 1-d view

Illust’n of Multivar. View: 1-d Projection, Z-axis

Illust’n of Multivar. View: Z-Projection, 1-d view

Illust’n of Multivar. View: 2-d Proj’n, XY-plane

Illust’n of Multivar. View: XY-Proj’n, 2-d view

Illust’n of Multivar. View: 2-d Proj’n, XZ-plane

Illust’n of Multivar. View: XZ-Proj’n, 2-d view

Illust’n of Multivar. View: 2-d Proj’n, YZ-plane

Illust’n of Multivar. View: YZ-Proj’n, 2-d view

Illust’n of Multivar. View: all 3 planes

Illust’n of Multivar. View: Diagonal 1-d proj’ns

Illust’n of Multivar. View: Add off-diagonals

Illust’n of Multivar. View: Typical View

Projection Important Point There are many “directions of interest” on which projection is useful An important set of directions: Principal Components

Find Directions of: “Maximal (projected) Variation” Compute Sequentially On Orthogonal Subspaces Will take careful look at mathematics later
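
A rough sketch of "compute sequentially on orthogonal subspaces", under the assumption that each direction is taken as the top eigenvector of the covariance of what remains after removing the earlier directions (the function name `sequential_pca` and the simulated data are hypothetical; the careful mathematics is deferred in the slides):

```python
import numpy as np

def sequential_pca(X, n_components):
    """Find PC directions one at a time, each orthogonal to the previous ones."""
    Xc = X - X.mean(axis=0)                 # center the data
    residual = Xc.copy()
    directions, variances = [], []
    for _ in range(n_components):
        cov = residual.T @ residual / (len(X) - 1)
        evals, evecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
        u = evecs[:, -1]                    # direction of maximal projected variance
        directions.append(u)
        variances.append(evals[-1])
        # Remove the component along u, restricting to the orthogonal subspace.
        residual = residual - np.outer(residual @ u, u)
    return np.array(directions), np.array(variances)

rng = np.random.default_rng(1)
# Hypothetical 3-d toy data: different spread per axis, then rotated.
c = 1.0 / np.sqrt(2.0)
R = np.array([[c, c, 0.0], [-c, c, 0.0], [0.0, 0.0, 1.0]])
X = (rng.normal(size=(200, 3)) * [3.0, 1.5, 0.5]) @ R
directions, variances = sequential_pca(X, 3)
```

The sequential answer should agree with a single eigen-decomposition of the full sample covariance, which is how PCA is usually computed in practice.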

Principal Components For simple, 3-d toy data, recall raw data view:

Principal Components PCA just gives rotated coordinate system:
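
To illustrate the "rotated coordinate system" point, here is a small sketch on simulated 3-d data (the data and variable names are assumptions). It checks that the matrix of PC directions is orthogonal, so distances are preserved and the raw data can be recovered by rotating back:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) * [2.0, 1.0, 0.3]   # hypothetical 3-d toy data
Xc = X - X.mean(axis=0)

# Columns of V are the PC directions (eigenvectors of the sample covariance).
evals, V = np.linalg.eigh(np.cov(Xc.T))
V = V[:, ::-1]                   # reorder so PC1 comes first

scores = Xc @ V                  # coordinates of the data in the PC system

# V is orthogonal, so this is a rigid rotation (possibly with a reflection):
is_orthogonal = np.allclose(V.T @ V, np.eye(3))
# pairwise distances between points are unchanged ...
d_raw = np.linalg.norm(Xc[0] - Xc[1])
d_pca = np.linalg.norm(scores[0] - scores[1])
# ... and the original data are recovered by rotating back.
X_back = scores @ V.T + X.mean(axis=0)
```

In the new coordinates the sample covariance is diagonal, with the largest variance on the first axis.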

Principal Components Early References: Pearson (1901) Hotelling (1933)

Illust’n of PCA View: Recall Raw Data

Illust’n of PCA View: Recall Gene by Gene Views

Illust’n of PCA View: PC1 Projections

Note Different Axis Chosen to Maximize Spread

Illust’n of PCA View: PC1 Projections, 1-d View

Illust’n of PCA View: PC2 Projections

Illust’n of PCA View: PC2 Projections, 1-d View

Illust’n of PCA View: PC3 Projections

Illust’n of PCA View: PC3 Projections, 1-d View

Illust’n of PCA View: Projections on PC1,2 plane

Illust’n of PCA View: PC1 & 2 Proj’n Scatterplot

Illust’n of PCA View: Projections on PC1,3 plane

Illust’n of PCA View: PC1 & 3 Proj’n Scatterplot

Illust’n of PCA View: Projections on PC2,3 plane

Illust’n of PCA View: PC2 & 3 Proj’n Scatterplot

Illust’n of PCA View: All 3 PC Projections

Illust’n of PCA View: Matrix with 1-d proj’ns on diag.

Illust’n of PCA: Add off-diagonals to matrix

Illust’n of PCA View: Typical View

Comparison of Views Highlight 3 clusters Gene by Gene View –Clusters appear in all 3 scatterplots –But never very separated PCA View –1st shows three distinct clusters –Better separated than in gene view –Clustering concentrated in 1st scatterplot Effect is small, since only 3-d

Illust’n of PCA View: Gene by Gene View

Illust’n of PCA View: PCA View

Clusters are “more distinct” Since more “air space” In between

Another Comparison of Views Much higher dimension, # genes = 4000 Gene by Gene View

Another Comparison: Gene by Gene View

Very Small Differences Between Means

Another Comparison of Views Much higher dimension, # genes = 4000 Gene by Gene View –Clusters very nearly the same –Very slight difference in means

Another Comparison: PCA View

Another Comparison of Views Much higher dimension, # genes = 4000 Gene by Gene View –Clusters very nearly the same –Very slight difference in means PCA View –Huge difference in 1st PC Direction –Magnification of clustering –Lesson: Alternate views can show much more –(especially in high dimensions, i.e. for many genes) –Shows PC view is very useful
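
The lesson can be reproduced in a small simulation. This is a sketch under assumed settings: two Gaussian classes, 100 cases each, 4000 genes, and a per-gene mean shift of 0.3 standard deviations. None of these values come from the slides, and `separation` is a hypothetical helper.

```python
import numpy as np

def separation(a, b):
    """Absolute difference of group means in units of the pooled standard deviation."""
    pooled_sd = np.sqrt(0.5 * (a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled_sd

rng = np.random.default_rng(3)
n_per_class, n_genes, shift = 100, 4000, 0.3   # all values are assumptions

# Two classes whose means differ only slightly in every single gene.
class_a = rng.normal(loc=-shift / 2, size=(n_per_class, n_genes))
class_b = rng.normal(loc=+shift / 2, size=(n_per_class, n_genes))
X = np.vstack([class_a, class_b])

# Gene by gene view: separation within each coordinate.
gene_seps = separation(class_a, class_b)

# PCA view: project the centered data on the first PC direction.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]
pc1_sep = separation(pc1_scores[:n_per_class], pc1_scores[n_per_class:])

print(f"median gene-wise separation: {np.median(gene_seps):.2f}")
print(f"PC1 separation:              {pc1_sep:.2f}")
```

Each gene alone carries a tiny difference, but the first PC direction pools the small differences across all genes, so the two classes separate clearly along it.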

Data Object Conceptualization Object Space  Descriptor Space Curves Images Manifolds Shapes Tree Space Trees

E.g. Curves As Data Object Space: Set of curves Descriptor Space(s): Curves digitized to vectors (look at 1st) Basis Representations: Fourier (sin & cos) B-splines Wavelets
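
Two of the descriptor spaces named above can be sketched as follows: digitizing a curve on a grid, and representing it by Fourier (sin and cos) basis coefficients fitted by least squares. The example curve, the grid size, and the function names are illustrative assumptions.

```python
import numpy as np

# Digitize: evaluate each curve on a common grid, giving one vector per curve.
grid = np.linspace(0.0, 1.0, 101)

def example_curve(t):
    """Hypothetical smooth curve, used only for illustration."""
    return 1.0 + 2.0 * np.sin(2 * np.pi * t) - 0.5 * np.cos(4 * np.pi * t)

curve_vector = example_curve(grid)            # descriptor: vector in R^101

def fourier_design(t, n_freq):
    """Columns: constant, then sin and cos at frequencies 1..n_freq."""
    cols = [np.ones_like(t)]
    for k in range(1, n_freq + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)

B = fourier_design(grid, n_freq=3)            # 101 x 7 basis matrix
coefs, *_ = np.linalg.lstsq(B, curve_vector, rcond=None)
reconstruction = B @ coefs                    # curve rebuilt from 7 coefficients
```

The same curve is thus described either by 101 digitized values or by 7 basis coefficients; which descriptor is preferable depends on the analysis.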

E.g. Curves As Data, I

Functional Data Analysis, Toy EG I

Functional Data Analysis, Toy EG II

Functional Data Analysis, Toy EG III

Functional Data Analysis, Toy EG IV

Functional Data Analysis, Toy EG V

Functional Data Analysis, Toy EG VI

Classical Terminology: Coefficients of Projections are “Scores” Entries of Direction Vector are “Loadings”
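
In code, the two terms map onto two different arrays. The following is a sketch on simulated data, with the PC direction taken from the SVD of the centered data matrix (one common way to compute it; the data and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4)) * [3.0, 2.0, 1.0, 0.5]   # hypothetical data, rows = objects
mean = X.mean(axis=0)
Xc = X - mean

# SVD of the centered data: rows of Vt are the PC direction vectors.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings_pc1 = Vt[0]             # entries of the PC1 direction vector ("loadings")
scores_pc1 = Xc @ loadings_pc1   # projection coefficient of each object ("scores")

# Rank-1 representation: mean + score * direction, one row per object.
approx = mean + np.outer(scores_pc1, loadings_pc1)
```

There is one loading per variable and one score per data object, which is why a loadings plot and a scores plot show different things.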

Functional Data Analysis, Toy EG VII

Functional Data Analysis, Toy EG VIII

Terminology: “Loadings Plot” “Scores Plot”

Functional Data Analysis, Toy EG IX

Functional Data Analysis, Toy EG X

E.g. Curves As Data, I

E.g. Curves As Data, II

Functional Data Analysis, 10-d Toy EG 1

Terminology: “Loadings Plots” “Scores Plots”

Functional Data Analysis, 10-d Toy EG 1

E.g. Curves As Data, II PCA: reveals “population structure” –Mean: Parabolic Structure –PC1: Vertical Shift –PC2: Tilt –higher PCs: Gaussian (spherical) Decomposition into modes of variation
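
A simulation in the spirit of this 10-d toy example is sketched below. The generative model (parabolic mean, random vertical shift, random tilt, small spherical Gaussian noise) and all numeric settings are assumptions chosen to mimic the described structure; they are not the data behind the slides.

```python
import numpy as np

rng = np.random.default_rng(5)
n_curves, d = 200, 10
t = np.linspace(0.0, 1.0, d)

# Assumed generative model for the toy curves (not taken from the slides):
parabola = 4.0 * (t - 0.5) ** 2                    # common mean structure
shift = rng.normal(scale=1.0, size=(n_curves, 1))  # random vertical shift
tilt = rng.normal(scale=1.0, size=(n_curves, 1))   # random slope
noise = rng.normal(scale=0.1, size=(n_curves, d))  # small spherical Gaussian noise
curves = parabola + shift + tilt * (t - 0.5) + noise

mean_curve = curves.mean(axis=0)
U, s, Vt = np.linalg.svd(curves - mean_curve, full_matrices=False)
explained = s**2 / np.sum(s**2)                    # share of variation per PC

# Compare the leading loadings with idealized "shift" and "tilt" directions.
flat = np.ones(d) / np.sqrt(d)
slope = (t - 0.5) / np.linalg.norm(t - 0.5)
cos_pc1_flat = abs(Vt[0] @ flat)
cos_pc2_slope = abs(Vt[1] @ slope)
```

Under this model the sample mean is close to the parabola, the PC1 loadings are nearly constant (vertical shift), the PC2 loadings are nearly linear (tilt), and the remaining PCs share the small leftover noise about equally.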