Statistics – O. R. 892 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina.

Slides:



Advertisements
Similar presentations
Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When.
Advertisements

EigenFaces and EigenPatches Useful model of variation in a region –Region must be fixed shape (eg rectangle) Developed for face recognition Generalised.
Dimensionality Reduction PCA -- SVD
Object Orie’d Data Analysis, Last Time Finished NCI 60 Data Started detailed look at PCA Reviewed linear algebra Today: More linear algebra Multivariate.
Principal Component Analysis CMPUT 466/551 Nilanjan Ray.
Dimensional reduction, PCA
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Exploring Microarray data Javier Cabrera. Outline 1.Exploratory Analysis Steps. 2.Microarray Data as Multivariate Data. 3.Dimension Reduction 4.Correlation.
PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS Prof.Dr.Cevdet Demir
Matlab Software To Do Analyses as in Marron’s Talks Matlab Available from UNC Site License Download Software: Google “Marron Software”
Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.
Statistics – O. R. 892 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina.
Object Orie’d Data Analysis, Last Time Finished Algebra Review Multivariate Probability Review PCA as an Optimization Problem (Eigen-decomp. gives rotation,
Chapter 2 Dimensionality Reduction. Linear Methods
Principal Components Analysis BMTRY 726 3/27/14. Uses Goal: Explain the variability of a set of variables using a “small” set of linear combinations of.
Object Orie’d Data Analysis, Last Time Gene Cell Cycle Data Microarrays and HDLSS visualization DWD bias adjustment NCI 60 Data Today: More NCI 60 Data.
DNA microarray technology allows an individual to rapidly and quantitatively measure the expression levels of thousands of genes in a biological sample.
Object Orie’d Data Analysis, Last Time Organizational Matters
1 UNC, Stat & OR Nonnegative Matrix Factorization.
Statistics – O. R. 891 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina.
Robust PCA Robust PCA 3: Spherical PCA. Robust PCA.
Object Orie’d Data Analysis, Last Time Discrimination for manifold data (Sen) –Simple Tangent plane SVM –Iterated TANgent plane SVM –Manifold SVM Interesting.
Descriptive Statistics vs. Factor Analysis Descriptive statistics will inform on the prevalence of a phenomenon, among a given population, captured by.
1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina.
Maximal Data Piling Visual similarity of & ? Can show (Ahn & Marron 2009), for d < n: I.e. directions are the same! How can this be? Note lengths are different.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Object Orie’d Data Analysis, Last Time Organizational Matters What is OODA? Visualization by Projection.
Return to Big Picture Main statistical goals of OODA: Understanding population structure –Low dim ’ al Projections, PCA … Classification (i. e. Discrimination)
PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS Richard Brereton
Participant Presentations Please Sign Up: Name (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Thurs., Early, Oct., Nov.,
Stat 31, Section 1, Last Time Course Organization & Website What is Statistics? Data types.
Principal Component Analysis (PCA)
Statistics – O. R. 893 Object Oriented Data Analysis Steve Marron Dept. of Statistics and Operations Research University of North Carolina.
Statistics – O. R. 892 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina.
Statistics – O. R. 893 Object Oriented Data Analysis Steve Marron Dept. of Statistics and Operations Research University of North Carolina.
Participant Presentations Please Prepare to Sign Up: Name (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Next Week, Early,
Object Orie’d Data Analysis, Last Time PCA Redistribution of Energy - ANOVA PCA Data Representation PCA Simulation Alternate PCA Computation Primal – Dual.
1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, I J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina.
Object Orie’d Data Analysis, Last Time PCA Redistribution of Energy - ANOVA PCA Data Representation PCA Simulation Alternate PCA Computation Primal – Dual.
PCA as Optimization (Cont.) Recall Toy Example Empirical (Sample) EigenVectors Theoretical Distribution & Eigenvectors Different!
Participant Presentations Please Sign Up: Name (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Next Week, Early, Oct.,
GWAS Data Analysis. L1 PCA Challenge: L1 Projections Hard to Interpret (i.e. Little Data Insight) Solution: 1)Compute PC Directions Using L1 2)Compute.
Object Orie’d Data Analysis, Last Time Reviewed Clustering –2 means Cluster Index –SigClust When are clusters really there? Q-Q Plots –For assessing Goodness.
1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, III J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina.
Principal Components Analysis ( PCA)
Central limit theorem - go to web applet. Correlation maps vs. regression maps PNA is a time series of fluctuations in 500 mb heights PNA = 0.25 *
Unsupervised Learning II Feature Extraction
Object Orie’d Data Analysis, Last Time Organizational Matters
Cornea Data Main Point: OODA Beyond FDA Recall Interplay: Object Space  Descriptor Space.
Distance Weighted Discrim ’ n Based on Optimization Problem: For “Residuals”:
Object Orie’d Data Analysis, Last Time Finished NCI 60 Data Linear Algebra Review Multivariate Probability Review PCA as an Optimization Problem (Eigen-decomp.
1 C.A.L. Bailer-Jones. Machine Learning. Data exploration and dimensionality reduction Machine learning, pattern recognition and statistical data modelling.
Unsupervised Learning
Statistical Smoothing
SiZer Background Finance "tick data":
Principal Component Analysis (PCA)
Functional Data Analysis
Dimension Reduction via PCA (Principal Component Analysis)
Radial DWD Main Idea: Linear Classifiers Good When Each Class Lives in a Distinct Region Hard When All Members Of One Class Are Outliers in a Random Direction.
Statistics – O. R. 881 Object Oriented Data Analysis
Statistics – O. R. 881 Object Oriented Data Analysis
Maximal Data Piling MDP in Increasing Dimensions:
Principal Nested Spheres Analysis
Principal Component Analysis
Dimension reduction : PCA and Clustering
Feature space tansformation methods
Unsupervised Learning
Statistics – O. R. 891 Object Oriented Data Analysis
Presentation transcript:

Statistics – O. R. 892 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina

Administrative Info Details on Course Web Page Or: –Google: “Marron Courses” –Choose This Course

Object Oriented Data Analysis What is it? A Sound-Bite Explanation: What is the “atom of the statistical analysis”? 1 st Course: Numbers Multivariate Analysis Course : Vectors Functional Data Analysis: Curves More generally: Data Objects

Object Oriented Data Analysis Current Motivation:  In Complicated Data Analyses  Fundamental (Non-Obvious) Question Is: “What Should We Take as Data Objects?”  Key to Focussing Needed Analyses

Illustration of Multivariate View: Raw Data

Illust ’ n of Multivar. View: Typical View

Principal Components Find Directions of: “Maximal (projected) Variation” Compute Sequentially On Orthogonal Subspaces Will take careful look at mathematics later

Principal Components For simple, 3-d toy data, recall raw data view:

Principal Components PCA just gives rotated coordinate system:

Illust ’ n of PCA View: PCA View Clusters are “more distinct” Since more “air space” In between

Another Comparison: Gene by Gene View Very Small Differences Between Means

Another Comparison: PCA View

Data Object Conceptualization Object Space  Descriptor Space Curves Images Manifolds Shapes Tree Space Trees

Functional Data Analysis, Toy EG I

Functional Data Analysis, Toy EG X

Functional Data Analysis, 10-d Toy EG 1 Terminology: “Loadings Plots” “Scores Plots”

Functional Data Analysis, 10-d Toy EG 1

E.g. Curves As Data, II PCA: reveals “population structure” Mean  Parabolic Structure PC1  Vertical Shift PC2  Tilt higher PCs  Gaussian (spherical) Decomposition into modes of variation

E.g. Curves As Data, III Two Cluster Example 10-d curves again

Functional Data Analysis, 10-d Toy EG 2

E.g. Curves As Data, III Two Cluster Example 10-d curves again Two big clusters Revealed by 1-d projection plot (left side) Note: Cluster Difference is not orthogonal to Vertical Shift PCA: reveals “population structure”

E.g. Curves As Data, IV More Complicated Example 50-d curves

Functional Data Analysis, 50-d Toy EG 3

E.g. Curves As Data, IV More Complicated Example 50-d curves Pop’n structure hard to see in 1-d 2-d projections make structure clear PCA: reveals “population structure”

Interesting Data Set: Mortality Data For Spanish Males (thus can relate to history) Each curve is a single year x coordinate is age Mortality = # died / total # (for each age) Study on log scale Investigate change over years 1908 – 2002 From Andrés Alonso, U. Carlos III, Madrid {See Marron & Alonso (2014)} Functional Data Analysis

Interesting Data Set: Mortality Data For Spanish Males (thus can relate to history) Each curve is a single year x coordinate is age Note: Choice made of Data Object (could also study age as curves, x coordinate = time) Functional Data Analysis

Important Issue: What are the Data Objects? Mortality vs. Age Curves (over years) Mortality vs. Year Curves (over ages) Note: Rows vs. Columns of Data Matrix Functional Data Analysis

Important Issue: What are the Data Objects? Mortality vs. Age Curves (over years) Mortality vs. Year Curves (over ages) Note: Rows vs. Columns of Data Matrix Functional Data Analysis

Interesting Data Set: Mortality Data For Spanish Males (thus can relate to history) Each curve is a single year x coordinate is age Mortality = # died / total # (for each age) Study on log scale Another Data Object Choice (not about experimental units) Functional Data Analysis

Mortality Time Series Conventional Coloring: Rotate Through (7) Colors Hard to See Time Structure

Mortality Time Series Improved Coloring: Rainbow Representing Year: Magenta = 1908 Red = 2002

Color Code (Years) Mortality Time Series

Find Population Center (Mean Vector) Compute in Descriptor Space Show in Object Space

Mortality Time Series Blips Appear At Decades Since Ages Not Precise (in Spain) Reported as “about 50”, Etc.

Mortality Time Series Mean Residual Descriptor Space View of Shifting Data To Origin In Feature Space

Mortality Time Series Shows: Main Age Effects in Mean, Not Variation About Mean

Mortality Time Series Object Space View of Projections Onto PC1 Direction Main Mode Of Variation: Constant Across Ages

Mortality Time Series Object Space View of Projections Onto PC1 Direction Main Mode Of Variation: Constant Across Ages Loadings Plot

Mortality Time Series Shows Major Improvement Over Time (medical technology, etc.) And Change In Age Rounding Blips

Mortality Time Series Corresponding PC 1 Scores Again Shows Overall Improvement High Mortality Early

Mortality Time Series Corresponding PC 1 Scores Again Shows Overall Improvement High Mortality Early Lower Later

Mortality Time Series Outliers

Mortality Time Series Outliers 1918 Global Flu Pandemic

Mortality Time Series Outliers 1918 Global Flu Pandemic

Mortality Time Series Outliers 1918 Global Flu Pandemic Spanish Civil War

Mortality Time Series Object Space View of Projections Onto PC2 Direction

Mortality Time Series Object Space View of Projections Onto PC2 Direction Loadings Plot

Mortality Time Series Object Space View of Projections Onto PC2 Direction 2nd Mode Of Variation: Difference Between & Rest

Mortality Time Series Explain Using PC 2 Scores Early Improvement

Mortality Time Series Explain Using PC 2 Scores Early Improvement Pandemic Hit Hardest

Mortality Time Series Explain Using PC 2 Scores Then better

Mortality Time Series Explain Using PC 2 Scores Then better Spanish Civil War Hit Hardest

Mortality Time Series Explain Using PC 2 Scores Steady Improvement To mid-50s

Mortality Time Series Explain Using PC 2 Scores Steady Improvement To mid-50s Increasing Automotive Death Rate

Mortality Time Series Explain Using PC 2 Scores Corner Finally Turned by Safer Cars & Roads

Mortality Time Series Scores Plot Descriptor (Point Cloud) Space View Connecting Lines Highlight Time Order Good View of Historical Effects

Mortality Time Series Try a Related Mortality Data Set: Switzerland (In Europe, but different history)

Mortality Time Series – Swiss Males

Some Points Similar to Spain:  PC1: Overall Improvement  Better for Young  PC2: About 20 – 45 vs. Others  Flu Pandemic  Automobile Effects Some Quite Different: o No Age Rounding o No Civil War

Time Series of Data Objects Mortality Data Illustrates an Important Point: OODA is more than a “framework” It Provides a Focal Point Highlights Pivotal Choice: What should be the Data Objects?

Time Series of Curves Just a “Set of Curves” But Time Order is Important! Useful Approach (as above): Use color to code for time Start End

Time Series Toy E.g. Explore Question: “Is Horizontal Motion Linear Variation?” Example: Set of time shifted Gaussian densities View: Code time with colors as above

T. S. Toy E.g., Raw Data

T. S. Toy E.g., PCA View PCA gives “Modes of Variation” But there are Many…

T. S. Toy E.g., PCA View PCA gives “Modes of Variation” But there are Many… Intuitively Useful??? Like “harmonics”?

T. S. Toy E.g., PCA View PCA gives “Modes of Variation” But there are Many… Intuitively Useful??? Like “harmonics”? Isn’t there only 1 mode of variation?

T. S. Toy E.g., PCA View PCA gives “Modes of Variation” But there are Many… Intuitively Useful??? Like “harmonics”? Isn’t there only 1 mode of variation? Answer comes in scores scatterplots

T. S. Toy E.g., PCA Scatterplot

Where is the Point Cloud? Lies along a 1-d curve in

T. S. Toy E.g., PCA Scatterplot Where is the Point Cloud? Lies along a 1-d curve in So actually have 1-d mode of variation But a non-linear mode of variation

T. S. Toy E.g., PCA Scatterplot Where is the Point Cloud? Lies along a 1-d curve in So actually have 1-d mode of variation But a non-linear mode of variation Poorly captured by PCA (linear method) Will study more later

Interesting Question Can we find a 1-d Representation? Perhaps by: Different Analysis? Other Choice of Data Objects?

Chemo-metric Time Series Mass Spectrometry Measurements On an Aging Substance, called “Estane” Made over Logarithmic Time Grid, n = 60 Each Data Object is a Spectrum What about Time Evolution? Approach: PCA & Time Coloring

Chemo-metric Time Series Joint Work w/ E. Kober & J. Wendelberger Los Alamos National Lab Four Experimental Conditions: 1.Control 2.Aged 59 days in Dry Air 3.Aged 27 days in Humid Air 4.Aged 59 days in Humid Air

Chemo-metric Time Series, HA 27

Raw Data: All 60 spectra essentially the same “Scale” of mean is much bigger than variation about mean Hard to see structure of all 1600 freq’s Centered Data: Now can see different spectra Since mean subtracted off Note much smaller vertical axis

Chemo-metric Time Series, HA 27 Next Zoom in on “Region of Interest”

Chemo-metric Time Series, HA 27

Data zoomed to “important” freq’s: Raw Data: Now see slight differences Smoother “natural looking” spectra Centered Data: Differences in spectra more clear Maybe now have “real structure” Scale is important

Chemo-metric Time Series, HA 27

Use of Time Order Coloring: Raw Data: Can see a little ordering, not much Centered Data: Clear time ordering Shifting peaks? (compare to Raw) PC1: Almost everything? PC1 Residuals: Data nearly linear (same scale import’nt)

Chemo-metric Time Series, Control

PCA View Clear systematic structure Time ordering very important Reminiscent of Toy Example (Shifting Gaussian Peak) Roughly 1-d curve in Feature Space Physical Explanation?

Toy Data Explanations Simple Chemical Reaction Model: Subst. 1 transforms into Subst. 2 Note: linear path in Feature Space

Toy Data Explanations Richer Chemical Reaction Model: Subst. 1  Subst. 2  Subst. 3 Curved path in Feat. Sp. 2 Reactions  Curve lies in 2-dim’al subsp.

Toy Data Explanations Another Chemical Reaction Model: Subst. 1  Subst. 2 & Subst. 5  Subst. 6 Curved path in Feat. Sp. 2 Reactions  Curve lies in 2-dim’al subsp.

Toy Data Explanations More Complex Chemical Reaction Model: 1  2  3  4 Curved path in Feat. Sp. (lives in 3-d) 3 Reactions  Curve lies in 3-dim’al subsp.

Toy Data Explanations Even More Complex Chemical Reaction Model: 1  2  3  4  5 Curved path in Feat. Sp. (lives in 4-d) 4 Reactions  Curve lies in 4-dim’al subsp.

Chemo-metric Time Series, Control

Suggestions from Toy Examples: Clearly 3 reactions under way Maybe a 4 th ??? Hard to distinguish from noise? Interesting statistical open problem!

Chemo-metric Time Series What about the other experiments? Recall: 1.Control 2.Aged 59 days in Dry Air 3.Aged 27 days in Humid Air 4.Aged 59 days in Humid Air Above results were “cherry picked”, to best makes points What about other cases???

Scatterplot Matrix, Control Above E.g., maybe ~4d curve  ~4 reactions

Scatterplot Matrix, Da59 PC2 is “bleeding of CO2”, discussed below

Scatterplot Matrix, Ha27 Only “3-d + noise”?  Only 3 reactions

Scatterplot Matrix, Ha59 Harder to judge???

Object Space View, Control Terrible discretization effect, despite ~4d …

Object Space View, Da59 OK, except strange at beginning (CO2 …)

Object Space View, Ha27 Strong structure in PC1 Resid (d < 2)

Object Space View, Ha59 Lots at beginning, OK since “oldest”

Problem with Da59 What about strange behavior for DA59? Recall: PC2 showed “really different behavior at start” Chemists comments: Ignore this, should have started measuring later…

Problem with Da59 But still fun to look at broader spectra

Chemo-metric T. S. Joint View Throw them all together as big population Take Point Cloud View

Chemo-metric T. S. Joint View

Throw them all together as big population Take Point Cloud View Note 4d space of interest, driven by: 4 clusters (3d) PC1 of chemical reaction (1-d) But these don’t appear as the 4 PCs Chem. PC1 “spread over PC2,3,4” Essentially a “rotation of interesting dir’ns” How to “unrotate”???

Chemo-metric T. S. Joint View Interesting Variation: Remove cluster means Allows clear comparison of within curve variation

Chemo-metric T. S. Joint View (- mean)

Chemo-metric T. S. Joint View Interesting Variation: Remove cluster means Allows clear comparison of within curve variation: PC1 versus others are quite revealing (note different “rotations”) Others don’t show so much

Functional Data Analysis Interesting Real Data Example Genetics (Cancer Research) RNAseq (Next Gener’n Sequen’g) Deep look at “gene components” Microarrays: Single number (per gene) RNAseq: Thousands of measurements

Interesting Real Data Example Genetics (Cancer Research) RNAseq (Next Gener’n Sequen’g) Deep look at “gene components” Gene studied here: CDNK2A Goal: Study Alternate Splicing Sample Size, n = 180 Dimension, d = ~1700 Functional Data Analysis

Simple 1 st View: Curve Overlay (log scale) Functional Data Analysis

Often Useful Population View: PCA Scores Functional Data Analysis

Suggestion Of Clusters ??? Functional Data Analysis

Suggestion Of Clusters Which Are These? Functional Data Analysis

Manually “Brush” Clusters Functional Data Analysis

Manually Brush Clusters Clear Alternate Splicing Functional Data Analysis

Important Points PCA found Important Structure In High Dimensional Data Analysis d ~ 1700 Functional Data Analysis

Consequences:  Led to Development of SigFuge  Whole Genome Scan Found Interesting Genes  Published in Kimes, et al (2014) Functional Data Analysis

Interesting Question: When are clusters really there? (will study later) Functional Data Analysis

Limitation of PCA PCA can provide useful projection directions

Limitation of PCA PCA can provide useful projection directions But can’t “see everything”…

Limitation of PCA PCA can provide useful projection directions But can’t “see everything”… Reason: PCA finds dir’ns of maximal variation Which may obscure interesting structure

Limitation of PCA Toy Example: Apple – Banana – Pear

Limitation of PCA, Toy E.g.

Limitation of PCA Toy Example: Apple – Banana – Pear Obscured by “noisy dimensions” 1 st 3 PC directions only show noise

Limitation of PCA Toy Example: Apple – Banana – Pear Obscured by “noisy dimensions” 1 st 3 PC directions only show noise Study some rotations, to find structure

Limitation of PCA, Toy E.g.

Limitation of PCA Toy Example: Rotation shows Apple – Banana – Pear Example constructed as: 1 st make these in 3-d Add 3 dimensions of high s.d. noise Carefully watch axis labels

Limitation of PCA Main Point:  May be Important Data Structure  Not Visible in 1 st Few PCs

Limitation of PCA, E.g. Interesting Data Set: NCI-60

Limitation of PCA, E.g. Interesting Data Set: NCI-60  NCI = National Cancer Institute  60 Cell Lines (cancer treatment targets) For Different Cancer Types

Limitation of PCA, E.g. Interesting Data Set: NCI-60  NCI = National Cancer Institute  60 Cell Lines (cancer treatment targets) For Different Cancer Types  Measured “Gene Expression” = “Gene Activity”  Several Thousand Genes (Simultaneously)

Limitation of PCA, E.g. Interesting Data Set: NCI-60  NCI = National Cancer Institute  60 Cell Lines (cancer treatment targets) For Different Cancer Types  Measured “Gene Expression” = “Gene Activity”  Several Thousand Genes (Simultaneously)  Data Objects = Vectors of Gene Exp’n  Lots of Preprocessing (study later)

NCI 60 Data Important Aspect: 8 Cancer Types Renal Cancer Non Small Cell Lung Cancer Central Nervous System Cancer Ovarian Cancer Leukemia Cancer Colon Cancer Breast Cancer Melanoma (Skin)

PCA Visualization of NCI 60 Data Can we find classes: (Renal, CNS, Ovar, Leuk, Colon, Melan) Using PC directions?? First try “unsupervised view” I.e. switch off class colors Then turn on colors, to identify clusters I.e. look at “supervised view”

NCI 60: Can we find classes Using PCA view?

PCA Visualization of NCI 60 Data Maybe need to look at more PCs? Study array of such PCA projections:

NCI 60: Can we find classes Using PCA 5-8?

NCI 60: Can we find classes Using PCA 9-12?

NCI 60: Can we find classes Using PCA 1-4 vs. 5-8?

NCI 60: Can we find classes Using PCA 1-4 vs. 9-12?

NCI 60: Can we find classes Using PCA 5-8 vs. 9-12?

PCA Visualization of NCI 60 Data Can we find classes using PC directions?? Found some, but not others Nothing after 1 st five PCs Rest seem to be noise driven

PCA Visualization of NCI 60 Data Can we find classes using PC directions?? Found some, but not others Nothing after 1 st five PCs Rest seem to be noise driven Are There Better Directions? PCA only “feels” maximal variation Ignores Class Labels How Can We Use Class Labels?