Object Orie’d Data Analysis, Last Time Organizational Matters What is OODA? Visualization by Projection.

Presentation transcript:

Object Orie’d Data Analysis, Last Time Organizational Matters What is OODA? Visualization by Projection Object Space & Feature Space Curves as Data Data Representation Issues PCA visualization

Data Object Conceptualization: Object Space & Feature Space. Curves, Images, Manifolds, Shapes, Tree Space, Trees

Functional Data Analysis, Toy EG I

Easy way to do these analyses: Matlab software (user friendly?) available. Download & put in Matlab Path: General Smoothing. Look first at: curvdatSM.m, scatplotSM.m

Time Series of Curves Again a “Set of Curves”, But now Time Order is Important! An approach: Use color to code for time, from Start to End

Time Series Toy E.g. Explore Question of Eli Broadhurst: “Is Horizontal Motion Linear Variation?” Example: Set of time shifted Gaussian densities View: Code time with colors as above

T. S. Toy E.g., Raw Data

T. S. Toy E.g., PCA View PCA gives “Modes of Variation” But there are Many… Intuitively Useful??? Like “harmonics”? Isn’t there only 1 mode of variation? Answer comes in 2-d scatterplots

T. S. Toy E.g., PCA Scatterplot

Where is the Point Cloud? Lies along a 1-d curve in feature space. So actually have 1-d mode of variation, But a non-linear mode of variation, Poorly captured by PCA (linear method). Will study more later
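The toy example above can be sketched numerically. This is a minimal illustration, not the Matlab code behind the slides: the grid, the shift range, the number of curves and the helper names are all assumptions chosen for the sketch. It builds a set of time-shifted Gaussian densities and shows that PCA spreads a single non-linear mode of variation over several components.

```python
import numpy as np

def shifted_gaussian_curves(n_curves=30, n_grid=201, sigma=1.0):
    """Rows are Gaussian densities whose centers move steadily left to right."""
    grid = np.linspace(-10.0, 10.0, n_grid)
    centers = np.linspace(-4.0, 4.0, n_curves)   # assumed shifts, one per time point
    curves = np.exp(-0.5 * ((grid[None, :] - centers[:, None]) / sigma) ** 2)
    curves /= sigma * np.sqrt(2.0 * np.pi)
    return grid, centers, curves

def pca_scores(data):
    """Center the rows, then return (scores, fraction of variance per PC) via SVD."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)
    return u * s, var_frac

grid, centers, curves = shifted_gaussian_curves()
scores, var_frac = pca_scores(curves)

# Only one parameter (the shift) varies, yet several PCs carry real variance.
print("variance fractions, PC1-PC4:", np.round(var_frac[:4], 3))
# Plotting scores[:, 0] against scores[:, 1], colored by time order,
# gives the kind of scatterplot discussed on the slides.
```

Plotting pairs of score columns is left out to keep the sketch dependency-free.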

Chemo-metric Time Series Mass Spectrometry Measurements On an Aging Substance, called “Estane” Made over Logarithmic Time Grid, n = 60 Each is a Spectrum What about Time Evolution? Approach: PCA & Time Coloring

Chemo-metric Time Series Joint Work w/ E. Kober & J. Wendelberger, Los Alamos National Lab. Four Experimental Conditions:
1. Control
2. Aged 59 days in Dry Air
3. Aged 27 days in Humid Air
4. Aged 59 days in Humid Air

Chemo-metric Time Series, HA 27

Raw Data: All 60 spectra essentially the same “Scale” of mean is much bigger than variation about mean Hard to see structure of all 1600 freq’s Centered Data: Now can see different spectra Since mean subtracted off Note much smaller vertical axis

Chemo-metric Time Series, HA 27

Data zoomed to “important” freq’s: Raw Data: Now see slight differences Smoother “natural looking” spectra Centered Data: Differences in spectra more clear Maybe now have “real structure” Scale is important

Chemo-metric Time Series, HA 27

Use of Time Order Coloring: Raw Data: Can see a little ordering, not much Centered Data: Clear time ordering Shifting peaks? (compare to Raw) PC1: Almost everything? PC1 Residuals: Data nearly linear (same scale import’nt)
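The effect of centering and of removing PC1 can be sketched on synthetic data. The spectra below are a hypothetical stand-in (a large common mean plus a small time trend and noise), not the Estane measurements; only the sizes 60 and 1600 come from the slides, and the helper `spread` is introduced just for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_spectra, n_freq = 60, 1600                 # sizes quoted on the slides
freq = np.linspace(0.0, 1.0, n_freq)

# Hypothetical spectra: big common mean, small rank-1 time trend, tiny noise.
mean_spectrum = 100.0 * np.exp(-0.5 * ((freq - 0.5) / 0.1) ** 2)
time = np.linspace(0.0, 1.0, n_spectra)
trend = np.outer(time, np.sin(2.0 * np.pi * freq))
spectra = mean_spectrum + trend + 0.01 * rng.standard_normal((n_spectra, n_freq))

def spread(x):
    """Vertical-axis range needed to plot x."""
    return float(x.max() - x.min())

# Centered data: subtract the mean spectrum from every row.
centered = spectra - spectra.mean(axis=0)

# PC1 direction, and the residual left after removing the PC1 projection.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]
pc1_resid = centered - np.outer(centered @ pc1, pc1)

print("raw spread:      ", spread(spectra))
print("centered spread: ", spread(centered))
print("PC1 resid spread:", spread(pc1_resid))
```

Each step shrinks the vertical scale, which is why the slides stress reading the axis when comparing panels.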

Chemo-metric Time Series, Control

PCA View Clear systematic structure Time ordering very important Reminiscent of Toy Example A clear 1-d curve in Feature Space Physical Explanation?

Toy Data Explanations Simple Chemical Reaction Model: Subst. 1 transforms into Subst. 2. Note: linear path in Feature Space

Toy Data Explanations Richer Chemical Reaction Model: Subst. 1 → Subst. 2 → Subst. 3. Curved path in Feat. Sp. 2 Reactions ⇒ Curve lies in 2-dim’al subsp.

Toy Data Explanations Another Chemical Reaction Model: Subst. 1 → Subst. 2 & Subst. 5 → Subst. 6. Curved path in Feat. Sp. 2 Reactions ⇒ Curve lies in 2-dim’al subsp.

Toy Data Explanations More Complex Chemical Reaction Model: 1 → 2 → 3 → 4. Curved path in Feat. Sp. (lives in 3-d). 3 Reactions ⇒ Curve lies in 3-dim’al subsp.

Toy Data Explanations Even More Complex Chemical Reaction Model: 1 → 2 → 3 → 4 → 5. Curved path in Feat. Sp. (lives in 4-d). 4 Reactions ⇒ Curve lies in 4-dim’al subsp.
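The link between the number of reactions and the dimension of the path can be checked with a small simulation. This is a sketch under stated assumptions: first-order kinetics, a simple Euler integrator, made-up rate constants and random "pure substance" spectra, none of which come from the slides. Each observed spectrum is modeled as a concentration-weighted mix of the pure spectra. Because concentrations always sum to 1, a chain of k reactions should give centered data of rank k.

```python
import numpy as np

def chain_concentrations(n_subst, rates, times):
    """First-order chain 1 -> 2 -> ... -> n_subst, integrated with small Euler steps."""
    conc = np.zeros(n_subst)
    conc[0] = 1.0                         # start with pure substance 1
    out, t, dt = [], 0.0, 1e-3
    for t_target in times:
        while t < t_target:
            flow = rates * conc[:-1]      # amount moving from k to k+1 per unit time
            conc[:-1] -= dt * flow
            conc[1:] += dt * flow
            t += dt
        out.append(conc.copy())
    return np.array(out)

def path_dimension(n_subst, seed=0):
    """Rank of the centered mixture spectra, i.e. dimension of the subspace holding the path."""
    rng = np.random.default_rng(seed)
    times = np.linspace(0.0, 10.0, 60)
    rates = np.linspace(1.0, 0.4, n_subst - 1)    # hypothetical rate constants
    pure_spectra = rng.random((n_subst, 200))     # hypothetical pure-substance spectra
    mixtures = chain_concentrations(n_subst, rates, times) @ pure_spectra
    centered = mixtures - mixtures.mean(axis=0)
    return np.linalg.matrix_rank(centered, tol=1e-8)

for n_subst in (2, 3, 4, 5):
    print(n_subst - 1, "reaction(s) -> path dimension", path_dimension(n_subst))
```

With one reaction the rank is 1, which matches the linear path of the simplest model. With real data the rank is blurred by noise, which is the difficulty raised for the Control experiment.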

Chemo-metric Time Series, Control

Suggestions from Toy Examples: Clearly 3 reactions under way. Maybe a 4th??? Hard to distinguish from noise? Interesting statistical open problem!

Chemo-metric Time Series What about the other experiments? Recall:
1. Control
2. Aged 59 days in Dry Air
3. Aged 27 days in Humid Air
4. Aged 59 days in Humid Air
Above results were “cherry picked”, to best make points. What about other cases???

Scatterplot Matrix, Control Above E.g., maybe ~4d curve ⇒ ~4 reactions

Scatterplot Matrix, Da59 PC2 is “bleeding of CO2”, discussed below

Scatterplot Matrix, Ha27 Only “3-d + noise”? ⇒ Only 3 reactions

Scatterplot Matrix, Ha59 Harder to judge???

Object Space View, Control Terrible discretization effect, despite ~4d …

Object Space View, Da59 OK, except strange at beginning (CO2 …)

Object Space View, Ha27 Strong structure in PC1 Resid (d < 2)

Object Space View, Ha59 Lots at beginning, OK since “oldest”

Problem with Da59 What about strange behavior for Da59? Recall: PC2 showed “really different behavior at start”. Chemists’ comments: Ignore this, should have started measuring later…

Problem with Da59 But still fun to look at broader spectra

Chemo-metric T. S. Joint View Throw them all together as big population Take Point Cloud View

Chemo-metric T. S. Joint View

Throw them all together as big population Take Point Cloud View Note 4d space of interest, driven by: 4 clusters (3d) PC1 of chemical reaction (1-d) But these don’t appear as the 4 PCs Chem. PC1 “spread over PC2,3,4” Essentially a “rotation of interesting dir’ns” How to “unrotate”???

Chemo-metric T. S. Joint View Interesting Variation: Remove cluster means Allows clear comparison of within curve variation

Chemo-metric T. S. Joint View (- mean)

Chemo-metric T. S. Joint View Interesting Variation: Remove cluster means Allows clear comparison of within curve variation: PC1 versus others are quite revealing (note different “rotations”) Others don’t show so much
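Removing cluster means is a small operation, sketched below on made-up data. The two-column data set, the offsets and the helper names are assumptions for illustration; only the four experiment labels come from the slides. Before the cluster means are removed, PC1 is dominated by the separation between clusters. Afterwards it follows the variation within each curve.

```python
import numpy as np

def remove_group_means(data, labels):
    """Subtract each group's own mean from its rows (within-group centering)."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    centered = np.empty_like(data)
    for group in np.unique(labels):
        rows = labels == group
        centered[rows] = data[rows] - data[rows].mean(axis=0)
    return centered

def leading_direction(data):
    """First PC direction of the rows of data."""
    centered = data - data.mean(axis=0)
    return np.linalg.svd(centered, full_matrices=False)[2][0]

rng = np.random.default_rng(1)
labels = np.repeat(["Control", "Da59", "Ha27", "Ha59"], 15)

# Hypothetical data: column 0 separates the clusters, column 1 holds a shared time trend.
offsets = {"Control": 0.0, "Da59": 10.0, "Ha27": 20.0, "Ha59": 30.0}
cluster_part = np.array([offsets[g] for g in labels]) + 0.1 * rng.standard_normal(60)
trend_part = np.tile(np.linspace(-1.0, 1.0, 15), 4) + 0.1 * rng.standard_normal(60)
data = np.column_stack([cluster_part, trend_part])

within = remove_group_means(data, labels)
print("PC1 before:", np.round(leading_direction(data), 2))
print("PC1 after: ", np.round(leading_direction(within), 2))
```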

Demography Data Joint Work with: Andres Alonso, Univ. Carlos III, Madrid. Mortality, as a function of age: “Chance of dying”, for Males, in Spain, of each 1-year age group. Curves are years. PCA of the family of curves

Demography Data PCA of the family of curves for Males: Babies & elderly “most mortal” (Raw). All getting better over time (Raw & PC1), Except Influenza Pandemic (see Color Scale). Middle age most mortal (PC2):
–1918
–Early 1930s - Spanish Civil War
–1980 – 1994 (then better) auto wrecks
Decade Rounding (several places)

Demography Data PCA for Females in Spain: Most aspects similar (see Color Scale). No War Changes:
–Steady improvement until 70s (PC2)
–When auto accidents kicked in

Demography Data PCA for Males in Switzerland: Most aspects similar. No decade rounding (better records). 1918 Flu – Different Color (PC2) (see Color Scale). No War Changes:
–Steady improvement until 70s (PC2)
–When auto accidents kicked in

Demography Data Dual PCA Idea: Rows and Columns trade places. Demographic Primal View: Curves are Years, Coord’s are Ages. Demographic Dual View: Curves are Ages, Coord’s are Years. Dual PCA View, Spanish Males

Demography Data Dual PCA View, Spanish Males: Old people have const. mortality (raw), But improvement for rest (raw). Bad for 1918 (flu) & Spanish Civil War, but generally improving (mean). Improves for ages 1-6, then worse (PC1). Big Improvement for young (PC2) (Age Color Key)
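The primal and dual views use the same data matrix, with the roles of rows and columns swapped. The sketch below uses a random matrix with hypothetical sizes in place of the mortality tables, and `pca_directions` is a helper written for this illustration, not a function from the Matlab software.

```python
import numpy as np

rng = np.random.default_rng(2)
n_years, n_ages = 40, 90                      # hypothetical sizes, not the Spanish data
mortality = rng.random((n_years, n_ages))     # rows = years, columns = ages

def pca_directions(data, n_keep=2):
    """Treat rows as data objects: center them, return leading PC directions and scores."""
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_keep], (u * s)[:, :n_keep]

# Primal view: curves are years, coordinates are ages.
primal_dirs, primal_scores = pca_directions(mortality)

# Dual view: curves are ages, coordinates are years (same matrix, transposed).
dual_dirs, dual_scores = pca_directions(mortality.T)

print(primal_dirs.shape, primal_scores.shape)   # directions live in age space
print(dual_dirs.shape, dual_scores.shape)       # directions live in year space
```

The two views also subtract different means: the primal view removes the mean over years, the dual view the mean over ages. The dual PCA is therefore not just the transpose of the primal one.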

Yeast Cell Cycle Data “Gene Expression” – Micro-array data. Data (after major preprocessing): Expression “level” of: thousands of genes (d ~ 1,000s), but only dozens of “cases” (n ~ 10s). Interesting statistical issue: High Dimension Low Sample Size data (HDLSS)
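One concrete consequence of the HDLSS setting can be seen in a few lines. The sizes and the random data below are hypothetical. With n cases in d >> n dimensions, the mean-centered data matrix has rank at most n - 1, so at most n - 1 principal components have nonzero variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cases, d_genes = 20, 2000                       # hypothetical HDLSS sizes
data = rng.standard_normal((n_cases, d_genes))    # stand-in for expression levels

centered = data - data.mean(axis=0)
rank = np.linalg.matrix_rank(centered)

# Centering costs one degree of freedom, so the rank is n_cases - 1, far below d_genes.
print("rank of centered data:", rank)
```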

Yeast Cell Cycle Data Data from: Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998), “Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization”, Molecular Biology of the Cell, 9,

Yeast Cell Cycle Data Analysis here is from: Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14,

Yeast Cell Cycle Data Lab experiment: Chemically “synchronize cell cycles” of yeast cells. Do cDNA micro-arrays over time. Used 18 time points, over “about 2 cell cycles”. Studied 4,489 genes (whole genome). Time series view of data: 4,489 time series of length 18. Functional Data View: 4,489 “curves”

Yeast Cell Cycle Data, FDA View Central question: Which genes are “periodic” over 2 cell cycles?

Yeast Cell Cycle Data, FDA View Periodic genes? Naïve approach: Simple PCA

Yeast Cell Cycle Data, FDA View Central question: which genes are “periodic” over 2 cell cycles? Naïve approach: Simple PCA. No apparent (2 cycle) periodic structure? Eigenvalues suggest large amount of “variation”. PCA finds “directions of maximal variation”: Often, but not always, same as “interesting directions”. Here need better approach to study periodicities

Yeast Cell Cycles, Freq. 2 Proj. PCA on Freq. 2 Periodic Component Of Data

Yeast Cell Cycles, Freq. 2 Proj. PCA on periodic component of data: Hard to see periodicities in raw data, But very clear in PC1 (~sin) and PC2 (~cos). PC1 and PC2 explain 65% of variation (see residuals). Recall linear combos of sin and cos capture “phase” since:
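The formula that followed on the slide did not survive extraction. One standard identity that fits the statement is given here as an assumption about what was shown, with $a$, $b$ the cos and sin coefficients, $f$ the frequency and $\phi$ the phase (symbols chosen for this note):

```latex
a\cos(2\pi f t) + b\sin(2\pi f t)
  = \sqrt{a^{2}+b^{2}}\;\cos\!\left(2\pi f t - \phi\right),
\qquad
\cos\phi = \frac{a}{\sqrt{a^{2}+b^{2}}},\quad
\sin\phi = \frac{b}{\sqrt{a^{2}+b^{2}}}.
```

It follows from expanding $\cos(2\pi f t - \phi) = \cos(2\pi f t)\cos\phi + \sin(2\pi f t)\sin\phi$, so the pair $(a, b)$ in polar coordinates gives amplitude and phase directly.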

Frequency 2 Analysis Important features of data appear only at frequency 2, Hence project data onto 2-dim space of sin and cos (freq. 2) Useful view: scatterplot

Frequency 2 Analysis

Project data onto 2-dim space of sin and cos (freq. 2). Useful view: scatterplot. Angle (in polar coordinates) shows phase. Colors: Spellman’s cell cycle phase classification. Black was labeled “not periodic”. Within class phases approx’ly same, but notable differences. Later will try to improve “phase classification”
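The frequency-2 projection and the phase read-off can be sketched as follows. This assumes 18 equally spaced time points covering exactly two cycles, which the slides only state approximately ("about 2 cell cycles"). The synthetic curves and the function `freq2_projection` are illustrative, not the analysis code of Zhao, Marron and Wells.

```python
import numpy as np

def freq2_projection(curves):
    """Project each row onto the frequency-2 sin and cos vectors; return coefficients."""
    n_time = curves.shape[1]
    t = np.arange(n_time) / n_time                 # record length = 1 (assumed 2 cycles)
    basis = np.vstack([np.sin(2 * np.pi * 2 * t), np.cos(2 * np.pi * 2 * t)])
    basis /= np.linalg.norm(basis, axis=1, keepdims=True)   # unit-length rows
    centered = curves - curves.mean(axis=1, keepdims=True)  # drop each curve's level
    coefs = centered @ basis.T                     # columns: sin coef, cos coef
    return coefs, basis

# Synthetic "genes" with known phases, riding on a constant expression level.
true_phase = np.array([0.0, 0.5, 1.5, -2.0])
t = np.arange(18) / 18
curves = 3.0 + np.cos(2 * np.pi * 2 * t[None, :] - true_phase[:, None])

coefs, basis = freq2_projection(curves)
phase = np.arctan2(coefs[:, 0], coefs[:, 1])       # polar angle in the (cos, sin) plane
amplitude = np.hypot(coefs[:, 0], coefs[:, 1])     # polar radius

print("recovered phases:", np.round(phase, 3))
```

A scatterplot of `coefs[:, 1]` against `coefs[:, 0]` is the 2-d view described above: the angle of each point is the gene's phase, and points near the origin have little frequency-2 content.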