Object Orie’d Data Analysis, Last Time Organizational Matters


Object Orie’d Data Analysis, Last Time Organizational Matters What is OODA? Visualization by Projection Object Space & Feature Space Curves as Data Data Representation Issues PCA visualization

Data Object Conceptualization Object Space ↔ Feature Space Curves Images Manifolds Shapes Tree Space Trees

Functional Data Analysis, Toy EG I

Easy way to do these analyses Matlab software (user friendly?) available: Download & put in Matlab Path: General Smoothing Look first at: curvdatSM.m scatplotSM.m

Easy way to do these analyses Matlab software (user friendly?) available Next time: Spend some time going through these As many students seem to want to use them

Time Series of Curves Again a “Set of Curves” But now Time Order is Important! An approach: Use color to code for time Start End

Time Series Toy E.g. Explore Question: “Is Horizontal Motion Linear Variation?” Example: Set of time shifted Gaussian densities View: Code time with colors as above

T. S. Toy E.g., Raw Data

T. S. Toy E.g., PCA View PCA gives “Modes of Variation” But there are Many… Intuitively Useful??? Like “harmonics”? Isn’t there only 1 mode of variation? Answer comes in 2-d scatterplots

T. S. Toy E.g., PCA Scatterplot

Where is the Point Cloud? Lies along a 1-d curve in Feature Space So actually have 1-d mode of variation But a non-linear mode of variation Poorly captured by PCA (linear method) Will study more later
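A minimal sketch of the toy example above, written in Python with NumPy rather than the Matlab software mentioned earlier. The grid, the number of curves, the shift range and the width `sigma` are illustrative assumptions, not values from the slides. It builds time-shifted Gaussian densities, runs PCA, and reports how the variance spreads over several components even though only one parameter (the shift) varies.

```python
import numpy as np

# Hypothetical toy setup: 30 Gaussian densities on a grid, shifted over "time".
grid = np.linspace(-5.0, 5.0, 201)      # evaluation points (coordinates)
shifts = np.linspace(-2.0, 2.0, 30)     # one horizontal shift per time point
sigma = 0.5                             # common width (assumed)
curves = np.exp(-0.5 * ((grid[None, :] - shifts[:, None]) / sigma) ** 2)
curves /= sigma * np.sqrt(2.0 * np.pi)  # each row is one density curve

# PCA via SVD of the mean-centered data matrix.
centered = curves - curves.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s                          # PC scores, one row per curve
var_explained = s**2 / np.sum(s**2)     # share of variance per component

print(np.round(var_explained[:5], 3))
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by `shifts`, should show the points tracing a single curve rather than a straight line, which is the 2-d scatterplot view the slides refer to.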

Chemo-metric Time Series Mass Spectrometry Measurements On an Aging Substance, called “Estane” Made over Logarithmic Time Grid, n = 60 Each is a Spectrum What about Time Evolution? Approach: PCA & Time Coloring

Chemo-metric Time Series Joint Work w/ E. Kober & J. Wendelberger Los Alamos National Lab Four Experimental Conditions: 1.Control 2.Aged 59 days in Dry Air 3.Aged 27 days in Humid Air 4.Aged 59 days in Humid Air

Chemo-metric Time Series, HA 27

Raw Data: All 60 spectra essentially the same “Scale” of mean is much bigger than variation about mean Hard to see structure of all 1600 freq’s Centered Data: Now can see different spectra Since mean subtracted off Note much smaller vertical axis

Chemo-metric Time Series, HA 27

Data zoomed to “important” freq’s: Raw Data: Now see slight differences Smoother “natural looking” spectra Centered Data: Differences in spectra more clear Maybe now have “real structure” Scale is important

Chemo-metric Time Series, HA 27

Use of Time Order Coloring: Raw Data: Can see a little ordering, not much Centered Data: Clear time ordering Shifting peaks? (compare to Raw) PC1: Almost everything? PC1 Residuals: Data nearly linear (same scale import’nt)

Chemo-metric Time Series, Control

PCA View Clear systematic structure Time ordering very important Reminiscent of Toy Example A clear 1-d curve in Feature Space Physical Explanation?

Toy Data Explanations Simple Chemical Reaction Model: Subst. 1 transforms into Subst. 2 Note: linear path in Feature Space

Toy Data Explanations Richer Chemical Reaction Model: Subst. 1 → Subst. 2 → Subst. 3 Curved path in Feat. Sp. 2 Reactions → Curve lies in 2-dim’al subsp.

Toy Data Explanations Another Chemical Reaction Model: Subst. 1 → Subst. 2 & Subst. 5 → Subst. 6 Curved path in Feat. Sp. 2 Reactions → Curve lies in 2-dim’al subsp.

Toy Data Explanations More Complex Chemical Reaction Model: 1 → 2 → 3 → 4 Curved path in Feat. Sp. (lives in 3-d) 3 Reactions → Curve lies in 3-dim’al subsp.

Toy Data Explanations Even More Complex Chemical Reaction Model: 1 → 2 → 3 → 4 → 5 Curved path in Feat. Sp. (lives in 4-d) 4 Reactions → Curve lies in 4-dim’al subsp.
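The link between the number of reactions and the dimension of the subspace can be checked numerically. The sketch below is a hedged illustration only: it assumes first-order kinetics, made-up reaction rates, made-up Gaussian-bump spectra for each substance, and a simple forward-Euler integration. The helper names `simulate_chain` and `centered_rank` are hypothetical. Each observed spectrum is a mixture of the pure spectra, and the rank of the mean-centered data is counted.

```python
import numpy as np

def simulate_chain(rates, n_times=60, t_max=5.0, n_freq=100):
    """Spectra over time for a hypothetical chain 1 -> 2 -> ... of first-order reactions."""
    m = len(rates) + 1                    # number of substances
    # Rate matrix K for dc/dt = K c: substance j turns into j+1 at rates[j].
    K = np.zeros((m, m))
    for j, r in enumerate(rates):
        K[j, j] -= r
        K[j + 1, j] += r
    # Each substance gets its own (made-up) Gaussian-bump spectrum.
    freq = np.linspace(0.0, 1.0, n_freq)
    centers = np.linspace(0.2, 0.8, m)
    spectra = np.exp(-0.5 * ((freq[None, :] - centers[:, None]) / 0.05) ** 2)
    # Forward-Euler integration, starting from pure substance 1.
    h = t_max / (n_times - 1)
    c = np.zeros(m)
    c[0] = 1.0
    rows = []
    for _ in range(n_times):
        rows.append(c @ spectra)          # observed spectrum = mixture of pure spectra
        c = c + h * (K @ c)
    return np.array(rows)                 # shape (n_times, n_freq)

def centered_rank(data, tol=1e-8):
    """Number of non-negligible singular values after subtracting the mean curve."""
    s = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)
    return int(np.sum(s > tol * s[0]))

for rates in ([1.0], [1.0, 0.5], [1.0, 0.5, 0.25], [1.0, 0.5, 0.25, 2.0]):
    print(len(rates), "reaction(s): centered rank", centered_rank(simulate_chain(rates)))
```

Under these assumptions the centered rank equals the number of reactions, because the concentrations always sum to one and so vary in one dimension fewer than the number of substances. With noisy data the small singular values would no longer be exactly zero, which is the difficulty raised for the real spectra.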

Chemo-metric Time Series, Control

Suggestions from Toy Examples: Clearly 3 reactions under way Maybe a 4th??? Hard to distinguish from noise? Interesting statistical open problem!

Chemo-metric Time Series What about the other experiments? Recall: 1.Control 2.Aged 59 days in Dry Air 3.Aged 27 days in Humid Air 4.Aged 59 days in Humid Air Above results were “cherry picked”, to best make points What about other cases???

Scatterplot Matrix, Control Above E.g., maybe ~4d curve → ~4 reactions

Scatterplot Matrix, Da59 PC2 is “bleeding of CO2”, discussed below

Scatterplot Matrix, Ha27 Only “3-d + noise”? → Only 3 reactions

Scatterplot Matrix, Ha59 Harder to judge???

Object Space View, Control Terrible discretization effect, despite ~4d …

Object Space View, Da59 OK, except strange at beginning (CO2 …)

Object Space View, Ha27 Strong structure in PC1 Resid (d < 2)

Object Space View, Ha59 Lots at beginning, OK since “oldest”

Problem with Da59 What about strange behavior for Da59? Recall: PC2 showed “really different behavior at start” Chemists’ comments: Ignore this, should have started measuring later…

Problem with Da59 But still fun to look at broader spectra

Chemo-metric T. S. Joint View Throw them all together as big population Take Point Cloud View

Chemo-metric T. S. Joint View

Throw them all together as big population Take Point Cloud View Note 4d space of interest, driven by: 4 clusters (3d) PC1 of chemical reaction (1-d) But these don’t appear as the 4 PCs Chem. PC1 “spread over PC2,3,4” Essentially a “rotation of interesting dir’ns” How to “unrotate”???

Chemo-metric T. S. Joint View Interesting Variation: Remove cluster means Allows clear comparison of within curve variation

Chemo-metric T. S. Joint View (- mean)

Chemo-metric T. S. Joint View Interesting Variation: Remove cluster means Allows clear comparison of within curve variation: PC1 versus others are quite revealing (note different “rotations”) Others don’t show so much
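A small simulated sketch of why removing cluster means helps. Everything here is assumed for illustration: four groups with well-separated random means, one shared within-group direction `within_dir`, and a small amount of noise. It compares the leading PC direction of the pooled data with the leading PC direction after each group's own mean is subtracted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per = 20, 60
labels = np.repeat(np.arange(4), n_per)          # 4 hypothetical experiments

# Made-up structure: separated group means plus a shared within-group direction.
group_means = 5.0 * rng.standard_normal((4, d))
within_dir = np.zeros(d)
within_dir[0] = 1.0
t = np.tile(np.linspace(-1.0, 1.0, n_per), 4)    # "time" position within each group
noise = 0.05 * rng.standard_normal((4 * n_per, d))
data = group_means[labels] + np.outer(t, within_dir) + noise

def first_pc(x):
    """Leading PC direction of x after subtracting the overall mean."""
    _, _, vt = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)
    return vt[0]

# Subtract each group's own mean so only within-group variation remains.
means_hat = np.vstack([data[labels == g].mean(axis=0) for g in range(4)])
within = data - means_hat[labels]

print("pooled PC1 alignment:     ", abs(first_pc(data) @ within_dir))
print("mean-removed PC1 alignment:", abs(first_pc(within) @ within_dir))
```

In this setup the pooled PC1 is driven by the differences between group means, while the mean-removed PC1 lines up closely with the within-group direction, so the within-curve variation becomes directly comparable across groups.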

Demography Data Joint Work with: Andres Alonso Univ. Carlos III, Madrid Mortality, as a function of age “Chance of dying”, for Males, in Spain of each 1-year age group Curves are years PCA of the family of curves

Demography Data PCA of the family of curves for Males Babies & elderly “most mortal” (Raw) All getting better over time (Raw & PC1) Except Influenza Pandemic (see Color Scale) Middle age most mortal (PC2): –1918 –Early 1930s - Spanish Civil War –1980 – 1994 (then better) auto wrecks Decade Rounding (several places)

Demography Data PCA for Females in Spain Most aspects similar (see Color Scale) No War Changes –Steady improvement until 70s (PC2) –When auto accidents kicked in

Demography Data PCA for Males in Switzerland Most aspects similar No decade rounding (better records) 1918 Flu – Different Color (PC2) (see Color Scale) No War Changes –Steady improvement until 70s (PC2) –When auto accidents kicked in

Demography Data Dual PCA Idea: Rows and Columns trade places Terminology: from optimization Insights come from studying “primal” & “dual” problems

Primal / Dual PCA Consider “Data Matrix”

Primal / Dual PCA Consider “Data Matrix” Primal Analysis: Columns are data vectors

Primal / Dual PCA Consider “Data Matrix” Dual Analysis: Rows are data vectors

Demography Data Dual PCA Idea: Rows and Columns trade places Demographic Primal View: Curves are Years, Coord’s are Ages Demographic Dual View: Curves are Ages, Coord’s are Years Dual PCA View, Spanish Males
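The primal and dual views can be sketched with one function applied to a matrix and to its transpose. This is an illustration on a small random matrix, not the mortality data; the function name `pca_columns_as_objects` and the 5 x 8 size are hypothetical. It also shows that the two analyses differ only through the centering step.

```python
import numpy as np

def pca_columns_as_objects(x):
    """PCA treating each column of x as one data vector (the 'primal' view)."""
    centered = x - x.mean(axis=1, keepdims=True)   # subtract the mean column
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return u, s, vt          # directions, singular values, per-object score patterns

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))   # hypothetical matrix: 5 ages (rows) x 8 years (columns)

_, s_primal, _ = pca_columns_as_objects(x)      # curves are years, coordinates are ages
_, s_dual, _ = pca_columns_as_objects(x.T)      # curves are ages, coordinates are years

# Without centering, X and its transpose share the same singular values.
s_raw = np.linalg.svd(x, compute_uv=False)
s_raw_t = np.linalg.svd(x.T, compute_uv=False)
print(np.allclose(s_raw, s_raw_t))      # True
print(np.allclose(s_primal, s_dual))    # differs here, because the centerings differ
```

The uncentered decompositions coincide, so any difference between primal and dual PCA comes from which mean is subtracted (the mean column or the mean row).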

Demography Data Dual PCA View, Spanish Males Old people have const. mortality (raw) But improvement for rest (raw) Bad for 1918 (flu) & Spanish Civil War, but generally improving (mean) Improves for ages 1-6, then worse (PC1) Big Improvement for young (PC2) (Age Color Key)

Primal / Dual PCA Reference: Gabriel, K. R. (1971) The biplot display of matrices with application to principal component analysis, Biometrika, 58, 467. Will study more later “Centering” is a critical issue

Yeast Cell Cycle Data “Gene Expression” – Micro-array data Data (after major preprocessing): Expression “level” of: thousands of genes (d ~ 1,000s) but only dozens of “cases” (n ~ 10s) Interesting statistical issue: High Dimension Low Sample Size data (HDLSS)
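One practical consequence of HDLSS data is that PCA never needs the full d x d covariance matrix. The sketch below uses a standard device that the slides do not spell out: with n cases, the centered data have at most n - 1 nonzero eigenvalues, and they can be obtained from the n x n Gram matrix. The sizes and the random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 18, 4000                      # few cases, many genes (illustrative sizes)
x = rng.standard_normal((n, d))      # rows are cases, columns are genes
xc = x - x.mean(axis=0)              # center each gene across cases

# Eigen-decompose the small n x n Gram matrix instead of the d x d covariance.
gram = xc @ xc.T
evals, evecs = np.linalg.eigh(gram)              # returned in ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]       # reorder to descending

k = n - 1                                        # centering removes one dimension
directions = (xc.T @ evecs[:, :k]) / np.sqrt(evals[:k])   # unit PC directions in R^d

# Cross-check against a direct SVD of the centered data.
s = np.linalg.svd(xc, compute_uv=False)
print(np.allclose(evals[:k], s[:k] ** 2))
```

The columns of `directions` are orthonormal vectors in gene space, obtained at the cost of an 18 x 18 eigenproblem.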

Yeast Cell Cycle Data Data from: Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998), “Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization”, Molecular Biology of the Cell, 9,

Yeast Cell Cycle Data Analysis here is from: Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14,

Yeast Cell Cycle Data Lab experiment: Chemically “synchronize cell cycles”, of yeast cells Do cDNA micro-arrays over time Used 18 time points, over “about 2 cell cycles” Studied 4,489 genes (whole genome) Time series view of data: 4,489 time series of length 18 Functional Data View: 4,489 “curves”

Yeast Cell Cycle Data, FDA View Central question: Which genes are “periodic” over 2 cell cycles?
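One simple way to screen for periodicity, offered here as an assumption rather than as the method used in the cited analysis, is to ask how much of each curve's variation sits at exactly two cycles over the record. The sketch treats the 18 time points as equally spaced and as spanning exactly two cycles, which the slides describe only as “about 2 cell cycles”. The function name `two_cycle_energy_fraction` is hypothetical.

```python
import numpy as np

def two_cycle_energy_fraction(curves):
    """Share of each centered curve's sum of squares at exactly 2 cycles over the record.

    curves: array of shape (n_genes, n_time), rows are expression curves.
    Assumes equally spaced time points and non-constant curves.
    """
    centered = curves - curves.mean(axis=1, keepdims=True)
    n_time = curves.shape[1]
    t = np.arange(n_time) / n_time
    # Cosine and sine at 2 cycles; orthogonal on an equally spaced grid.
    basis = np.vstack([np.cos(4 * np.pi * t), np.sin(4 * np.pi * t)])
    basis /= np.linalg.norm(basis, axis=1, keepdims=True)
    coef = centered @ basis.T                    # projection coefficients
    total = np.sum(centered**2, axis=1)
    return np.sum(coef**2, axis=1) / total

# Two made-up curves on 18 time points: one periodic, one a linear trend.
t = np.arange(18) / 18
demo = np.vstack([np.sin(4 * np.pi * t + 0.3), t])
frac = two_cycle_energy_fraction(demo)
print(np.round(frac, 3))
```

A value near 1 flags a curve that is almost entirely a two-cycle sinusoid, while a trend-like curve scores much lower. Real data would need a threshold and some allowance for the cycle length not being exactly half the record.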