Statistics – O. R. 893 Object Oriented Data Analysis Steve Marron Dept. of Statistics and Operations Research University of North Carolina.

Slides:



Advertisements
Similar presentations
Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When.
Advertisements

Stor 155, Section 2, Last Time Distributions (how are data “spread out”?) Visual Display: Histograms –Binwidth is critical Time Plots = Time Series Course.
Yeast Cell Cycles, Freq. 2 Proj. PCA on Freq. 2 Periodic Component Of Data.
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
Object Orie’d Data Analysis, Last Time Finished NCI 60 Data Started detailed look at PCA Reviewed linear algebra Today: More linear algebra Multivariate.
Dimensional reduction, PCA
‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Exploring Microarray data Javier Cabrera. Outline 1.Exploratory Analysis Steps. 2.Microarray Data as Multivariate Data. 3.Dimension Reduction 4.Correlation.
PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS Prof.Dr.Cevdet Demir
Assumption of Homoscedasticity
Lecture II-2: Probability Review
Matlab Software To Do Analyses as in Marron’s Talks Matlab Available from UNC Site License Download Software: Google “Marron Software”
Separate multivariate observations
Statistics – O. R. 892 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina.
Object Orie’d Data Analysis, Last Time Finished Algebra Review Multivariate Probability Review PCA as an Optimization Problem (Eigen-decomp. gives rotation,
Describing distributions with numbers
Chapter 2 Dimensionality Reduction. Linear Methods
Principal Components Analysis BMTRY 726 3/27/14. Uses Goal: Explain the variability of a set of variables using a “small” set of linear combinations of.
Statistics – O. R. 892 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina.
Object Orie’d Data Analysis, Last Time Gene Cell Cycle Data Microarrays and HDLSS visualization DWD bias adjustment NCI 60 Data Today: More NCI 60 Data.
Object Orie’d Data Analysis, Last Time
Graphical Summary of Data Distribution Statistical View Point Histograms Skewness Kurtosis Other Descriptive Summary Measures Source:
Stat 155, Section 2, Last Time Numerical Summaries of Data: –Center: Mean, Medial –Spread: Range, Variance, S.D., IQR 5 Number Summary & Outlier Rule Transformation.
Statistics – O. R. 891 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina.
Object Orie’d Data Analysis, Last Time Gene Cell Cycle Data Microarrays and HDLSS visualization DWD bias adjustment NCI 60 Data Today: Detailed (math ’
1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, II J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina.
Stat 31, Section 1, Last Time Time series plots Numerical Summaries of Data: –Center: Mean, Medial –Spread: Range, Variance, S.D., IQR 5 Number Summary.
Statistics – OR 155, Section 2 J. S. Marron, Professor Department of Statistics and Operations Research.
SWISS Score Nice Graphical Introduction:. SWISS Score Toy Examples (2-d): Which are “More Clustered?”
Last Time Normal Distribution –Density Curve (Mound Shaped) –Family Indexed by mean and s. d. –Fit to data, using sample mean and s.d. Computation of Normal.
Data Science and Big Data Analytics Chap 3: Data Analytics Using R
Stat 31, Section 1, Last Time Distributions (how are data “spread out”?) Visual Display: Histograms Binwidth is critical Bivariate display: scatterplot.
Return to Big Picture Main statistical goals of OODA: Understanding population structure –Low dim ’ al Projections, PCA … Classification (i. e. Discrimination)
PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS Richard Brereton
Participant Presentations Please Sign Up: Name (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Thurs., Early, Oct., Nov.,
Stat 31, Section 1, Last Time Course Organization & Website What is Statistics? Data types.
Statistics – O. R. 892 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina.
Statistics – O. R. 893 Object Oriented Data Analysis Steve Marron Dept. of Statistics and Operations Research University of North Carolina.
Participant Presentations Please Prepare to Sign Up: Name (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Next Week, Early,
Object Orie’d Data Analysis, Last Time PCA Redistribution of Energy - ANOVA PCA Data Representation PCA Simulation Alternate PCA Computation Primal – Dual.
1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, I J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina.
Object Orie’d Data Analysis, Last Time PCA Redistribution of Energy - ANOVA PCA Data Representation PCA Simulation Alternate PCA Computation Primal – Dual.
1 Statistics & R, TiP, 2011/12 Multivariate Methods  Multivariate data  Data display  Principal component analysis Unsupervised learning technique 
Participant Presentations Please Sign Up: Name (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Next Week, Early, Oct.,
GWAS Data Analysis. L1 PCA Challenge: L1 Projections Hard to Interpret (i.e. Little Data Insight) Solution: 1)Compute PC Directions Using L1 2)Compute.
Object Orie’d Data Analysis, Last Time Reviewed Clustering –2 means Cluster Index –SigClust When are clusters really there? Q-Q Plots –For assessing Goodness.
PCA Data Represent ’ n (Cont.). PCA Simulation Idea: given Mean Vector Eigenvectors Eigenvalues Simulate data from Corresponding Normal Distribution.
Principal Components Analysis ( PCA)
1 Objective To provide background material in support of topics in Digital Image Processing that are based on matrices and/or vectors. Review Matrices.
Object Orie’d Data Analysis, Last Time Organizational Matters
Cornea Data Main Point: OODA Beyond FDA Recall Interplay: Object Space  Descriptor Space.
Selecting Input Probability Distributions. 2 Introduction Part of modeling—what input probability distributions to use as input to simulation for: –Interarrival.
Part 3: Estimation of Parameters. Estimation of Parameters Most of the time, we have random samples but not the densities given. If the parametric form.
Statistical Smoothing
Last Time Proportions Continuous Random Variables Probabilities
SiZer Background Finance "tick data":
Linear Algebra Review.
Object Orie’d Data Analysis, Last Time
Exploring Microarray data
Stat 31, Section 1, Last Time Sampling Distributions
Statistics – O. R. 881 Object Oriented Data Analysis
Statistics – O. R. 881 Object Oriented Data Analysis
Participant Presentations
Participant Presentations
Dimension reduction : PCA and Clustering
Participant Presentations
Statistics – O. R. 891 Object Oriented Data Analysis
Presentation transcript:

Statistics – O. R. 893 Object Oriented Data Analysis Steve Marron Dept. of Statistics and Operations Research University of North Carolina

Administrative Info Details on Course Web Page Or: –Google: “Marron Courses” –Choose This Course Go Through These

Object Oriented Data Analysis What is it? A Sound-Bite Explanation: What is the “atom of the statistical analysis”? 1 st Course: Numbers Multivariate Analysis Course : Vectors Functional Data Analysis: Curves More generally: Data Objects

Object Oriented Data Analysis Three Major Parts of OODA Applications: I. Object Definition “What are the Data Objects?” II.Exploratory Analysis “What Is Data Structure / Drivers?” III. Confirmatory Analysis / Validation Is it Really There (vs. Noise Artifact)?

Object Oriented Data Analysis I. Object Definition “What are the Data Objects?” Generally Not Widely Appreciated

Object Oriented Data Analysis II.Exploratory Analysis “What Is Data Structure / Drivers?” Understood by Some in Statistics Classical Reference: Tukey (1977) Better Understood in Machine Learning

Object Oriented Data Analysis III. Confirmatory Analysis / Validation Is it Really There (vs. Noise Artifact)? Primary Focus of Modern “Statistics” E.g. STOR & Biostat PhD Curriculum Less So In Machine Learning

Interesting Data Set: Mortality Data For Spanish Males (thus can relate to history) Each curve is a single year x coordinate is age Note: Choice made of Data Object (could also study age as curves, x coordinate = time) Functional Data Analysis I. Object Definition

Interesting Data Set: Mortality Data For Spanish Males (thus can relate to history) Each curve is a single year x coordinate is age Mortality = # died / total # (for each age) Study on log scale Another Data Object Choice (not about experimental units) Functional Data Analysis I. Object Definition

Mortality Time Series Improved Coloring: Rainbow Representing Year: Magenta = 1908 Red = 2002 II. Exploratory Analysis

Mortality Time Series Scores Plot Descriptor (Point Cloud) Space View Connecting Lines Highlight Time Order Good View of Historical Effects

T. S. Toy E.g., PCA View PCA gives “Modes of Variation” But there are Many… Intuitively Useful??? Like “harmonics”? Isn’t there only 1 mode of variation? Answer comes in scores scatterplots

T. S. Toy E.g., PCA Scatterplot

Chemo-metric Time Series, Control II. Exploratory Analysis

Functional Data Analysis Interesting Real Data Example Genetics (Cancer Research) RNAseq (Next Gener’n Sequen’g) Deep look at “gene components” Microarrays: Single number (per gene) RNAseq: Thousands of measurements I. Object Definition

Interesting Real Data Example Genetics (Cancer Research) RNAseq (Next Gener’n Sequen’g) Deep look at “gene components” Gene studied here: CDNK2A Goal: Study Alternate Splicing Sample Size, n = 180 Dimension, d = ~1700 Functional Data Analysis

Simple 1 st View: Curve Overlay (log scale) Functional Data Analysis I. Object Definition

Often Useful Population View: PCA Scores Functional Data Analysis

Suggestion Of Clusters ??? Functional Data Analysis

Suggestion Of Clusters Which Are These? Functional Data Analysis

Manually “Brush” Clusters Functional Data Analysis II. Exploratory Analysis

Manually Brush Clusters Clear Alternate Splicing Functional Data Analysis II. Exploratory Analysis

Important Points PCA found Important Structure In High Dimensional Data Analysis d ~ 1700 Functional Data Analysis

Consequences:  Led to Development of SigFuge  Whole Genome Scan Found Interesting Genes  Published in Kimes, et al (2014) Functional Data Analysis

Interesting Question: When are clusters really there? (will study later) Functional Data Analysis III. Confirmatory Analysis

Course Background I Linear Algebra Please Check Familiarity No? Read Up in Linear Algebra Text Or Wikipedia?

Course Background I

Linear Algebra Key Concepts Spectral Representation Pythagorean Theorem ANOVA Decomposition (Sums of Squares) Parseval Identity / Inequality Projection (Vector onto a Subspace) Projection Operator / Matrix (Real) Unitary Matrices

Course Background I Linear Algebra Key Concepts Late will look at more carefully: Singular Value Decomposition Eigenanalysis Generalized Inverse

Course Background II MultiVariate Probability Again Please Check Familiarity No? Read Up in Probability Text Or Wikipedia?

Course Background II MultiVariate Probability Probability Distributions Theoretical (Underlying Population) Mean Theoretical Covariance Matrix Random Sampling Sample Mean Sample Covariance Matrix

Course Background II

Review of Multivar. Prob. (Cont.)

Outer Product Representation:

Review of Multivar. Prob. (Cont.) Outer Product Representation:, Where:

Review of Multivar. Prob. (Cont.)

Limitation of PCA PCA can provide useful projection directions

Limitation of PCA PCA can provide useful projection directions But can’t “see everything”…

Limitation of PCA PCA can provide useful projection directions But can’t “see everything”… Reason: PCA finds dir’ns of maximal variation Which may obscure interesting structure

Limitation of PCA Toy Example: Apple – Banana – Pear

Limitation of PCA, Toy E.g.

Limitation of PCA Toy Example: Apple – Banana – Pear Obscured by “noisy dimensions” 1 st 3 PC directions only show noise

Limitation of PCA Toy Example: Apple – Banana – Pear Obscured by “noisy dimensions” 1 st 3 PC directions only show noise Study some rotations, to find structure

Limitation of PCA, Toy E.g.

Limitation of PCA Toy Example: Rotation shows Apple – Banana – Pear Example constructed as: 1 st make these in 3-d Add 3 dimensions of high s.d. noise Carefully watch axis labels

Limitation of PCA Main Point:  May be Important Data Structure  Not Visible in 1 st Few PCs

Limitation of PCA, E.g. Interesting Data Set: NCI-60  NCI = National Cancer Institute  60 Cell Lines (cancer treatment targets) For Different Cancer Types  Measured “Gene Expression” = “Gene Activity”  Several Thousand Genes (Simultaneously)  Data Objects = Vectors of Gene Exp’n  Lots of Preprocessing (study later)

NCI 60 Data Important Aspect: 8 Cancer Types Renal Cancer Non Small Cell Lung Cancer Central Nervous System Cancer Ovarian Cancer Leukemia Cancer Colon Cancer Breast Cancer Melanoma (Skin)

PCA Visualization of NCI 60 Data Can we find classes: (Renal, CNS, Ovar, Leuk, Colon, Melan) Using PC directions?? First try “unsupervised view” I.e. switch off class colors Then turn on colors, to identify clusters I.e. look at “supervised view” II. Exploratory Analysis

NCI 60: Can we find classes Using PCA view?

PCA Visualization of NCI 60 Data Maybe need to look at more PCs? Study array of such PCA projections:

NCI 60: Can we find classes Using PCA 5-8?

NCI 60: Can we find classes Using PCA 9-12?

NCI 60: Can we find classes Using PCA 1-4 vs. 5-8?

NCI 60: Can we find classes Using PCA 1-4 vs. 9-12?

NCI 60: Can we find classes Using PCA 5-8 vs. 9-12?

PCA Visualization of NCI 60 Data Can we find classes using PC directions?? Found some, but not others Nothing after 1 st five PCs Rest seem to be noise driven Are There Better Directions? PCA only “feels” maximal variation Ignores Class Labels How Can We Use Class Labels?

Matlab Software Want to try similar analyses? Matlab Available from UNC Site License Download Software: Google “Marron Software”

Matlab Software Choose

Matlab Software Download.zip File, & Expand to 3 Directories

Matlab Software Put these in Matlab Path

Matlab Software Put these in Matlab Path

Matlab Basics Matlab has Modalities:  Interpreted (Type Commands & Run Individually)  Batch (Run “Script Files” = Command Sets)

Matlab Basics Matlab in Interpreted Mode:

Matlab Basics Matlab in Interpreted Mode:

Matlab Basics Matlab in Interpreted Mode:

Matlab Basics Matlab in Interpreted Mode:

Matlab Basics Matlab in Interpreted Mode:

Matlab Basics Matlab in Interpreted Mode:

Matlab Basics Matlab in Interpreted Mode: For description of a function: >> help [function name]

Matlab Basics Matlab in Interpreted Mode:

Matlab Basics Matlab in Interpreted Mode: To Find Functions: >> help [category name] e.g. >> help stats

Matlab Basics Matlab in Interpreted Mode:

Matlab Basics Matlab has Modalities:  Interpreted (Type Commands)  Batch (Run “Script Files”) For Serious Scientific Computing: Always Run Scripts

Matlab Basics Matlab Script File:  Just a List of Matlab Commands  Matlab Executes Them in Order Why Bother (Why Not Just Type Commands)? Reproducibility (Can Find Mistakes & Use Again Much Later)

Matlab Script Files An Example: Recall “Brushing Analysis” of Next Generation Sequencing Data

Simple 1 st View: Curve Overlay (log scale) Functional Data Analysis

Often Useful Population View: PCA Scores Functional Data Analysis

Suggestion Of Clusters ??? Functional Data Analysis

Suggestion Of Clusters Which Are These? Functional Data Analysis

Manually “Brush” Clusters Functional Data Analysis

Manually Brush Clusters Clear Alternate Splicing Functional Data Analysis

Matlab Script Files An Example: Recall “Brushing Analysis” of Next Generation Sequencing Data Analysis In Script File: VisualizeNextGen2011.m Matlab Script File Suffix

Matlab Script Files An Example: Recall “Brushing Analysis” of Next Generation Sequencing Data Analysis In Script File: VisualizeNextGen2011.m Matlab Script File Suffix On Course Web Page

Matlab Script Files String of Text

Matlab Script Files Command to Display String to Screen

Matlab Script Files Notes About Data (Maximizes Reproducibility)

Matlab Script Files Have Index for Each Part of Analysis

Matlab Script Files So Keep Everything Done (Max’s Reprod’ity)

Matlab Script Files Note Some Are Graphics Shown (Can Repeat)

Matlab Script Files Set Graphics to Default

Matlab Script Files Put Different Program Parts in IF-Block

Matlab Script Files Comment Out Currently Unused Commands

Matlab Script Files Read Data from Excel File

Matlab Script Files For Generic Functional Data Analysis:

Matlab Script Files Input Data Matrix

Matlab Script Files Structure, with Other Settings

Matlab Script Files Make Scores Scatterplot

Matlab Script Files Uses Careful Choice of Color Matrix

Matlab Script Files Start with PCA

Matlab Script Files Then Create Color Matrix

Matlab Script Files Black Red Blue

Matlab Script Files Run Script Using Filename as a Command

Big Picture Data Visualization

Marginal Distribution Plots Ideas:  For Each Variable  Study 1-d Distributions  SimultaneousViews:  Jitter Plots  Smooth Histograms (Kernel Density Est’s)

Marginal Distribution Plots

Representative Subset:  Sort on a Data Summary  E.g Mean, S.D., Skewness, Median, …  Usually Take Nearly Equally Spaced Grid  Sometimes Specify  E.g. All Largest (or Smallest) Summaries

Marginal Distribution Plots Note Often Useful for: I.Object Definition  Which Variables to Include?  How to Scale Them?  Need to Transform (e.g. log)?

Marginal Distribution Plots

Summary Plot with Equal Spacing

Marginal Distribution Plots Dashed Lines Correspond To 1-d Distn’s

Marginal Distribution Plots Wide Range Of Poissons

Marginal Distribution Plots

Additional Summaries: Skewness Measure Asymmetry Left Skewed Right Skewed (Thanks to Wikipedia)

Marginal Distribution Plots

Additional Summaries: Kurtosis 4-th Moment Summary Nice Graphic (from mvpprograms.com)

Marginal Distribution Plots Additional Summaries: Kurtosis 4-th Moment Summary Positive Big at Peak And Tails (with controlled variance)

Marginal Distribution Plots Additional Summaries: Kurtosis 4-th Moment Summary Negative Big in Flanks

Marginal Distribution Plots