Participant Presentations Please Prepare to Sign Up: Name Email (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Next Week, Early,

Presentation transcript:

Participant Presentations Please Prepare to Sign Up: Name (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Next Week, Early, Oct., Nov., Late

Object Oriented Data Analysis Three Major Parts of OODA Applications: I. Object Definition “What are the Data Objects?” II. Exploratory Analysis “What Is Data Structure / Drivers?” III. Confirmatory Analysis / Validation Is it Really There (vs. Noise Artifact)?

Interesting Real Data Example Genetics (Cancer Research) RNAseq (Next Gener’n Sequen’g) Deep look at “gene components” Gene studied here: CDKN2A Goal: Study Alternate Splicing Sample Size, n = 180 Dimension, d = ~1700 Functional Data Analysis

Simple 1st View: Curve Overlay (log scale) Functional Data Analysis I. Object Definition

Often Useful Population View: PCA Scores Functional Data Analysis
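A minimal sketch of how PCA scores like those in the population view can be computed, assuming the data are arranged as an n x d matrix with one curve per row. The function name `pca_scores`, the toy dimensions, and the simulated two-cluster data are illustrative assumptions, not the course software or the RNAseq data.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return PCA scores, loadings, and variances for an (n, d) data matrix."""
    Xc = X - X.mean(axis=0)                      # center each variable (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]       # projections on PC directions
    loadings = Vt[:n_components].T                         # PC directions as columns
    variances = s[:n_components] ** 2 / (X.shape[0] - 1)   # variance along each PC
    return scores, loadings, variances

# Hypothetical toy data: 180 cases, 50 variables, two clusters of 90 cases each.
rng = np.random.default_rng(0)
X = rng.normal(size=(180, 50))
X[:90, :10] += 3.0            # shift the first cluster in 10 coordinates

scores, loadings, variances = pca_scores(X, 2)
```

A scatterplot of `scores[:, 0]` against `scores[:, 1]` is the kind of view in which clusters can then be brushed by hand.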

Manually “Brush” Clusters Functional Data Analysis II. Exploratory Analysis

Manually Brush Clusters Clear Alternate Splicing Functional Data Analysis II. Exploratory Analysis

Course Background I Linear Algebra Please Check Familiarity No? Read Up in Linear Algebra Text Or Wikipedia?

Course Background II MultiVariate Probability Again Please Check Familiarity No? Read Up in Probability Text Or Wikipedia?

Limitation of PCA Toy Example: Apple – Banana – Pear Obscured by “noisy dimensions” 1st 3 PC directions only show noise Study some rotations, to find structure

Limitation of PCA, Toy E.g.

Limitation of PCA Main Point: May be Important Data Structure Not Visible in 1st Few PCs
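The main point can be illustrated with a small simulation. This is a hedged sketch, not the Apple – Banana – Pear example itself: it assumes two clusters separated along one low-variance coordinate, plus three high-variance pure-noise coordinates. All sizes and scales are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
labels = np.repeat([-1.0, 1.0], n // 2)

X = np.empty((n, 4))
X[:, 0] = 2.0 * labels + rng.normal(scale=0.2, size=n)   # cluster structure, variance about 4
X[:, 1:] = rng.normal(scale=10.0, size=(n, 3))           # pure noise, variance about 100

Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                          # columns are PC directions, by decreasing variance

# Absolute weight of the structured coordinate (coordinate 0) in each PC direction.
structure_weight = np.abs(loadings[0])
```

Here the first three entries of `structure_weight` are near 0 and the fourth is near 1: the cluster structure only appears in the last PC, because PCA ranks directions by variation alone.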

PCA Visualization of NCI 60 Data Can we find classes: (Renal, CNS, Ovar, Leuk, Colon, Melan) Using PC directions?? First try “unsupervised view” I.e. switch off class colors Then turn on colors, to identify clusters I.e. look at “supervised view” II. Exploratory Analysis

NCI 60: Can we find classes Using PCA view?

NCI 60: Can we find classes Using PCA 1-4 vs. 5-8?

NCI 60: Can we find classes Using PCA 5-8 vs. 9-12?

PCA Visualization of NCI 60 Data Can we find classes using PC directions?? Found some, but not others Nothing after 1st five PCs Rest seem to be noise driven Are There Better Directions? PCA only “feels” maximal variation Ignores Class Labels How Can We Use Class Labels?

Visualization of NCI 60 Data How Can We Use Class Labels? Approach: o Find Directions to “Best Separate” Classes o In Disjoint Pairs (thus 4 Directions) o Use DWD: Distance Weighted Discrimination o Defined (& Motivated) Later o Project All Data on These 4 Directions

NCI 60: Views using DWD Dir’ns (focus on biology)

DWD Visualization of NCI 60 Data Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan) Using these carefully chosen directions Others less clear cut –NSCLC (at least 3 subtypes) –Breast (4 published subtypes)

DWD Visualization Recall PCA limitations DWD uses class info Hence can “better separate known classes” Do this for pairs of classes (DWD just on those, ignore others) Carefully choose pairs in NCI 60 data Note DWD Directions Not Orthogonal (PCA orthogonality may be too strong a constraint)
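A rough sketch of the "directions from disjoint class pairs" idea. DWD itself is only defined later in the course, so this sketch swaps in the simple mean-difference direction as a stand-in; it is not DWD. The class names, centers, and sample sizes are hypothetical.

```python
import numpy as np

def mean_difference_direction(X_pos, X_neg):
    """Unit vector from the mean of one class to the mean of the other (stand-in for DWD)."""
    v = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(2)
d, n = 5, 20
centers = {                      # hypothetical class centers
    "A": [3, 0, 0, 0, 0], "B": [-3, 0, 0, 0, 0],
    "C": [2, 2, 0, 0, 0], "D": [-2, -2, 0, 0, 0],
}
data = {k: np.asarray(c, dtype=float) + rng.normal(scale=0.5, size=(n, d))
        for k, c in centers.items()}

# One direction per disjoint pair of classes, using only that pair's data.
dir_ab = mean_difference_direction(data["A"], data["B"])
dir_cd = mean_difference_direction(data["C"], data["D"])

# Project ALL data on the chosen directions.
directions = np.column_stack([dir_ab, dir_cd])        # shape (d, 2)
all_data = np.vstack(list(data.values()))             # shape (4n, d)
projections = all_data @ directions                   # shape (4n, 2)

cosine = float(dir_ab @ dir_cd)   # far from 0: the directions are not orthogonal
```

Unlike PC directions, nothing forces these class-driven directions to be orthogonal, which is the flexibility the slide points to.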

NCI 60: Views using DWD Dir’ns (focus on biology)

PCA Visualization of NCI 60 Data Can we find classes using PC directions?? Found some, but not others Not so distinct as in DWD view Nothing after 1st five PCs Rest seem to be noise driven Orthogonality too strong a constraint??? Interesting dir’ns are nearly orthogonal

Matlab Software Want to try similar analyses? Matlab Available from UNC Site License Download Software: Google “Marron Software”

Matlab Software Download .zip File, & Expand to 4 Directories

Big Picture Data Visualization

Marginal Distribution Plots Ideas: For Each Variable Study 1-d Distributions Simultaneous Views: Jitter Plots Smooth Histograms (Kernel Density Est’s)

Marginal Distribution Plots Dashed Lines Correspond To 1-d Distn’s

Marginal Distribution Plots Additional Summaries: Skewness Measure Asymmetry Left Skewed Right Skewed (Thanks to Wikipedia)

Marginal Distribution Plots Additional Summaries: Kurtosis 4-th Moment Summary Nice Graphic (from mvpprograms.com)

Marginal Distribution Plots Additional Summaries: Kurtosis 4-th Moment Summary Positive Big at Peak And Tails (with controlled variance)

Marginal Distribution Plots Additional Summaries: Kurtosis 4-th Moment Summary Negative Big in Flanks
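The two summaries can be written as standardized central moments. This sketch assumes the common convention of subtracting 3 from the 4th-moment ratio (so that the Gaussian scores 0), which matches the positive / negative language on the slides; the helper names and toy samples are illustrative.

```python
def central_moment(values, k):
    """k-th central moment with the 1/n normalization."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """3rd standardized moment: negative = left skewed, positive = right skewed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """4th standardized moment minus 3 (the Gaussian value)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric_flat = [1, 2, 3, 4, 5]                      # no asymmetry, mass in the flanks
right_skewed = [1, 1, 1, 2, 10]                       # long right tail
peak_and_tails = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]    # big at peak and tails

print(skewness(right_skewed), excess_kurtosis(symmetric_flat),
      excess_kurtosis(peak_and_tails))
```

Flipping the sign of every value in `right_skewed` flips the sign of its skewness, giving the left-skewed case.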

Marginal Distribution Plots

Matlab Software: MargDistPlotSM.m In General Directory

Marg. Dist. Plot Data Example

Drug Discovery – PCA Scatterplot Dominated By Few Large Compounds Not Good Blue - Red Separation

Marg. Dist. Plot Data Example Drug Discovery – Sort on Means Very Few Are Crazy Big Will They Dominate PCA???

Marg. Dist. Plot Data Example Drug Discovery – Sort on Means Note: Many With No Variation???

Marg. Dist. Plot Data Example Drug Discovery – Sort on Means Suspicious Value ???

Marg. Dist. Plot Data Example Drug Discovery – Sort on Means Note: Sub-Densities Highlight Blue – Red Differences

Marg. Dist. Plot Data Example Drug Discovery – Sort on Means Note: Descriptor Names

Marg. Dist. Plot Data Example Drug Discovery – Sort on Standard Deviation Very Large Number of 0-Variation Variables (No Useful Info…)
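A hedged sketch of the "sort on a summary, then show a subset" logic behind these views. This is not the code of MargDistPlotSM.m; the function `pick_variables`, its `mode` argument, and the toy columns are assumptions made for illustration.

```python
import statistics

def pick_variables(columns, summary, n_show, mode="equal"):
    """Rank variables by a summary statistic and choose which ones to display.

    columns: dict mapping variable name -> list of values
    mode: "equal" (equally spaced ranks), "smallest", or "largest"
    """
    ranked = sorted(columns, key=lambda name: summary(columns[name]))
    if n_show >= len(ranked):
        return ranked
    if mode == "smallest":
        return ranked[:n_show]
    if mode == "largest":
        return ranked[-n_show:]
    if n_show == 1:
        return [ranked[0]]
    step = (len(ranked) - 1) / (n_show - 1)       # equally spaced, both ends included
    return [ranked[round(i * step)] for i in range(n_show)]

columns = {                                        # hypothetical descriptors
    "a": [0, 0, 0, 0],
    "b": [1, 2, 3, 4],
    "c": [10, 20, 30, 40],
    "d": [5, 5, 5, 5],
    "e": [100, 300, 200, 400],
}

by_mean = pick_variables(columns, statistics.mean, 3)                  # equally spaced on means
smallest_sd = pick_variables(columns, statistics.pstdev, 2, "smallest")
zero_variation = [name for name, v in columns.items() if statistics.pstdev(v) == 0]
```

Sorting on standard deviation and looking at the smallest values is one way to count the 0-variation variables noted above.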

Marg. Dist. Plot Data Example Drug Discovery – Sort on Means Investigate Weird -999 Values Note: Sometimes Such Values Are Used To Code Missing Values (And Nobody Remembers to Say So)

Marg. Dist. Plot Data Example Drug Discovery – Sort on Means Investigate Weird -999 Values, Via Mean Look at Smallest Mean Values (All Dashed Bars On Left)

Marg. Dist. Plot Data Example Drug Discovery – Sort on Means Investigate Weird -999 Values, Via Mean, Smallest 6 Variables Are All -999

Marg. Dist. Plot Data Example Drug Discovery – Sort on Means Investigate Weird -999 Values, Via Mean, Smallest 6 Variables Are All -999 Others Have Some -999

Marg. Dist. Plot Data Example Drug Discovery Focus on -999 Values, Using Minimum, Equally Spaced Not Too Many Such Variables

Marg. Dist. Plot Data Example Drug Discovery Focus More on -999 Values, Using Minimum, Smallest 15 (Again All Dashed Bars On Left)
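A small sketch of screening for the -999 values directly, rather than by eye. It assumes, as the slides suggest but do not confirm for this data set, that -999 is a code for missing values; the function name and toy columns are hypothetical.

```python
MISSING_CODE = -999   # value seen on the slides; treating it as "missing" is an assumption

def missing_code_report(columns, code=MISSING_CODE):
    """For each variable containing the code, report how often it occurs."""
    report = {}
    for name, values in columns.items():
        count = sum(1 for v in values if v == code)
        if count:
            report[name] = {
                "count": count,
                "all": count == len(values),   # True when the variable is entirely the code
                "minimum": min(values),
            }
    return report

columns = {                      # hypothetical descriptors
    "x1": [-999, -999, -999],    # entirely coded
    "x2": [1.2, -999, 3.4],      # partly coded
    "x3": [0.5, 0.7, 0.9],       # clean
}

report = missing_code_report(columns)
fully_coded = [name for name, r in report.items() if r["all"]]
partly_coded = [name for name, r in report.items() if not r["all"]]
```

Sorting on the minimum, as on the slide, surfaces the same variables, since any variable containing -999 has a minimum of at most -999.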

Marg. Dist. Plot Data Example

Drug Discovery - Full Data PCA from Above (Include To Show Contrast)

Marg. Dist. Plot Data Example Drug Discovery - PCA After Variable Deletion Looks Very Similar!

Marg. Dist. Plot Data Example Drug Discovery - PCA After Variable Deletion Makes Sense For 0-Var Variables

Marg. Dist. Plot Data Example Drug Discovery - PCA After Variable Deletion -999s Have No Impact Since Others Much Bigger!
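Why deleting 0-variance variables leaves the PCA unchanged can be checked numerically. In this sketch (simulated data with made-up scales, not the drug discovery data), constant columns become all-zero after centering and so contribute nothing to any PC.

```python
import numpy as np

def pc_variances(X):
    """Variances along the PC directions of an (n, d) data matrix."""
    Xc = X - X.mean(axis=0)
    return np.linalg.svd(Xc, compute_uv=False) ** 2 / (X.shape[0] - 1)

rng = np.random.default_rng(3)
n = 100
informative = rng.normal(scale=[50.0, 20.0, 5.0], size=(n, 3))   # hypothetical variables
constant = np.full((n, 2), 7.0)                                  # two 0-variance variables
X_full = np.hstack([informative, constant])

keep = X_full.std(axis=0) > 0          # drop variables with no variation
X_reduced = X_full[:, keep]

var_full = pc_variances(X_full)
var_reduced = pc_variances(X_reduced)
```

The leading entries of `var_full` agree with `var_reduced`, and the remaining entries are numerically zero. The -999 variables behave differently in principle (they do vary), but their scale is small next to the very large variables, which is the slide's explanation for their lack of impact.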

Marg. Dist. Plot Data Example Drug Discovery - After Variable Deletion Sort Marginals On Means, Equally Spaced Still Have Massive Variation In Types

Marg. Dist. Plot Data Example Drug Discovery - After Variable Deletion Sort Marginals On Means, Equally Spaced Still Have Very Few That Are Very Big

Marg. Dist. Plot Data Example Drug Discovery - Sort Marginals On SDs, Equally Spaced To Study Variation Very Wide Range Standardization May Be Useful

Marg. Dist. Plot Data Example Drug Discovery - Sort Marginals On SDs, Equally Spaced To Study Variation Now See Some Very Small How Many?

Marg. Dist. Plot Data Example Drug Discovery - Sort Marginals On SDs, Smallest Several Very Small Variables Some Very Skewed???

Marg. Dist. Plot Data Example Drug Discovery - Sort On Skewness Wide Range (Consider Transformation) 0-1s Very Prominent

Marg. Dist. Plot Data Example

Drug Discovery - Sort On Number Unique Shows Many Binary

Marg. Dist. Plot Data Example Drug Discovery - Sort On Number Unique Shows Many Binary Few Truly Continuous

Marg. Dist. Plot Data Example Drug Discovery - Sort On Number of Most Frequent Shows Many Have a Very Common Value (Often 0)
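The two summaries used here are easy to compute per variable. A minimal sketch with hypothetical helper names and toy columns:

```python
from collections import Counter

def n_unique(values):
    """Number of distinct values taken by a variable."""
    return len(set(values))

def most_frequent_count(values):
    """How many times the most common value occurs."""
    return Counter(values).most_common(1)[0][1]

columns = {                                          # hypothetical descriptors
    "flag": [0, 1, 0, 0, 1, 0],                      # binary
    "count": [0, 0, 0, 2, 5, 0],                     # few values, mostly 0
    "weight": [1.5, 2.25, 3.1, 4.8, 5.2, 6.9],       # continuous
}

unique_counts = {name: n_unique(v) for name, v in columns.items()}
top_counts = {name: most_frequent_count(v) for name, v in columns.items()}
binary = [name for name, k in unique_counts.items() if k == 2]
non_binary = [name for name in columns if name not in binary]
```

Splitting on `binary` versus `non_binary` gives the two variable sets that could be analyzed separately by PCA.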

Marg. Dist. Plot Data Example

Drug Discovery - PCA on Binary Variables Interesting Structure?

Marg. Dist. Plot Data Example Drug Discovery - PCA on Binary Variables Interesting Structure? Clusters?

Marg. Dist. Plot Data Example Drug Discovery - PCA on Binary Variables Interesting Structure? Clusters? Stronger Red vs. Blue

Marg. Dist. Plot Data Example Drug Discovery - Binary Variables, Mean Sort Shows Many Are Mostly 0s

Marg. Dist. Plot Data Example

Drug Discovery - PCA on non-Binary Variables Focus Only on non-Binary Variables Interesting Structure?

Marg. Dist. Plot Data Example Drug Discovery - PCA on non-Binary Variables Focus Only on non-Binary Variables Interesting Structure? Suggests Subtypes

Marg. Dist. Plot Data Example Drug Discovery - PCA on non-Binary Variables Focus Only on non-Binary Variables Interesting Structure? Suggests Subtypes Reds Only?

Transformations Useful Method for Data Analysts Apply to Marginal Distributions (i.e. to Individual Variables) Idea: Put Data on Right Scale Common Example: o Data Orders of Magnitude Different o Log10 Puts Data on More Analyzable Scale

Transformations Example: Melanoma Data Analysis Miedema, et al (2012)

March 17, 2010, 80 Clinical diagnosis Background Introduction

Image Analysis of Histology Slides Goal Background Melanoma Image: Benign 1 in 75 North Americans will develop a malignant melanoma in their lifetime. Initial goal: Automatically segment nuclei. Challenge: Dense packing of nuclei. Ultimately: Cancer grading and patient survival. Image: melanoma.blogsome.com

Cutting Cell Clusters Cell Nuclei from Histology Segmentation With subdivision Without subdivision Not perfect, but improved over standard algorithm (rq. validation). Our segmentation Commercial seg.

Feature Extraction Features from Cell Nuclei Feature Extraction Extract various features based on color and morphology Example “high-level” concepts: Stain intensity Nuclear area Density of nuclei Regularity of nuclear shape

Original Image Features from Cell Nuclei Feature Extraction Conventional Nevus Superficial Spreading Melanoma

Restained Image Features from Cell Nuclei Feature Extraction Conventional Nevus Superficial Spreading Melanoma

Masked Region Features from Cell Nuclei Feature Extraction Conventional Nevus Superficial Spreading Melanoma Masks generated by pathologist

Segmented Nuclei Features from Cell Nuclei Feature Extraction Conventional Nevus Superficial Spreading Melanoma

Labeled Nuclei Features from Cell Nuclei Feature Extraction Conventional Nevus Superficial Spreading Melanoma

Nuclear Regions Features from Cell Nuclei Feature Extraction Conventional Nevus Superficial Spreading Melanoma Generated by growing nuclei out from boundary Used for various color and density features: Region Stain 2, Region Area Ratio, etc.

Delaunay Triangulation Features from Cell Nuclei Feature Extraction Conventional Nevus Superficial Spreading Melanoma Triangulation of nuclear centers Used for various density features: Mean Delaunay, Max. Delaunay, etc.

Transformations Statistical Challenge: Some Features Spread Over Several Orders of Magnitude Few Big Values Can Dominate Analysis

Transformations Few Large Values Can Dominate Analysis

Transformations Log Scale Gives All More Appropriate Weight Note: Orders Of Magnitude Log10 Most Interpretable
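A short numerical illustration of the effect, using a made-up feature (not the melanoma data) whose values span several orders of magnitude:

```python
import math

values = [2.0, 15.0, 300.0, 4000.0, 90000.0]     # hypothetical feature values

# On the raw scale the single largest value carries almost all of the total.
largest_share = max(values) / sum(values)

# On the log10 scale each unit is one order of magnitude.
log_values = [math.log10(v) for v in values]
orders_of_magnitude = max(log_values) - min(log_values)
```

Here `largest_share` is above 0.95, while on the log10 scale the same five values are spread fairly evenly over about 4.7 units.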

Transformations Another Feature with Large Values Problem For This Feature: 0 Values (no log)

Transformations

Much Nicer Distribution

Transformations Different Direction (Negative) of Skewness

Transformations Use Log Difference Transformation
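The slides do not give formulas for these two variants, so the following is only a sketch under stated assumptions: a shifted log, log10(x + shift), for a feature containing 0 values, and a reflected log, -log10(max + shift - x), as one plausible reading of a "log difference" transformation for negatively skewed data. The function names, the `shift` parameter, and the toy values are all hypothetical.

```python
import math

def shifted_log10(values, shift):
    """log10(x + shift): defined at x = 0 when shift > 0 (shift is an analyst's choice)."""
    return [math.log10(v + shift) for v in values]

def reflected_log10(values, shift):
    """-log10(max + shift - x): stretches a crowded upper end, preserving order."""
    top = max(values)
    return [-math.log10(top + shift - v) for v in values]

def skewness(values):
    """3rd standardized moment with the 1/n normalization."""
    mean = sum(values) / len(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

with_zeros = [0.0, 9.0, 99.0]                     # plain log fails at 0
left_skewed = [0.0, 90.0, 99.0, 99.9, 99.99]      # values crowd near the maximum

logged = shifted_log10(with_zeros, 1.0)
reflected = reflected_log10(left_skewed, 0.01)
skew_before, skew_after = skewness(left_skewed), skewness(reflected)
```

In this toy case the skewness moves from strongly negative to essentially 0. How to choose the shift is a separate question that the sketch does not address.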

Yeast Cell Cycle Data Another Example Showing Interesting Directions Beyond PCA Exploratory Data Analysis

Yeast Cell Cycle Data “Gene Expression” – Microarray data Data (after major preprocessing): Expression “level” of: thousands of genes (d ~ 1,000s) but only dozens of “cases” (n ~ 10s) Interesting statistical issue: High Dimension Low Sample Size data (HDLSS)

Yeast Cell Cycle Data Data from: Spellman, et al (1998) Analysis here is from: Zhao, Marron & Wells (2004)

Yeast Cell Cycle Data Lab experiment: Chemically “synchronize cell cycles”, of yeast cells Do cDNA micro-arrays over time Used 18 time points, over “about 2 cell cycles” Studied 4,489 genes (whole genome) Time series view of data: have 4,489 time series of length 18 Functional Data View: have 18 “curves”, of dimension 4,489 What are the data objects?

Yeast Cell Cycle Data, FDA View Central question: Which genes are “periodic” over 2 cell cycles?

Yeast Cell Cycle Data, FDA View Periodic genes? Naïve approach: Simple PCA

Yeast Cell Cycle Data, FDA View Central question: which genes are “periodic” over 2 cell cycles? Naïve approach: Simple PCA No apparent (2 cycle) periodic structure? Eigenvalues suggest “variation” spread in many directions PCA finds “directions of maximal variation” Often, but not always, same as “interesting directions” Here need better approach to study periodicities

Yeast Cell Cycle Data, FDA View Approach Project on Period 2 Components Only Calculate via Fourier Representation To understand, study Fourier Basis
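A hedged sketch of projecting one gene's time series onto Fourier components. It assumes equally spaced time points and reads "Period 2 Components" as the sine and cosine that complete exactly two cycles over the 18 time points; the analysis in Zhao, Marron & Wells (2004) may use a different or larger set of components. The function name and the simulated series are illustrative.

```python
import math

def project_on_frequency(series, cycles):
    """Least-squares projection of a series on cos and sin with `cycles` full cycles."""
    n = len(series)
    angles = [2.0 * math.pi * cycles * t / n for t in range(n)]
    cos_b = [math.cos(a) for a in angles]
    sin_b = [math.sin(a) for a in angles]
    # For 0 < cycles < n/2 the two basis vectors are orthogonal with squared norm n/2.
    a = sum(x * c for x, c in zip(series, cos_b)) * 2.0 / n
    b = sum(x * s for x, s in zip(series, sin_b)) * 2.0 / n
    return [a * c + b * s for c, s in zip(cos_b, sin_b)]

n_times = 18
two_cycle = [3.0 * math.cos(2.0 * math.pi * 2 * t / n_times) for t in range(n_times)]
other = [5.0 + math.sin(2.0 * math.pi * 3 * t / n_times) for t in range(n_times)]
series = [p + q for p, q in zip(two_cycle, other)]     # hypothetical expression profile

periodic_part = project_on_frequency(series, cycles=2)
max_error = max(abs(p - q) for p, q in zip(periodic_part, two_cycle))
```

The projection recovers the two-cycle component and discards the constant level and the three-cycle term, because the discrete Fourier basis vectors are mutually orthogonal.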