Radial DWD Main Idea: Linear Classifiers Are Good When Each Class Lives in a Distinct Region, Hard When All Members of One Class Are Outliers in a Random Direction

Presentation transcript:

Radial DWD Main Idea: Linear Classifiers Are Good When Each Class Lives in a Distinct Region, Hard When All Members of One Class Are Outliers in a Random Direction

Radial DWD: Hard When All Members of One Class Are Outliers in a Random Direction. Think: the HDLSS Version of …

Virus Hunting: Look at the Data. Note: Heights Are Orders of Magnitude Different (−5 ≪ −3)

Radial DWD Thought Model: + Class Near the Center, − Class at Random Corners. Note: ∃ Many (d) Corners
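
A minimal simulation sketch of this thought model (illustrative only; the function name, dimension, and noise level are assumptions, not from the slides): the + class sits near the origin, each − point at a randomly chosen corner of the cube.

```python
import numpy as np

rng = np.random.default_rng(0)

def radial_dwd_toy(n_pos=50, n_neg=50, d=1000, noise=0.1):
    """Toy data: + class near the center, - class at random corners of {-1,+1}^d."""
    X_pos = noise * rng.standard_normal((n_pos, d))        # near the center
    corners = rng.choice([-1.0, 1.0], size=(n_neg, d))     # random corners
    X_neg = corners + noise * rng.standard_normal((n_neg, d))
    return np.vstack([X_pos, X_neg]), np.array([+1] * n_pos + [-1] * n_neg)

X, y = radial_dwd_toy()
# With 2^d possible corners and only n_neg training points, a fresh test
# point almost surely lands in an unseen corner -- the difficulty above.
```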

Radial DWD: Consider the Training Data. But What About a Test Point in a Different Corner???

Virus Hunting: Compare With Other Classifiers. Even Kernel SVM Fails!!! Reason: Test Data Lie in Corners With No Training Data

Radial DWD: Verify With a Simulation Study. Note: Very Similar Behavior!

Melanoma Data Recall From 9/5/17: Studied Transformation Of This Data

Melanoma Data Rotate To DWD Direction “Good” Separation ???

Melanoma Data Rotate To DWD & Ortho PCs Better Separation Than Full Data???

ROC Curve Challenge: Measure "Goodness of Separation". Approach from Signal Detection: the Receiver Operating Characteristic (ROC) Curve. Developed in WWII; history in Green and Swets (1966)

ROC Curve Summarize & Compare Using Area Under Curve (AUC)
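
As a hedged sketch of the AUC computation (plain numpy; the function names and toy scores are illustrative, not from the slides): sweep a threshold down through the classifier scores, accumulate true and false positive rates, and integrate.

```python
import numpy as np

def roc_curve(scores, labels):
    """ROC points from scores (higher = more positive) and 0/1 labels."""
    order = np.argsort(-scores)                 # descending by score
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels == 1)                 # true positives per cutoff
    fp = np.cumsum(labels == 0)                 # false positives per cutoff
    tpr = np.concatenate([[0.0], tp / max(tp[-1], 1)])
    fpr = np.concatenate([[0.0], fp / max(fp[-1], 1)])
    return fpr, tpr

def auc(fpr, tpr):
    # Trapezoid rule, written out to stay version-independent
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Scores could be, e.g., projections onto the DWD direction.
fpr, tpr = roc_curve(np.array([0.9, 0.8, 0.3, 0.1]), np.array([1, 1, 0, 0]))
print(auc(fpr, tpr))                            # 1.0: perfectly separated toy
```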

ROC Curve Interpretation of AUC: Very Context Dependent. Radiology: "> 70% has Predictive Usefulness". Varies Field by Field. Bigger Is Better

Melanoma Data Recall Question: Which Gives Better Separation of Melanoma vs. Nevi? DWD on All Melanoma vs. All Nevi, or DWD on Melanoma 1 vs. Severely Dysplastic Nevi?

Melanoma Data Full Data ROC Analysis AUC = 0.93

Melanoma Data SubClass ROC Analysis AUC = 0.95 Better, Makes Intuitive Sense

Clusters in Data. Common Statistical Task: Find Clusters in Data. Interesting sub-populations? Important structure in data? How to do this? PCA & visualization is a very simple approach. There is a large literature of other methods (will study more later)

PCA to Find Clusters. Main Lesson: PCA is limited at finding clusters. Revealed by NCI60 data: recall microarray data with 8 known different cancer types. PCA does not find all 8 clusters. Recall PCA only finds directions of maximal variation; it does not use class label information

PCA to Find Clusters. A deeper example: Mass Flux Data. Data from Enrica Bellone, National Center for Atmospheric Research. Mass flux quantifies cloud types: how does mass change when moving into a cloud?

PCA to find clusters PCA of Mass Flux Data:

PCA to Find Clusters. Summary of PCA of Mass Flux Data: Mean: captures the general mountain shape. PC1: generally the overall height of the peak; shows up nicely in the mean ± plot (2nd col). 3 apparent clusters in the scores plot. Are those "really there"? If so, could lead to interesting discovery; if not, could waste investigative effort

PCA to Find Clusters. Summary of PCA of Mass Flux Data: PC2: location of the peak; again the mean ± plot is very useful here. PC3: width adjustment; again seen most clearly in the mean ± plot. Maybe non-linear modes of variation???

PCA to Find Clusters. Return to Investigation of PC1 Clusters: Can see 3 bumps in the smooth histogram. Main Question: Important structure or sampling variability? Approach: SiZer (SIgnificance of ZERo crossings of derivatives)

Statistical Smoothing In 1 Dimension (Numbers as Data Objects)

Statistical Smoothing In 1 Dimension, 2 Major Settings: Density Estimation ("Histograms"); Nonparametric Regression ("Scatterplot Smoothing")

Density Estimation, e.g. Hidalgo Stamp Data: Thicknesses of postage stamps produced in Mexico over ~70 years during the 1800s. Paper produced in several factories? How many factories? (Records lost.) Brought to the statistical literature by Izenman and Sommer (1988)

Density Estimation, e.g. Hidalgo Stamp Data: Thicknesses of postage stamps produced in Mexico over ~70 years during the 1800s. Paper produced in several factories? (Thicknesses vary by up to a factor of 2)

Density Estimation E.g. Hidalgo Stamp Data A histogram “Oversmoothed” Bin Width too large? Miss important structure?

Density Estimation E.g. Hidalgo Stamp Data Another histogram Smaller binwidth Suggests 2 modes? 2 factories making the paper?

Density Estimation E.g. Hidalgo Stamp Data Another histogram Smaller binwidth Suggests 6 modes? 6 factories making the paper?

Density Estimation E.g. Hidalgo Stamp Data Another histogram Even smaller binwidth Suggests many modes? Really believe modes are “there”? Or just sampling variation?
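
A small sketch of this binwidth exploration (illustrative code, not the original analysis; the synthetic mixture below merely stands in for the stamp thicknesses): count local maxima of the bin counts as the binwidth shrinks.

```python
import numpy as np

def histogram_modes(data, binwidth):
    """Count local maxima of a histogram with the given binwidth."""
    edges = np.arange(data.min(), data.max() + binwidth, binwidth)
    counts, _ = np.histogram(data, bins=edges)
    # A mode: a bin strictly higher than both of its neighbors.
    return sum(counts[j - 1] < counts[j] > counts[j + 1]
               for j in range(1, len(counts) - 1))

# Synthetic stand-in for stamp thicknesses (mm): a 3-component mixture.
data = np.random.default_rng(1).normal([0.08, 0.10, 0.12], 0.005, (200, 3)).ravel()
for bw in [0.02, 0.01, 0.005, 0.002]:
    print(f"binwidth {bw}: {histogram_modes(data, bw)} modes")
```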

Density Estimation, e.g. Hidalgo Stamp Data: Critical issue for histograms: choice of binwidth (well understood?)

Histograms: A less well understood issue: choice of bin location. Major impact on the number of modes (2 to 7), all for the same binwidth

Histograms: Choice of bin location: what is going on? Compare with the Average Histogram

Density Estimation: Compare shifts with the Average Histogram. For the 7-mode shift, peaks line up with bin centers, so the shifted histograms find the peaks

Density Estimation: Compare shifts with the Average Histogram. For the 2 (3?) mode shift, peaks split between bins, so the shifted histograms miss the peaks

Density Estimation: Histogram drawbacks: need to choose bin width; need to choose bin location. But the Average Histogram reveals structure, so use that instead of the histogram. Name: Kernel Density Estimate
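
A sketch of the averaged shifted histogram idea (an illustrative implementation, not the slides' code; function and parameter names are assumptions): average histograms over several bin-origin shifts.

```python
import numpy as np

def averaged_shifted_histogram(data, binwidth, n_shifts=16, n_grid=400):
    """Average n_shifts histograms whose bin origins differ by binwidth/n_shifts."""
    grid = np.linspace(data.min() - binwidth, data.max() + binwidth, n_grid)
    estimate = np.zeros_like(grid)
    for k in range(n_shifts):
        origin = data.min() - binwidth + k * binwidth / n_shifts
        edges = np.arange(origin, data.max() + binwidth, binwidth)
        heights, _ = np.histogram(data, bins=edges, density=True)
        inside = (grid >= edges[0]) & (grid < edges[-1])
        idx = np.clip(np.searchsorted(edges, grid, side="right") - 1,
                      0, len(heights) - 1)
        estimate += np.where(inside, heights[idx], 0.0)   # this shift's height
    return grid, estimate / n_shifts
```

In the limit of many shifts this recovers a kernel density estimate with a triangular kernel, which motivates the smooth kernels introduced next.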

Kernel Density Estimation: Chondrite Data: Stony (metal) meteorites that hit the earth, so we have a chunk of rock. Study the "% of silica": from how many sources? Only 22 rocks… is a histogram hopeless? Brought to the statistical literature by Good and Gaskins (1980)

Kernel Density Estimation Chondrite Data: Represent points by red bars Where are data “more dense”?

Kernel Density Estimation Chondrite Data: Put probability mass 1/n at each point Smooth piece of “density”

Kernel Density Estimation Chondrite Data: Sum pieces to estimate density Suggests 3 modes (rock sources)

Kernel Density Estimation, Mathematical Notation:
$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i)$$
where the window shape is given by the "kernel", $K_h(\cdot) = \frac{1}{h} K\!\left(\frac{\cdot}{h}\right)$, and the window width by the "bandwidth", $h$
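
A direct numpy transcription of this formula, as a hedged sketch (Gaussian kernel, anticipating the personal choice below; the function name is illustrative):

```python
import numpy as np

def kde(x_grid, data, h):
    """f_h(x) = (1/n) * sum_i K_h(x - X_i), Gaussian K, K_h(u) = K(u/h)/h."""
    u = (x_grid[:, None] - data[None, :]) / h        # (grid, n) scaled gaps
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)     # standard normal kernel
    return K.mean(axis=1) / h                        # average the K_h pieces

# Each data point contributes a smooth bump of mass 1/n, summed into the
# estimate -- exactly the Chondrite-data picture on the previous slides.
```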

Kernel Density Estimation, Mathematical Notation: $\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i)$. This was used in the PCA graphics

Kernel Density Estimation: Choice of kernel (window shape)? Controversial issue: want computational speed? Want statistical efficiency? Want smooth estimates? There is more, but personal choice: Gaussian. Good overall reference: Wand and Jones (1994)

Kernel Density Estimation Choice of bandwidth (window width)? Very important to performance Fundamental Issue: Which modes are “really there”?

Density Estimation How to use histograms if you must: Undersmooth (minimizes bin edge effect) Human eye is OK at “post-smoothing”

Statistical Smoothing, 2 Major Settings: Density Estimation ("Histograms"); Nonparametric Regression ("Scatterplot Smoothing")

Scatterplot Smoothing, e.g. Bralower Fossil Data (Prof. of Geosciences, Penn. State Univ.)

Scatterplot Smoothing, e.g. Bralower Fossil Data: Study global climate on a time scale of millions of years. Data points are fossil shells, dated by the surrounding material. Ratio of isotopes of Strontium (differences in the 4th decimal point!), a surrogate for sea level (Ice Ages). Data seem to have structure…

Scatterplot Smoothing E.g. Bralower Fossil Data

Scatterplot Smoothing E.g. Bralower Fossil Data Way to bring out structure: Smooth the data Methods of smoothing? Local Averages Splines (several types) Fourier – trim high frequencies Other bases … Also controversial

Scatterplot Smoothing E.g. Bralower Fossil Data – some smooths

Scatterplot Smoothing: A simple approach: local averages. Given data $(X_1, Y_1), \ldots, (X_n, Y_n)$, model in regression form: $E[Y_i] = m(X_i)$, $i = 1, \ldots, n$. How to estimate $m(x)$?

Scatterplot Smoothing: A simple approach: local averages. Given a kernel window function $K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right)$, estimate the curve $m(x)$ by a $K_h$-weighted local average:
$$\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K_h(x - X_i)\, Y_i}{\sum_{i=1}^{n} K_h(x - X_i)}$$
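
A minimal sketch of this kernel-weighted local average (the Nadaraya-Watson form; names are illustrative):

```python
import numpy as np

def local_average(x_grid, X, Y, h):
    """m_h(x) = sum_i K_h(x - X_i) Y_i / sum_i K_h(x - X_i), Gaussian K."""
    u = (x_grid[:, None] - X[None, :]) / h
    W = np.exp(-0.5 * u**2)            # the 1/h factor in K_h cancels here
    return (W @ Y) / W.sum(axis=1)     # weighted average of the Y_i at each x
```

Note that the kernel's $\frac{1}{h}$ normalization cancels in the ratio; the next slide shows this same estimate is a local constant least squares fit.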

Scatterplot Smoothing: An interesting representation of the local average: given kernel window weights $K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right)$, it is the local constant least squares fit to the data:
$$\hat{m}_h(x) = \operatorname*{argmin}_{a} \sum_{i=1}^{n} K_h(x - X_i)\,(Y_i - a)^2$$

Scatterplot Smoothing: Local Constant Fits (visually): a "Moving Average". Window width $h$ is critical (as for the kernel density estimate)

Scatterplot Smoothing: An interesting variation: the local linear fit to the data. Given kernel window weights $K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right)$, fit a line locally, centered at $x$ so that the intercept is the estimate:
$$\hat{m}_h(x) = \hat{a}, \qquad (\hat{a}, \hat{b}) = \operatorname*{argmin}_{(a,\,b)} \sum_{i=1}^{n} K_h(x - X_i)\,\big(Y_i - (a + b\,(X_i - x))\big)^2$$
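
A sketch of the local linear smoother (illustrative; centering at $x$ makes the fitted intercept the estimate, matching the "intercept of moving fit line" picture below):

```python
import numpy as np

def local_linear(x_grid, X, Y, h):
    """At each x: weighted least squares fit of a + b*(X - x); return a-hat."""
    out = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        w = np.sqrt(np.exp(-0.5 * ((x - X) / h) ** 2))   # root kernel weights
        A = np.column_stack([np.ones_like(X), X - x])    # design: [1, X - x]
        coef, *_ = np.linalg.lstsq(A * w[:, None], Y * w, rcond=None)
        out[j] = coef[0]                                 # intercept = m_h(x)
    return out
```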

Scatterplot Smoothing: Local Linear Fits (visually): the "Intercept of the Moving Fit Line". Window width $h$ is critical (as for the kernel density estimate)

Scatterplot Smoothing: Another variation: the "Intercept of a Moving Polynomial Fit". Window width $h$ is critical (as for the kernel density estimate)

Scatterplot Smoothing: Local Polynomial Smoothing: What is the best polynomial degree? Once again controversial… advocates exist for all of 0, 1, 2, 3. Depends on personal weighting of the factors involved. Good reference: Fan & Gijbels (1995). Personal choice: degree 1, local linear

Scatterplot Smoothing E.g. Bralower Fossils – local linear smooths

Scatterplot Smoothing: Smooths of the Bralower Fossil Data: Oversmoothed misses structure. Undersmoothed responds to sampling noise? About right shows 2 valleys: one seems clear; is the other one really there? Same question as above…

Kernel Density Estimation Choice of bandwidth (window width)? Very important to performance Fundamental Issue: Which modes are “really there”?

Kernel Density Estimation Choice of bandwidth (window width)? Very important to performance Data Based Choice? Controversial Issue Many recommendations Suggested Reference: Jones, Marron & Sheather (1996) Never a consensus…

Kernel Density Estimation: Choice of bandwidth (window width)? Alternate choice: consider all of them! I.e., look at the whole spectrum of smooths. Can see different structure at different smoothing levels. Connection to Scale Space. E.g. Stamps data: how many modes? All answers are there…
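
A self-contained sketch of this "consider them all" idea (illustrative; synthetic two-bump data stand in for the stamps): one Gaussian KDE per bandwidth on a logarithmic grid, tracking how the mode count changes with scale.

```python
import numpy as np

def kde(x_grid, data, h):                 # Gaussian KDE, as sketched earlier
    u = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def mode_count(f):
    """Count local maxima of a discretized curve."""
    return int(np.sum((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])))

data = np.random.default_rng(2).normal([0.0, 3.0], 1.0, (150, 2)).ravel()
x_grid = np.linspace(data.min() - 1, data.max() + 1, 400)
for h in np.geomspace(0.05, 2.0, 8):      # the whole spectrum of smooths
    print(f"h = {h:.2f}: {mode_count(kde(x_grid, data, h))} modes")
```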

Kernel Density Estimation

Statistical Smoothing Fundamental Question, for both Density Estimation ("Histograms") and Regression ("Scatterplot Smoothing"): Which bumps are "really there", vs. "artifacts of sampling noise"?

SiZer Background: Scale Space, an idea from Computer Vision. Goal: Teach Computers to "See". Modern Research: Extract "Information" from Images. Early theoretical work…

SiZer Background: Scale Space, an idea from Computer Vision. Conceptual basis: Oversmoothing = "view from afar" (macroscopic); Undersmoothing = "zoomed-in view" (microscopic). Main idea: all smooths contain useful information, so study the "full spectrum" (i.e., all smoothing levels). Recommended reference: Lindeberg (1994)

Participant Presentations: Brendan Brown, Chen Shen, Wesley Hamilton. Topological Data Analysis