Object Orie’d Data Analysis, Last Time Cornea Data & Robust PCA –Elliptical PCA Big Picture PCA –Optimization View –Gaussian Likelihood View –Correlation.

Slides:



Advertisements
Similar presentations
Sales Forecasting using Dynamic Bayesian Networks Steve Djajasaputra SNN Nijmegen The Netherlands.
Advertisements

Very simple to create with each dot representing a data value. Best for non continuous data but can be made for and quantitative data 2004 US Womens Soccer.
Introduction to Non Parametric Statistics Kernel Density Estimation.
Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When.
Toyota InfoTechnology Center U.S.A, Inc. 1 Mixture Models of End-host Network Traffic John Mark Agosta, Jaideep Chandrashekar, Mark Crovella, Nina Taft.
Further Maths Hello 2014 Univariate Data. What is data? Data is another word for information. By studying data, we are able to display the information.
Mr F’s Maths Notes Statistics 8. Scatter Diagrams.
The Digital Image.
Introduction to Behavioral Statistics Measurement The assignment of numerals to objects or events according to a set of rules. The rules used define.
Chapter 3: Central Tendency. Central Tendency In general terms, central tendency is a statistical measure that determines a single value that accurately.
Mean-shift and its application for object tracking
Quantitative Skills 1: Graphing
FUNCTIONS AND MODELS 1. In this section, we assume that you have access to a graphing calculator or a computer with graphing software. FUNCTIONS AND MODELS.
1 Introduction to Dijet Resonance Search Exercise John Paul Chou, Eva Halkiadakis, Robert Harris, Kalanand Mishra and Jason St. John CMS Data Analysis.
THE MANAGEMENT AND CONTROL OF QUALITY, 5e, © 2002 South-Western/Thomson Learning TM 1 Chapter 9 Statistical Thinking and Applications.
Object Orie’d Data Analysis, Last Time Statistical Smoothing –Histograms – Density Estimation –Scatterplot Smoothing – Nonpar. Regression SiZer Analysis.
Object Orie’d Data Analysis, Last Time Cornea Data & Robust PCA –Elliptical PCA Big Picture PCA –Optimization View –Gaussian Likelihood View –Correlation.
Robust PCA Robust PCA 3: Spherical PCA. Robust PCA.
TYPES OF STATISTICAL METHODS USED IN PSYCHOLOGY Statistics.
Visual Displays of Data Chapter 3. Uses of Graphs Positive and negative uses – Can accurately and succinctly present information – Can reveal/conceal.
Object Orie’d Data Analysis, Last Time Cornea Data –Images (on the disk) as data objects –Zernike basis representations Outliers in PCA (have major influence)
Graphing Data Box and whiskers plot Bar Graph Double Bar Graph Histograms Line Plots Circle Graphs.
SiZer Background Scale Space – Idea from Computer Vision Goal: Teach Computers to “See” Modern Research: Extract “Information” from Images Early Theoretical.
More Examples: There are 4 security checkpoints. The probability of being searched at any one is 0.2. You may be searched more than once in total and all.
Return to Big Picture Main statistical goals of OODA: Understanding population structure –Low dim ’ al Projections, PCA … Classification (i. e. Discrimination)
Administrative Matters Midterm II Results Take max of two midterm scores:
Stat 31, Section 1, Last Time Course Organization & Website What is Statistics? Data types.
1 Hailuoto Workshop A Statistician ’ s Adventures in Internetland J. S. Marron Department of Statistics and Operations Research University of North Carolina.
1 UNC, Stat & OR Hailuoto Workshop Object Oriented Data Analysis, I J. S. Marron Dept. of Statistics and Operations Research, University of North Carolina.
Simple Inference in Exploratory Data Analysis: SiZer J. S. Marron Operations Research and Indust. Eng. Cornell University & Department of Statistics University.
GWAS Data Analysis. L1 PCA Challenge: L1 Projections Hard to Interpret (i.e. Little Data Insight) Solution: 1)Compute PC Directions Using L1 2)Compute.
The Normal Approximation for Data. History The normal curve was discovered by Abraham de Moivre around Around 1870, the Belgian mathematician Adolph.
Statistical Smoothing In 1 Dimension (Numbers as Data Objects)
Describing Data Week 1 The W’s (Where do the Numbers come from?) Who: Who was measured? By Whom: Who did the measuring What: What was measured? Where:
Week 2 Normal Distributions, Scatter Plots, Regression and Random.
1 C.A.L. Bailer-Jones. Machine Learning. Data exploration and dimensionality reduction Machine learning, pattern recognition and statistical data modelling.
Statistical Smoothing
Last Time Proportions Continuous Random Variables Probabilities
Different Types of Data
SiZer Background Finance "tick data":
IE469 Industrial Applications of Operations Research
Jean Ballet, CEA Saclay GSFC, 31 May 2006 All-sky source search
The Cepstral Bend on Fourier
Radial DWD Main Idea: Linear Classifiers Good When Each Class Lives in a Distinct Region Hard When All Members Of One Class Are Outliers in a Random Direction.
(will study more later)
Topography of Head Direction Cells in Medial Entorhinal Cortex
ID1050– Quantitative & Qualitative Reasoning
Sensitivity of RNA‐seq.
Predicting Every Spike
Araceli Ramirez-Cardenas, Maria Moskaleva, Andreas Nieder 
Efficient Receptive Field Tiling in Primate V1
Two-Dimensional Substructure of MT Receptive Fields
Volume 96, Issue 5, Pages e4 (December 2017)
Complementary Roles for Primate Frontal and Parietal Cortex in Guarding Working Memory from Distractor Stimuli  Simon Nikolas Jacob, Andreas Nieder  Neuron 
Volume 111, Issue 2, Pages (July 2016)
How to Start This PowerPoint® Tutorial
A Switching Observer for Human Perceptual Estimation
Volume 71, Issue 4, Pages (August 2011)
Lisa M. Fenk, Andreas Poehlmann, Andrew D. Straw  Current Biology 
A Switching Observer for Human Perceptual Estimation
Origin and Function of Tuning Diversity in Macaque Visual Cortex
Testing the Fit of a Quantal Model of Neurotransmission
Serial, Covert Shifts of Attention during Visual Search Are Reflected by the Frontal Eye Fields and Correlated with Population Oscillations  Timothy J.
Ethan S. Bromberg-Martin, Okihide Hikosaka  Neuron 
by Kenneth W. Latimer, Jacob L. Yates, Miriam L. R
Encoding of Stimulus Probability in Macaque Inferior Temporal Cortex
Volume 23, Issue 21, Pages (November 2013)
Hippocampal-Prefrontal Theta Oscillations Support Memory Integration
Efficient Receptive Field Tiling in Primate V1
Volume 16, Issue 4, Pages (July 2016)
Presentation transcript:

Object Orie’d Data Analysis, Last Time Cornea Data & Robust PCA –Elliptical PCA Big Picture PCA –Optimization View –Gaussian Likelihood View –Correlation PCA Finding Clusters with PCA –Can be useful –But can also miss some

Participant Presentations 36 participants ~ 15 minutes each 4 per class meeting Requires 9 class meetings … Have revised Schedule on Class Web PageClass Web Page Please sign up ( ) for preferred time (1 st come, 1 st served) Please help for next week (if you can just pull something off the shelf)

PCA to find clusters PCA of Mass Flux Data:

PCA to find clusters Return to Investigation of PC1 Clusters: Can see 3 bumps in smooth histogram Main Question: Important structure or sampling variability? Approach: SiZer (SIgnificance of ZERo crossings of deriv.)

SiZer Background Two Major Settings: 2-d scatterplot smoothing 1-d histograms (continuous data, not discrete bar plots) Central Question: Which features are really there? Solution, Part 1: Scale space Solution, Part 2: SiZer

SiZer Background Bralower ’ s Fossil Data - Global Climate

SiZer Background Smooths - Suggest Structure - Real?

SiZer Background Smooths of Fossil Data (details given later) Dotted line: undersmoothed (feels sampling variability) Dashed line: oversmoothed (important features missed?) Solid line: smoothed about right? Central question: Which features are “ really there ” ?

SiZer Background Smoothing Setting 2: Histograms Family Income Data: British Family Expenditure Survey For the year 1975 Distribution of Family Incomes ~ 7000 families

SiZer Background Family Income Data, Histogram Analysis:Histogram Analysis Again under- and over- smoothing issues Perhaps 2 modes in data? Histogram Problem 1: Binwidth (well known) Central question: Which features are “ really there ” ? e.g. 2 modes? Same problem as existence of “ clusters ” in PCA

SiZer Background Why not use (conventional) histograms? Histogram Problem 2: Bin shiftBin shift (less well known) For same binwidth Get much different impression By only “ shifting grid location “ Get it right by chance?

SiZer Background Why not use (conventional) histograms? Solution to binshift problem: Average over all shifts 1st peak all in one bin: bimodal 1st peak split between bins: unimodal Smooth histo ’ m provides understanding, So should use for data analysis Another name: Kernel Density Estimate

SiZer Background Why not use (conventional) histograms? Solution to binshift problem: Average over all shifts 1st peak all in one bin: bimodal 1st peak split between bins: unimodal Smooth histo ’ m provides understanding, So should use for data analysis Another name: Kernel Density Estimate

SiZer Background Kernel density estimation Recommended Reference ( many books): Wand, M. P. and Jones, M. C. (1995) View 1: Smooth histogram View 2: Distribute probability mass, according to data E.g. Chondrite dataChondrite data (from how many sources?)

SiZer Background Kernel density estimation (cont.) Central Issue: width of window, i.e. “ bandwidth ”, E.g. Incomes data:Incomes data Controls critical amount of smoothing Old Approach: Data based bandwidth selection Recommended reference: Jones et al (1996) New Approach: "scale space" (look at all of them)

SiZer Background Scale Space – Idea from Computer Vision Conceptual basis: Oversmoothing = “ view from afar ” (macroscopic) Undersmoothing = “ zoomed in view ” (microscopic) Main idea: all smooths contain useful information, so study “ full spectrum ” (i. e. all smoothing levels) Recommended reference: Lindeberg (1994)

SiZer Background Fun Scale Spaces Views (of Family Incomes Data) Spectrum Movie

SiZer Background Fun Scale Spaces Views (Incomes Data) Spectrum Overlay

SiZer Background Fun Scale Spaces Views (Incomes Data) Surface View

SiZer Background Fun Scale Spaces Views (of Family Incomes Data) Note: The scale space viewpoint makes Data Dased Bandwidth Selection Much less important (than I once thought ….)

SiZer Background SiZer: Significance of Zero crossings, of the derivative, in scale space Combines: –needed statistical inference –novel visualization To get: a powerful exploratory data analysis method Main reference: Chaudhuri & Marron (1999)

SiZer Background Basic idea: a bump is characterized by: an increase followed by a decrease Generalization: Many features of interest captured by sign of the slope of the smooth Foundation of SiZer: Statistical inference on slopes, over scale space

SiZer Background SiZer Visual presentation: Color map over scale space: Blue: slope significantly upwards (derivative CI above 0) Red: slope significantly downwards (derivative CI below 0) Purple: slope insignificant (derivative CI contains 0)

SiZer Background SiZer analysis of Fossils data:Fossils data Upper Left: Scatterplot, family of smooths, 1 highlighted Upper Right: Scale space rep ’ n of family, with SiZer colors Lower Left: SiZer map, more easy to view Lower Right: SiCon map – replace slope by curvature Slider (in movie viewer) highlights different smoothing levels

SiZer Background SiZer analysis of Fossils data (cont.):Fossils data Oversmoothed (top of SiZer map): Decreases at left, not on right Medium smoothed (middle of SiZer map): Main valley significant, and left most increase Smaller valley not statistically significant Undersmoothed (bottom of SiZer map): “ noise wiggles ” not significant Additional SiZer color: gray - not enough data for inference

SiZer Background SiZer analysis of Fossils data (cont.):Fossils data Common Question: Which is right? Decreases on left, then flat (top of SiZer map) Up, then down, then up again (middle of SiZer map) No significant features (bottom of SiZer map) Answer: All are right Just different scales of view, i.e. levels of resolution of data

SiZer Background SiZer analysis of British Incomes data:British Incomes data Oversmoothed: Only one mode Medium smoothed: Two modes, statistically significant Confirmed by Schmitz & Marron, (1992) Undersmoothed: many noise wiggles, not significant Again: all are correct, just different scales

SiZer Background Historical Note & Acknowledgements: Scale Space: S. M. Pizer SiZer: Probal Chaudhuri Main Reference: Chaudhuri & Marron (1999)

SiZer Background Toy E.g. - Marron & Wand Trimodal #9 Increasing n Only 1 Signif ’ t Mode Now 2 Signif ’ t Modes Finally all 3 Modes

SiZer Background E.g. - Marron & Wand Discrete Comb #15 Increasing n Coarse Bumps Only Now Fine bumps too Someday: “ draw ” local Bandwidth on SiZer map

SiZer Background Finance "tick data":tick data (time, price) of single stock transactions Idea: "on line" version of SiZer for viewing and understanding trends Notes: trends depend heavily on scale double points and more background color transition (flop over at top)

SiZer Background Internet traffic data analysis: SiZer analysis of time series of packet times at internet hub (UNC) across very wide range of scales needs more pixels than screen allows thus do zooming view (zoom in over time) –zoom in to yellow bd ’ ry in next frame –readjust vertical axis

SiZer Background Internet traffic data analysis (cont.) Insights from SiZer analysis: Coarse scales: amazing amount of significant structure Evidence of self-similar fractal type process? Fewer significant features at small scales But they exist, so not Poisson process Poisson approximation OK at small scale??? Smooths (top part) stable at large scales?

Dependent SiZer Park, Marron, and Rondonotti (2004) SiZer compares data with white noise Inappropriate in time series Dependent SiZer compares data with an assumed model Visual Goodness of Fit test

Dep ’ ent SiZer : 2002 Apr 13 Sat 1 pm – 3 pm

Zoomed view (to red region, i.e. “ flat top ” )

Further Zoom: finds very periodic behavior!

Possible Physical Explanation IP “ Port Scan ” Common device of hackers Searching for “ break in points ” Send query to every possible (within UNC domain): –IP address –Port Number Replies can indicate system weaknesses Internet Traffic is hard to model